Tolerating intermittent hardware errors : characterization, diagnosis and recovery Rashid, Layali 2013


Tolerating Intermittent Hardware Errors: Characterization, Diagnosis and Recovery  by Layali Rashid  A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF  Doctor of Philosophy in THE FACULTY OF GRADUATE STUDIES (Electrical and Computer Engineering)  The University Of British Columbia (Vancouver) April 2013 c Layali Rashid, 2013  Abstract Over three decades of continuous scaling in CMOS technology has led to tremendous improvements in processor performance. At the same time, the scaling has led to an increase in the frequency of hardware errors due to high process variations, extreme operating conditions and manufacturing defects. Recent studies have found that 40% of the processor failures in real-world machines are due to intermittent hardware errors. Intermittent hardware errors are non-deterministic bursts of errors that occur in the same physical location. Intermittent errors have characteristics that are different from transient and permanent errors, which makes it challenging to devise efficient fault tolerance techniques for them. In this dissertation, we characterize the impact of intermittent hardware faults on programs using fault injection experiments at the micro-architecture level. We find that intermittent errors are likely to generate software visible effects when they occur. Based on our characterization results, we build intermittent error tolerance techniques with focus on error diagnosis and recovery. We first evaluate the impact of different intermittent error recovery scenarios on a processor’s performance and availability. We then propose DIEBA (Diagnose Intermittent hardware Errors in microprocessors by Backtracing Application), a software-based technique to diagnose the fault-prone functional units in a processor.  ii  Preface • A version of Chapter 3 has been published in the IEEE Pacific Rim Internationl Symposium on Dependable Computing in 2010 [61]. Also, a version of the same chapter has appeared in the IEEE Workshop on Dependable and Secure Nanocomputing in 2010 [60]. I setup all the experiments and wrote most of the paper. • A version of Chapter 4 has been published in the IEEE International Conference on Quantitative Evaluation of SysTems in 2012 [40]. I setup all the experiments and wrote most of the paper. • A version of Chapter 5 has appeared in the IEEE Workshop on Silicon Errors in Logic - System Effects in 2012 [62]. I setup the experiments and wrote most of the paper.  iii  Table of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  ii  Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  iii  Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  iv  List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  viii  List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  ix  Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  xii  1  Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  1  1.1  Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . .  5  1.2  Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . . . .  6  1.2.1  Chapter 2: Intermittent Hardware Faults: Background, Modeling and Fault Injection . . . . . . . . . . . . . . . . . .  1.2.2  Chapter 3: Characterizing the Impact of Intermittent Hardware Faults on Programs . . . . . . . . . . . . . . . . . .  
1.2.3  1.3 2  6  Chapter 4: Recovery from Intermittent Hardware Faults: Modeling and Evaluation . . . . . . . . . . . . . . . . . .  1.2.4  6  7  Chapter 5: DIEBA: Diagnosing Intermittent Errors by Backtracing Application Failures . . . . . . . . . . . . . . . .  7  Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  8  Intermittent Hardware Faults: Background, Modeling and Fault Injection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv  9  2.1  Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  9  2.2  Fault Models . . . . . . . . . . . . . . . . . . . . . . . . . . . .  13  2.2.1  Micro-architectural Fault Models . . . . . . . . . . . . .  13  2.2.2  System Level Fault Models . . . . . . . . . . . . . . . . .  14  Fault Injection Tool . . . . . . . . . . . . . . . . . . . . . . . . .  17  2.3.1  Fault Injection Implementation . . . . . . . . . . . . . . .  20  Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  20  2.3 2.4 3  Characterizing the Impact of Intermittent Hardware Faults on Programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  22  3.1  Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . .  24  3.1.1  Terms and Definitions . . . . . . . . . . . . . . . . . . .  24  3.1.2  Example . . . . . . . . . . . . . . . . . . . . . . . . . .  25  3.2  System Description . . . . . . . . . . . . . . . . . . . . . . . . .  27  3.3  Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  30  3.3.1  4  Impact of Intermittent Faults on How Programs Terminate, CDs and IPSs . . . . . . . . . . . . . . . . . . . . . . . .  30  3.3.2  Impact of Error Masking . . . . . . . . . . . . . . . . . .  32  3.3.3  Impact of Intermittent Fault Properties on the Severity of the Faults . . . . . . . . . . . . . . . . . . . . . . . . . .  35  3.4  Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  38  3.5  Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  39  Recovery from Intermittent Hardware Faults: Modeling and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  41  4.1  System Description . . . . . . . . . . . . . . . . . . . . . . . . .  42  4.1.1  Definitions and Assumptions . . . . . . . . . . . . . . . .  43  4.1.2  Overview of an Error-Tolerant Core . . . . . . . . . . . .  44  4.1.3  Recovery Scenarios . . . . . . . . . . . . . . . . . . . . .  45  4.1.4  SAN Models . . . . . . . . . . . . . . . . . . . . . . . .  47  4.2  Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . .  51  4.3  Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  54  4.3.1  58  Sensitivity Analysis . . . . . . . . . . . . . . . . . . . .  v  5  4.4  Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  61  4.5  Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  62  DIEBA: Diagnosing Intermittent Errors by Backtracing Application Failures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  64  5.1  The Role of Diagnosis in Error Tolerance . . . . . . . . . . . . .  66  5.2  Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  67  5.2.1  Overview . . . . . . . . . . . . . . . . . . . . . . . . . .  68  5.2.2  Background on Dynamic Dependency Graphs . . . . . . .  69  5.2.3  Design Choices and Assumptions . . . . . . . . . . . . .  70  DIEBA Implementation . . . . . . . . . . . . . . . . . . . . . . .  
71  5.3.1  Reconstruction of Program Execution . . . . . . . . . . .  71  5.3.2  The DIEBA Algorithm . . . . . . . . . . . . . . . . . . .  74  5.3.3  Example . . . . . . . . . . . . . . . . . . . . . . . . . .  76  5.4  System Description . . . . . . . . . . . . . . . . . . . . . . . . .  79  5.5  Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  82  5.5.1  Fault Injection Outcomes . . . . . . . . . . . . . . . . . .  82  5.5.2  DIEBA Accuracy . . . . . . . . . . . . . . . . . . . . . .  82  5.5.3  DIEBA Overhead . . . . . . . . . . . . . . . . . . . . . .  85  5.6  Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  90  5.7  Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  92  End-to-End Case Study . . . . . . . . . . . . . . . . . . . . . . . . .  93  6.1  System Description . . . . . . . . . . . . . . . . . . . . . . . . .  94  6.2  Overhead of Error Tolerance Techniques . . . . . . . . . . . . . .  95  6.3  Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  98  5.3  6  7  8  Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 7.1  Characterizations Studies . . . . . . . . . . . . . . . . . . . . . . 100  7.2  Recovery Techniques . . . . . . . . . . . . . . . . . . . . . . . . 101  7.3  Diagnosis Techniques . . . . . . . . . . . . . . . . . . . . . . . . 103  7.4  Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105  Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . 106 vi  8.1  Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107  8.2  Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108  Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110  vii  List of Tables Table 1.1  Terms and definitions. . . . . . . . . . . . . . . . . . . . . . .  3  Table 2.1  Description of different intermittent failure mechanisms. . . . .  11  Table 2.2  Intermittent faults’ causes and models. . . . . . . . . . . . . .  15  Table 2.3  Parameters used in system level fault models. . . . . . . . . . .  17  Table 2.4  Fault injection locations, the corresponding signal names in SimpleScalar pipeline diagram (Figure 2.2) and locations in SimpleScalar code . . . . . . . . . . . . . . . . . . . . . . . . . .  21  Table 3.1  Code fragment to illustrate IPS, CD, MNS and RR computation.  26  Table 3.2  Fault-injection parameters. . . . . . . . . . . . . . . . . . . .  29  Table 3.3  Simulator configuration parameters. . . . . . . . . . . . . . . .  29  Table 4.1  SAN model parameters. . . . . . . . . . . . . . . . . . . . . .  52  Table 5.1  Code fragment to illustrate the diagnosis algorithm. . . . . . .  77  Table 5.2  Backtracking paths for the example code. . . . . . . . . . . . .  79  Table 5.3  Fault injection parameters. . . . . . . . . . . . . . . . . . . . .  81  Table 5.4  Number/percentage of program crashes or error detections for each benchmark. . . . . . . . . . . . . . . . . . . . . . . . . .  83  Table 5.5  Crash distance in dynamic instructions for each benchmark. . .  89  Table 6.1  Error-tolerance techniques activated during different tolerance phases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  viii  99  List of Figures Figure 2.1  Fault model used in this dissertation. . . . . . . . . . . . . . .  Figure 2.2  SimpleScalar pipeline. Names in bold font represent injection  10  locations. . . . . . . . . . . . . . . . . . . . . . . . . . . . .  
21  Figure 3.1  Fault model used in characterizing faults. . . . . . . . . . . .  27  Figure 3.2  Intermittent fault impact on how programs terminate. The results are obtained with confidence interval of 95% and error margin of ±4% for crashes, ±2% for SDCs and ±4%for benign runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . .  31  Figure 3.3  Crash-distance ranges for crash-causing intermittent faults. . .  32  Figure 3.4  IPS-cardinality ranges for crash-causing intermittent faults. . .  32  Figure 3.5  Distribution of number of correct, faulty and re-used registers at crash time. The results are obtained with confidence interval of 95% and error margin of ±3% for correct registers, ±2% for re-used registers and ±2% for faulty registers. . . . . . . . . .  Figure 3.6  33  Distribution of number of masked and non-masked nodes at crash time. The results are obtained with confidence interval of 95% and error margin of ±3%. . . . . . . . . . . . . . . .  34  Figure 3.7  Effect of intermittent fault’s length on how programs terminate.  35  Figure 3.8  Effect of intermittent fault’s model on how programs terminate.  36  Figure 3.9  Effect of intermittent fault’s location on how integer programs terminate ( astar in this figure). . . . . . . . . . . . . . . . .  37  Figure 3.10 Effect of intermittent fault’s location on how FP programs terminate ( dealII in this figure). . . . . . . . . . . . . . . . . .  ix  37  Figure 4.1  An overview of an error-tolerant core with rollback-only recovery. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  45  Figure 4.2  An overview of an error-tolerant core with core reconfiguration. 46  Figure 4.3  An overview of an error-tolerant core with fine-grained reconfiguration. . . . . . . . . . . . . . . . . . . . . . . . . . . . .  47  Figure 4.4  Our SAN model for a fault-tolerant core. . . . . . . . . . . .  48  Figure 4.5  Fault model used in studying error recovery. . . . . . . . . . .  51  Figure 4.6  Useful work for the different recovery scenarios using three fault models. . . . . . . . . . . . . . . . . . . . . . . . . . .  Figure 4.7  Availability for the different recovery scenarios using three fault models. . . . . . . . . . . . . . . . . . . . . . . . . . .  Figure 4.8  55  Useful work for different ranks when permanently reconfiguring a defective component. . . . . . . . . . . . . . . . . . . .  Figure 4.9  55  56  Useful work for the rollback-only recovery using Weibull fault model with variable MTTF. . . . . . . . . . . . . . . . . . .  58  Figure 4.10 Availability for the rollback-only recovery using Weibull fault model with variable MTTF. . . . . . . . . . . . . . . . . . .  58  Figure 4.11 Useful work for the rollback-only recovery using Weibull fault model and variable MTTC. . . . . . . . . . . . . . . . . . . .  59  Figure 4.12 Availability for the rollback-only recovery using Weibull fault model and variable MTTC. . . . . . . . . . . . . . . . . . . .  60  Figure 4.13 Useful work for the rollback-only recovery using Weibull fault model and variable recovery duration. . . . . . . . . . . . . .  60  Figure 4.14 Availability for the rollback-only recovery using Weibull fault model and variable recovery duration. . . . . . . . . . . . . .  61  Figure 5.1  Overview of our diagnosis technique. . . . . . . . . . . . . .  68  Figure 5.2  An abstract example to show how to record signatures in different erroneous cases. . . . . . . . . . . . . . . . . . . . . .  Figure 5.3  73  The DDG constructed using code fragment in Table 5.1. 
Light grey nodes represent weak clues, while dark grey nodes repre-  Figure 5.4  sent strong clues. . . . . . . . . . . . . . . . . . . . . . . . .  77  Fault model used in studying error diagnosis. . . . . . . . . .  79  x  Figure 5.5  DIEBA accuracy for different defective units. . . . . . . . . .  84  Figure 5.6  DIEBA accuracy for different error active durations. . . . . .  87  Figure 5.7  DIEBA accuracy for different fault models. . . . . . . . . . .  88  Figure 6.1  Overview of ARM Cortex-A15 MPCore processor with two cores. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  94  Figure 6.2  End-to-end example of an error tolerant core. . . . . . . . . .  95  Figure 6.3  Performance overhead of different error tolerance phases when running Freqmine. . . . . . . . . . . . . . . . . . . . . . . .  Figure 6.4  96  Performance overhead of different error-tolerance phases when running Vips. . . . . . . . . . . . . . . . . . . . . . . . . . .  xi  97  Acknowledgments I would like to express my deepest appreciation to my mother Dr. Majd Abaza for always standing by me. I would also like to acknowledge the support of my advisors Professor Karthik Pattabiraman and Professor Sathish Gopalakrishnan. It has been an honor working with them. My appreciations also go to my committee member Professor Alan Hu, university examiners Professor Steve Wilton and Professor Ronald Garcia and external examiner Professor Johan Karlsson for their insightful comments and discussions. Finally, I would like to thank Natural Sciences and Engineering Research Council of Canada and the University of British Columbia for their financial support.  xii  Chapter 1  Introduction Reliability of a computer system is the ability to deliver correct data to the user even if the system is possibly built from error-prone software and hardware. A reliable system can either prevent an error from occurring or tolerate the error when it occurs by detecting data corruption and restoring program execution to the correct data state [2]. Reliability is one of the basic requirements for any user, although the criticality of this requirement varies depending on the user’s needs. Users of medical devices for example, expect the devices to be highly reliable and available as human lives depend on the generated data. Users of video players on the other hand, do not usually notice minor corruption of data. Nowadays, the revolution in information technology increases the demands for computer systems that have high performance and at the same time have low power and area overheads. To cope with this demand, the semi-conductor industry creates smaller transistors and doubles the number of transistors in the same area every one year and a half (Moore’s Law) [46]. However, small transistor sizes create more concerns about transistors reliability. This is because smaller transistors are more vulnerable to abnormal operating conditions, process variations and manufacturing defects [43]. Recent field studies have reported high numbers of hardware errors in commodity processors [29, 48]. Hardware faults1 are typically classified according to their frequency of oc1 We follow the standard fault, error, failure terminology in Avizienis et al. [8]. A fault is the physical defect in the device/circuit. A fault might lead to a corruption in data or a failure or it might  1  currence as transient, intermittent and permanent. Transient faults are one-time events that usually last for one cycle and are unlikely to recur in the same location. 
Transient faults are mainly caused by environmental conditions such as radiation. Permanent faults are reproducible and irreversible faults that occur in the same micro-architectural location. Permanent faults are mainly caused by transistor wearout that last for a long time due to persistent stress conditions such as extreme temperature or design defects [71]. Intermittent faults lie between the two extremes of transient and permanent faults. Intermittent faults occur sporadically at the same location (i.e., microarchitectural component), and persist for one or more (but finite number of) clock cycles [15]. They can occur due to variations in the manufacturing process [43], in-progress wearout and manufacturing residues [15]. Multiple studies (Borkar et al., Constantinescu and Wells et al.) have predicted that future processors will be more susceptible to intermittent faults and that the rates of these faults will increase due to technology scaling [12] [15] [81]. Nightingale et al. found that about 40% of real-world failures in a processor are caused by intermittent faults [48]. Hardware error prevention techniques reduce error occurrence by minimizing process variations [12], regulating voltage [20] or managing temperature [72]. Although these techniques reduce the base rate of intermittent faults, many faults still occur and escape to the software [81]. Hardware error tolerance techniques are usually achieved by using N-modular redundancy across computer nodes2 . The most famous form of this redundancy is Triple Modular Redundancy (TMR), where the same computation is repeated by three computer nodes and a voting mechanism is used to choose the correct data [42]. The main problem with TMR is that it incurs prohibitive energy and area overheads, which makes it unsuitable for deployment in commodity systems. The focus of this dissertation is to build and study efficient mechanisms to mitigate the impact of intermittent errors through detection, discrimination, diagnosis and recovery (we follow the standard definitions of the fault-tolerance terms - see be a benign fault. An error is the corruption of program data due to the fault. An error might initiate error tolerance techniques. A failure is the program crashing, generating erroneous results or hanging as a result of the error. 2 A node is a computer connected to a network.  2  Table 1.1). Table 1.1: Terms and definitions. Term Detect Discriminate Diagnose Recover  Definition to find that an error has affected a program such that the program’s state has been changed erroneously. to distinguish between transient, intermittent and permanent hardware errors. to isolate the physical location of an error, a core or a micro-architectural component in this dissertation. to remedy the effects of an error on program state by restoring the program’s state to the last saved checkpoint and possibly disabling the defective physical component of the processor.  Understanding the characteristics of intermittent faults is a critical step toward building error tolerance techniques. Prior work has analyzed the effects of transient faults [25, 36, 73] and permanent faults [33, 38] on programs. These analyses played a vital role in designing fault-tolerance techniques that mitigate the effects of transient and permanent errors. In this dissertation, we analyze the impact of processor intermittent faults on programs and project our findings on the design of error detection, diagnosis and recovery mechanisms. 
To achieve this, we subject SPEC2006 benchmarks to an intermittent fault injection campaign at the microarchitecture level. We find that 53% of the intermittent errors lead to program crashes and should therefore be mitigated by the appropriate error tolerance approaches. As intermittent faults are mostly non-benign and are likely to recur (unlike transient faults) the intermittent-error prone hardware component should be identified and a recovery approach should be considered. However, intermittent faults (unlike permanent faults) are non-deterministic and may not appear during testing. Therefore, most existing error tolerance approaches for transient and permanent errors cannot be used for intermittent errors. To find the most effective error tolerance approaches for intermittent errors, we first study error recovery. We do not investigate intermittent error detection approaches because error detection techniques (unlike error diagnosis techniques) do not usually depend on error type [41]. Fur3  ther, designing error diagnosis technique should be preceded by an error recovery study. This is because we want to find the satisfactory diagnosis granularity (i.e., identifying faulty core vs. faulty micro-architectural unit). Recovery from intermittent errors involves rolling the program back to the last stored checkpoint and potentially disabling the faulty component of the processor to prevent future occurrence of the error. Multiple proposals have been introduced about how to selectively disable processor components that experience permanent faults. These proposals suggest using unit replications or redundant pipeline stages [26] [30], borrowing functional units from other cores or migrating execution threads when necessary to other cores [59] [64]. However, it is unclear if component disablement is essential for intermittent errors, and at which granularity this disablement should happen. To find the most effective recovery actions for intermittent errors, we model an error-prone processor that is equipped with error detection, diagnosis and different recovery scenarios upon failures. We show that the frequency of an intermittent error and the relative importance of the defective component in the processor play an important role in finding the recovery action that maximizes the processor’s throughput. We also find that disabling the errorprone micro-architectural unit leads to better throughput than disabling the entire core for some fault rates. Therefore, we choose to pursue error diagnosis at the micro-architecture level. Error diagnosis can be performed in software or in hardware. The main advantage of hardware-based techniques is that they require no changes to the software. However, they incur power and area overheads even if the hardware is fault-free. On the other hand, software-based diagnosis initiates diagnosis only for errors that actually propagate to the software, thus incurring power overhead only during the diagnosis. Therefore, we perform error diagnosis using software-based techniques. We propose a technique to Diagnose Intermittent hardware Errors in microprocessors by Backtracing Application state at the time of a failure (DIEBA). DIEBA starts from the failure dump of a program (due to an intermittent error) and identifies the error-prone functional micro-architectural unit with an accuracy of 70%3 . 
3 Accuracy here is the percentage of failures in which DIEBA can accurately identify the error-prone functional unit.

For the rest of this chapter, we present our contributions (Section 1.1) and lay out the outline of this dissertation (Section 1.2).

1.1 Contributions

Our research contributions are as follows:

• Characterize the impact of intermittent faults on programs using a micro-architectural simulator. We modify this simulator to serve as a fault-injection tool. We study the effect of intermittent error parameters, such as location and fault model, on the severity of the error consequences. Further, we project the results of our study onto intermittent error detection, diagnosis and recovery techniques.

• Build intermittent fault models that abstract real intermittent faults at the micro-architecture level and at the system level. Intermittent faults are fundamentally different from permanent and transient faults; therefore, they cannot be accurately modeled using permanent or transient fault models. Due to the lack of information about intermittent faults' exact characteristics, it is challenging to model these faults at a high level. Based on the related work in this area [18, 23, 28, 38, 63], we build multiple fault models that abstract physical fault models. At the same time, we prune down the space of system configurations to a manageable set of parameters that we can simulate.

• Model and simulate an error-prone system that is equipped with error tolerance techniques to evaluate the throughput of a processor after applying different recovery scenarios. These scenarios include permanent and temporary disablement of cores and micro-architectural units of a processor. Moreover, we study the sensitivity of our results to the models' parameters.

• Introduce a software-based diagnosis technique (DIEBA), which starts from the failure dump of a program (due to an intermittent error) and identifies the faulty micro-architectural unit based on the program's machine code. DIEBA requires no hardware support, and can be executed on a different core than the one that experienced the error (thus diagnosis does not pause the faulty core's execution after the failure).

1.2 Dissertation Outline

The main research question we ask in this dissertation is: How can we efficiently tolerate intermittent hardware errors using software-based techniques? To answer this question, we break it down into multiple research questions and dedicate a chapter to each of them. In this section, we present the outline of the main chapters in this dissertation, highlight the key research questions and summarize the methodology we follow to answer them.

1.2.1 Chapter 2: Intermittent Hardware Faults: Background, Modeling and Fault Injection

The research question that we address in this chapter is: How can we build reasonably accurate intermittent fault models that can be used in feasible simulations? Although RTL fault models are commonly used to model intermittent faults accurately, they are prohibitively expensive for simulating realistic workloads. In this chapter, we provide an overview of intermittent faults' causes and rates of occurrence. Then we build fault models at the micro-architecture level that we later use in Chapters 3 and 5. We also build fault models at the system level that we later use in Chapter 4.
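As a preview of the micro-architecture level models, the sketch below shows one way the fault models that Chapter 2 collects in Table 2.2 (stuck-at-zero, stuck-at-one, stuck-at-last-value and dominant-0/1 bridging) could be applied to a single bit of a unit's output, gated by the active and inactive durations of a fault burst (Figure 2.1). This is a minimal illustration, not the dissertation's actual injector: it is written in Python rather than inside SimpleScalar, the function names are invented here, and the bridging semantics (combining the faulty bit with an adjacent aggressor bit) is an assumption made only for this sketch.

```python
# Illustrative sketch only: apply an intermittent fault model to one bit of a
# unit's output value, but only while the fault burst is in an active period.

def fault_is_active(cycle, burst_start, active_len, inactive_len, burst_len):
    """True if `cycle` falls in an active period of a burst that alternates
    active (tA) and inactive (tI) durations for a total of tB cycles."""
    offset = cycle - burst_start
    if offset < 0 or offset >= burst_len:
        return False
    return (offset % (active_len + inactive_len)) < active_len

def apply_fault(value, bit, model, last_value=0, neighbor_bit=0):
    """Corrupt `bit` of `value` according to the named fault model.

    Models follow Table 2.2: stuck-at-zero/one, stuck-at-last-value and
    dominant-0/1 bridging. The bridging models are sketched as the faulty bit
    being ANDed/ORed with an adjacent (aggressor) bit, an assumption made here.
    """
    mask = 1 << bit
    if model == "stuck-at-zero":
        return value & ~mask
    if model == "stuck-at-one":
        return value | mask
    if model == "stuck-at-last-value":
        return (value & ~mask) | (last_value & mask)
    if model == "dominant-0":      # aggressor value 0 pulls the bit low
        new_bit = ((value >> bit) & 1) & neighbor_bit
    elif model == "dominant-1":    # aggressor value 1 pulls the bit high
        new_bit = ((value >> bit) & 1) | neighbor_bit
    else:
        raise ValueError("unknown fault model: " + model)
    return (value & ~mask) | (new_bit << bit)

# Example: a stuck-at-zero fault in bit 3 of an ALU result during an active
# cycle turns 15 (0b1111) into 7 (0b0111).
if fault_is_active(cycle=105, burst_start=100, active_len=10,
                   inactive_len=5, burst_len=60):
    corrupted = apply_fault(15, bit=3, model="stuck-at-zero")
```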
1.2.2  Chapter 3: Characterizing the Impact of Intermittent Hardware Faults on Programs  In this chapter, we address the following research question: What is the impact of intermittent hardware faults on programs? To answer this question, we quantify and analyze the effect of intermittent faults on a set of SPEC2006 benchmarks using our micro-architecture level fault injection tool. We find that more than 67% of intermittent faults are non-benign. In particular, 80% of the non-benign intermittent faults result in program crashes and the remaining 20% of the errors result in Silent Data Corruptions. Therefore, the dominant effect of intermittent faults is to crash the program, and it is necessary to build techniques to tolerate intermittent errors that propagate to software.  6  1.2.3  Chapter 4: Recovery from Intermittent Hardware Faults: Modeling and Evaluation  The research question we ask in this chapter is: How can we optimize the throughput of a processor for a given intermittent error rate? In this chapter, we find the recovery approaches that maximize the processor throughput for different error rates and error locations. For errors that require reconfiguration of the processor, we want to find the exact granularity of the reconfiguration. Later in this dissertation, we use this result to guide the design of an intermittent error diagnosis technique. Approach To find the most effective intermittent error recovery, we build a model of an intermittent error-prone chip multiprocessor using Stochastic Activity Networks [66]. This model includes error detection, discrimination (from other types of hardware errors), checkpointing, diagnosis and recovery. Moreover, we use the system level fault models that we build in Chapter 2. Using both the processor and the fault models, we evaluate the effectiveness of different recovery techniques for errors with different frequencies. The recovery scenarios that we considered include (1) program rollback to the last checkpoint only, (2) program rollback to the last checkpoint and permanently disabling the defective component of the processor and (3) program rollback to the last checkpoint and temporarily disabling the defective component of the processor. To quantify the importance of the defective component to the running program we proposed a metric called the component rank, or the maximum percentage of processor throughput that is lost when the corresponding component is disabled upon error recovery.  1.2.4  Chapter 5: DIEBA: Diagnosing Intermittent Errors by Backtracing Application Failures  The next research question we explore is: Can we develop low-overhead, softwarebased diagnosis methods for intermittent errors to identify the error-prone micro-architectural unit in a processor? We propose a software-based technique to Diagnose Intermittent hardware Errors in microprocessors by Backtracing Application state at the time of a failure  7  (DIEBA). DIEBA does not impose area overhead and does not require the error to be reproducible. We focus on faults that occur in the micro-architectural units in a core, and have either resulted in program failure or have been detected through other mechanisms. In the following, we present an overview of how DIEBA works. DIEBA is designed to use software-based deterministic replay of the failing run to identify data values that are affected by an error. We use program instrumentation to ensure that the control flow of the failure run matches that of the fault-free run. 
Then the technique propagates the erroneous values a few hundreds of instructions prior to the failure location. Finally, it isolates the defective unit by looking for the unit used by most propagation paths. Because DIEBA requires neither online monitoring of the application (except for a small amount of logging to be performed by the application) nor additional testing on the faulty core, it incurs low power and performance overheads during fault-free operation. To the best of our knowledge, DIEBA is the first approach to diagnose intermittent hardware errors without running additional tests or requiring any hardware support. In the remainder of the dissertation, we demonstrate how different phases of error tolerance techniques work together to improve the reliability and productivity of processors in Chapter 6. We survey the related work in this area in Chapter 7. Finally, we conclude and outline our future work in Chapter 8.  1.3  Conclusions  Intermittent errors are emerging as one of the leading causes of hardware failures. Hence, there is a critical need to design optimal diagnosis and recovery techniques for these errors. The goal of this dissertation is to find efficient approaches to mitigate intermittent hardware errors using software-based techniques. In the following chapters, we analyze the impact of intermittent faults on programs. We then evaluate different recovery options for intermittent errors and show that we need to consider the frequency and location of the error before taking a decision in disabling the errorprone component of the processor. We then propose DIEBA, a software-based technique to diagnose intermittent errors that occur in the processor’s functional units and lead to program crashes or error detection.  8  Chapter 2  Intermittent Hardware Faults: Background, Modeling and Fault Injection In this chapter, we define intermittent faults, discuss their root causes and rates of occurrence (Section 2.1). We then present intermittent fault models at the microarchitecture level and at the system level (Section 2.2). Finally, we build our fault injection tool based on the micro-architectural fault models (Section 2.3).  2.1  Background  Definition We define an intermittent fault as one that appears non- deterministically at the same hardware location, and lasts for one or more (but finite number of) clock cycles. This is consistent with definitions in prior work [15, 81]. The main characteristic of intermittent faults that distinguishes them from transient faults is that they occur repeatedly at the same location, and are caused by an underlying hardware defect rather than a one-time event such as a particle strike. However, intermittent faults appear non-deterministically (unlike permanent faults) and only under extreme operating conditions. Similar to Gracia et al. [24], we characterize an intermittent fault using the following parameters (Figure 2.1):  9  • Fault length: the full duration of a fault (tL ). • Fault active duration: the duration at which the fault manifests itself to the instructions that use the defective hardware part (tA ). • Fault inactive duration: the duration at which the fault does not manifest itself to the instructions that use the defective hardware part (tI ). • Fault burst: a set of active and inactive durations (tB ). • Fault location: the physical location of the fault. It consists of a microarchitectural unit and a bit. • Fault model: a representation of the fault at the micro-architecture or system level. 
Different models represent different root causes. More details are in Section 2.2.  Figure 2.1: Fault model used in this dissertation.  Causes The major cause of intermittent faults is device wearout, or the tendency of solid-state devices to degrade with time and stress. Wearout can be accelerated by aggressive transistor scaling which makes processors more susceptible to extreme operating condition such as voltage and temperature fluctuations [12, 43]. Many well studied transistor failure mechanisms such as dielectric breakdown, negative bias temperature instability (NBTI), hot-carrier injection (HCI) and electromigration [43] occur due to wearout. Table 2.1 defines the major failure mechanisms and their causes. Different wearout mechanisms affect processors in different ways. For example, dielectric breakdown and NBTI manifest themselves as delays in transistor switching time. Moreover, HCI results in a decrease in drain current and slower circuit. Finally, stress migration and electromigration result in 10  Table 2.1: Description of different intermittent failure mechanisms. Failure Cause Dielectric breakdown  Negative bias temperature instability  Hot carrier injection  Electromigration  Manufacturing defects  Description Thin gate oxide results in the gate becoming more vulnerable to high voltages. Manufacturing traps in these gates get invariably charged as current flows in the oxide. These charged traps start conducting large current with time. This current leads to thermal damage to the transistor and creates more traps. Over time, the traps accumulate and create a conductive path between the transistor metal and substrate. This path increases leakage current in the transistor and leads to soft gate-oxide breakdown. With persistent stress conditions more traps formulate, and over time this conductive path becomes a cross section throughout the gate and connects current from the substrate to the metal. This is called hard gate-oxide breakdown [18]. Thin gate oxide and high temperatures may result in the silicon-hydrogen bonds in the interface between the gate oxide and the substrate to break. This releases hydrogen ions and the vacant positions in the gate can formulate holes. These holes change the transistor characteristics by reducing the threshold voltage and reducing the transistor speed [78]. Short transistor channel lengths increase the transistor’s speed but at the same time increase the electrical field in the channel. With high voltages, this may lead to some electrons or holes gaining enough kinetic energy to get injected from the substrate into the gate oxide. This leads to a degradation in the transistor’s threshold voltage [76]. Small wire geometry and high electrical fields may lead the metal atoms to migrate from one place (which may lead to open circuit) and pile up in other places (which may lead to short circuit with other wires) [3]. This includes manufacturing variations (process, voltage and temperature variations) [12] and residues in cells [15]. 11  open and short circuits. Studies have shown that errors generated by NBTI are intermittent because up to 40% of the NBTI damages are reversible after stress removal (typically, high temperatures) [21] [78]. The remaining 60% of the damages are permanent (similar behavior has been observed for HCI) [21]. 
Time-dependent dielectric breakdown (TDDB) results in intermittent errors because it appears as Soft Dielectric Breakdown (SDB) at nominal operating conditions, then progresses to Hard Dielectric Breakdown (HDB) if the stress conditions persist [18, 28]. In general, in-progress wearout faults are intermittent as they depend on the operating conditions and the circuit inputs. In the long term, these faults may eventually lead to permanent defects. Another cause of intermittent faults is manufacturing defects that escape VLSI testing [15]. Often, deterministic defects are flushed out during this testing and the ones that escape are non-deterministic defects, which emerge as intermittent faults. Finally, design defects can also lead to intermittent faults, especially if the defect is triggered under rare scenarios or conditions [80]. We do not model design defects because they do not usually follow specific patterns, i.e., each defect affects the processor in a different way. Occurrence Rates Few field studies have been conducted to monitor intermittentfaults rates. Constantinescu [15] found that 6.2% of the hardware errors in memory subsystems are intermittent. However, he did not present any data for processor faults. A study by Nightingale et al. [48] analyzed error logs sent by Microsoft Windows Error Reporting program from 950,000 personal computers. They found that, of the hardware errors reported about microprocessors, approximately 39% are intermittent. However, they only consider a small class of processor errors, and only those that cause operating-system crashes. As a result, they under-estimate number of intermittent errors that occur in the field. Fault distribution: Studies that focus on wearout failures found that the distribution of aging failures follow either lognormal [10, 30, 64] or Weibull [68] distributions. These studies focus on the phase in which the aging errors persist for long periods and eventually lead to permanent errors. The focus of this dissertation is  12  on intermittent faults that have not progressed to permanent faults yet.  2.2  Fault Models  We rely on prior work to build approximate intermittent fault models both at the micro-architecture level (Section 2.2.1) and at the system level (Section 2.2.2). The intermittent faults considered in this dissertation are caused by temperature/voltage fluctuations, wearout and manufacturing defects (for micro-architectural models only). We do not consider intermittent faults caused by design defects due to lack of information about their distribution. Moreover, the focus of this dissertation is on faults that occur in the processor. We do not consider faults that occur in the memory and the input/output hierarchy. Memory and input/output hierarchy faults are usually tolerated by Error Correcting Codes and Cyclic Redundancy Checks. We make the following assumptions in our fault models: • At any time, a micro-architectural unit may be affected by at most a single intermittent fault. • At any time, at most a single micro-architectural unit may be affected by an intermittent fault. These two assumptions are justified because intermittent faults are relatively infrequent compared to the typical execution time of many applications.  2.2.1  Micro-architectural Fault Models  In Chapters 3 and 5 we use micro-architectural simulation. Hence, we need to model intermittent faults at this level. This task is challenging because we need to abstract the effects of the faults to the micro-architecture. 
Unfortunately, to the best of our knowledge, there has not been prior work in this area. Most prior work that model in-progress wearout faults, do so at the RTL-level or gate-levels, where faults manifest themselves as delays in transistor-switching times [35]. However, the need to run real workloads and monitor the execution for large number of cycles makes it impossible for us to use these low-level simulators. Therefore, we come  13  up with a micro-architectural fault model in Table 2.2. We make the following assumptions in the modeling: • In-progress wearout effects (Dielectric breakdown, NBTI, HCI and electromigration) manifest themselves as stuck-at last value (Smolens et al. [74]). We assume that the delay caused by the fault is longer than a clock period; otherwise it will not affect the output of a transistor. • Electromigration faults also manifest themselves as intermittent stuck-at zero/one and dominant 0/1. Although these faults occur in interconnects, we assume that they will ultimately lead to similar faults in the register destinations of the interconnects. • Manufacturing defects manifest themselves as intermittent stuck-at zero/one and dominant 0/1 (Gracia et al. [24]).  2.2.2  System Level Fault Models  In Chapter 4, we study different recovery techniques for intermittent errors using Stochastic Activity Networks. In this section, we rely on prior work to build approximate intermittent fault models at the system level. Moreover, we study the sensitivity of our results to fault models and failure rates (Section 4.3.1). We consider three fault models to characterize intermittent faults. Each fault model represents faults in a processor that is vulnerable to specific kinds of failure. The base fault model represents the case when faults occur rarely and is used by other fault models as a baseline to study the processor’s performance and availability when error tolerance techniques are invoked infrequently. The exponential fault model represents faults in a processor that is sensitive to voltage/temperature fluctuations but does not age. The Weibull fault model represents faults in a processor that is sensitive to voltage/temperature fluctuations and it ages. Hence, the Weibull model is the most realistic model of the three. Both exponential and Weibull fault models are used to compare the impact of the processor’s aging on the processor’s performance and availability in the short term (Chapter 4). In the following, we describe our models in detail and we show the parameters we use in Table 2.3:  14  Table 2.2: Intermittent faults’ causes and models. 
Fault Mechanism | Causes | Gate-level Models | Micro-architectural Models
Dielectric breakdown | Infant-mortality, Thin gate oxide, High voltage | Intermittent delay | Intermittent stuck-at-last-value
Negative bias temperature instability | Thin gate oxide, High temperature | Intermittent delay | Intermittent stuck-at-last-value
Hot carrier injection | Short channel length, High voltage | Intermittent delay | Intermittent stuck-at-last-value
Electromigration | Small wires geometry, High temperature, High current density | Intermittent delay; Intermittent open; Intermittent short | Intermittent stuck-at-last-value; Intermittent stuck-at-zero/one; Dominant-0/1 bridging
Manufacturing defects | — | Intermittent open; Intermittent short | Intermittent stuck-at-zero/one; Dominant-0/1 bridging

• Base fault model: In this model, we assume that each core experiences faults with inter-arrival times that are modeled according to a Homogeneous Poisson Process. More formally, the probability of exactly n faults occurring within an interval of length t is [50]:

Pr_n(t) = (e^{-λt} (λt)^n) / n!

where λ is the constant fault arrival rate, measured as 1/T, and T is the Mean Time To Failure (MTTF) measured in years. Each fault event is modeled as a single fault; there are no fault bursts in this model1, as we assume that the machine is hardened against temperature and voltage fluctuations and wearout. Therefore, intermittent faults in this model occur rarely. These machines can be fabricated using large transistors, for example.

1 We define a fault burst as the duration of time during which a number of faults may hit a processor. A fault burst in Figure 2.1 (tB) is represented using a set of active and inactive durations.

• Exponential fault model: Similar to the base fault model, fault inter-arrival times in each core are modeled using a Poisson distribution with an MTTF of T years. In contrast to the base fault model, during the core's lifetime it experiences high temperature and voltage fluctuations, but these fluctuations do not permanently affect its performance. In other words, transistors in this model are sensitive to temperature/voltage swings but they do not wear out. Fault events that result from these hikes occur in the form of bursts of constant duration X minutes; we model burst inter-arrival times using a Homogeneous Poisson Process with an MTTF of Y hours. During each fault burst, active-fault inter-arrival times are also assumed to follow a Homogeneous Poisson Process with an MTTF measured in Z seconds. The duration of each active fault is measured in W seconds. The random variable for this model is similar to the one described for the base fault model.

• Weibull fault model: This model is similar to the base and exponential fault models in that each core is affected by faults that arrive exponentially with an MTTF of T years. Also, during the lifetime of each core, it experiences extreme temperature/voltage instabilities. However, in this model the hikes in temperatures and voltages will lead to transistor wearout [18, 21, 28, 63]. We model aging-fault arrival times using a Weibull distribution with a shape parameter k of 1.4 [11] (failure rate increases with time) and a variable scale parameter λ that depends on the MTTF (which is Y hours), as follows [50]:

Pr(t; λ, k) = (k/λ) (t/λ)^{k-1} e^{-(t/λ)^k}

Fault events are modeled as bursts of duration X minutes. Similar to the previous model, arrival times of active faults during each burst are modeled using an exponential distribution with an MTTF of Z seconds.
The duration of active faults is measured in W seconds. Table 2.3: Parameters used in system level fault models. Fault Parameter T X  Y Z  W  2.3  Value/Range Comments 6.56 years Found by Nightingale et al. [48]. 15 minutes No field studies available about this. Found experimentally to be an interesting duration. 2-40 hours See Section 4.3.1 for a sensitivity study. 1 second No field studies available about this. Found experimentally to be an interesting duration. 1 second Same as previous.  Fault Injection Tool  Our experiments in Chapters 3 and 5 use a fault-injection tool that we developed. Our fault injection tool is based on the SimpleScalar simulator [14], which is a cycle-accurate microprocessor simulator. To the best of our knowledge, there is no publically available tool that can inject intermittent faults at the micro-architecture level and at the same time provide high observability on the processor state. This is why we developed a fault-injection tool. 17  Many fault-injection tools have been developed both in academia and industry. This is because fault injections are more detailed and hence more accurate than analytical modeling, which by necessity cannot be as detailed to obtain tractable models [79]. A key challenge in our study is to find a simulation environment that is (1) feasible (simulation finishes in reasonable time), (2) accurate (good representation of the actual faults), (3) observable (to monitor the internal program and processor states) and (4) repeatable (to perform replays as needed in Chapters 3 and 5). Traditionally, fault injection tools are classified into [27] (1) hardware fault injection tools, where the fault is introduced by plugging the processor with special hardware components or by exposing the processor to certain physical conditions such as cosmic rays [34] and (2) software fault injection tools where faults are injected by emulating the effects of low-level faults at compile-time or run-time [32]. In prior research projects, logic-level or RTL-level simulations (i.e., softwarebased injections) have been used to study intermittent faults [24, 74]. However, due to the very detailed simulations, only toy-programs can be used in these simulations. Since the goal of our study is to characterize real benchmarks (SPEC2006), we use a relatively fast micro-architectural simulation. At the same time, we implement reasonable fault models to represent intermittent faults at the microarchitecture level (see Table 2.2 for more details about our fault model). We build our fault-injection tool based on the micro-architecture level SimpleScalar simulator (Alpha sim-outorder) [14, 47]. We choose this simulator because it models the majority of the micro-architectural components of interest to our study and is widely used in the computer architecture community. In addition to the features already available in SimpleScalar simulator, our tool adds the following functionalities: Inject  intermittent hardware faults at the micro-architecture level. The user can  specify the following parameters for every fault: (1) Fault start and end cycle. (2) Fault location. It consists of a combination of a micro-architectural component and a bit position within the component’s output. 
The available components are (a) the fetch unit, both the fetched instruction and the PC, (b) the destination register of the integer ALU, multiplier and divider, the destination register of the 18  FP adder, multiplier, divider, comparator, square root, and FP-to-integer converter, (c) the load-store unit read address, write address and data, and (d) the load-store queue data and address. (3) Fault activity and inactivity durations. (4) Fault models which include Stuck-at-one/zero/last-value and Dominant0/1. Emulate hardware traps. The implemented traps are divide by zero, data overflow, invalid instruction opcode, invalid PC, invalid memory address and invalid memory alignment. This is required for accurate accounting of number of program crashes. Basic SimpleScalar simulator does not have this feature. Record crash dump file in the event of a crash. This file includes the contents of the register file, memory footprint and PC at crash time, hardware-trap type and crash distance2 . Our simulator also records a detailed trace of the running program. The trace consists of a list of executed instructions, with the following information recorded for each instruction: instruction type (unary, binary, jump, branch, load or store), PC, input registers values and output registers values. Replay the execution and analyze the effects of an intermittent error that caused a crash by using a faulty run and the corresponding replay. The output of this analysis includes: (1) propagation set3 , (2) masked nodes set (MNS) and (3) reused registers (RR)4 . Moreover, this analysis can distinguish between a crash (activation of a hardware trap), silent data corruptions (SDC, erroneous data in registers and memory when program finishes) and benign runs (injected faults with no erroneous data). 2 Crash distance is the number of dynamic instructions that execute from the start of the error until  program crashes. More details are available in Chapter 3. 3 Propagation set is the set of data values affected by the error. This includes erroneous registers, memory locations and jumps. More details are available in Chapter 3. 4 Both masked nodes set and reused registers are used to quantify how many erroneous dynamic values are available at the failure time. More details are available in Chapter 3.  19  2.3.1  Fault Injection Implementation  In this section, we show how we inject faults into different units in SimpleScalar simulator. We first give an overview of the structures and functions used by instructions in SimpleScalar then we show the exact locations where faults are injected. In SimpleScalar simulator, instructions are fetched from the instruction cache in ruu fetch function. Each instruction is then dispatched from the fetch data data structure to a dispatch queue in ruu dispatch function. Then it is decoded using macro MD SET OPCODE. During the decode stage, logic/arithmetic and control instructions allocate an entry in Register Update Unit (RUU) circular queue. Further, load-store instructions allocate an entry in the RUU for the effective address computation and save the PC, memory address and data (in case of a store instruction) in the Load-Store Queue (LSQ). Instructions await in the RUU and LSQ until their register or memory operands are ready and a hardware resource(s) is available. When the operands and the resouces are ready, the instructions are forwarded to the execution units or the load-store unit. Instruction execution is done using macros named after the instruction name. 
After the execution is done, function ruu commit commits the results of the oldest completed entries from the RUU and LSQ to the register file, while store instructions commit their data to the data cache. Finally, function ruu release fu releases the used resources and function ruu writeback walks through pending operand dependencies. If the operands of any pending instruction are ready, the instruction is added to the ready instruction queue. In Figure 2.2 we show a simple diagram of SimpleScalar pipeline. The injection locations are highlighted in bold font. Table 2.4 shows the exact fault injection locations in SimpleScalar code.  2.4  Conclusions  We present an overview of intermittent faults’ definition, causes and rates of occurrences. We find that intermittent faults are caused by a variety of physical causes, and that they have been modeled at the physical or gate level in prior work. However, this makes higher-level experiments with realistic benchmarks prohibitively expensive. Hence, we build reasonable models at the micro-architecture level and system level that capture many properties of intermittent faults. Based on the fault 20  Figure 2.2: SimpleScalar pipeline. Names in bold font represent injection locations. Table 2.4: Fault injection locations, the corresponding signal names in SimpleScalar pipeline diagram (Figure 2.2) and locations in SimpleScalar code . Fault Location Fetch Unit (instruction, PC)  Injection Signal Fetch-Inst, PC  Integer ALU, multiplier, divider and FPU LSU (data, read address, write address) LSQ (PC, address)  Dest-Reg  LSU-Data, Addr  LSU-  LSQ-PC, Addr  LSQ-  Location in SimpleScalar fetch data[fetch head].IR and fetch regs PC, respectively. the destination register of the corresponding instruction macro. the data/memory address of the corresponding instruction macro. lsq->PC and lsq->addr, respectively.  models, we build fault-injection tool at the micro-architecture level. This tool will be used to characterize the impact of intermittent faults on programs in Chapter 3 and evaluate our diagnosis technique in Chapter 5.  21  Chapter 3  Characterizing the Impact of Intermittent Hardware Faults on Programs In this chapter, we highlight the main characteristics of intermittent faults and the implications of these characteristics on error detection, diagnosis and recovery. Further, we study the impact of intermittent faults on programs as a first step towards building software-level mechanisms for isolating the source of intermittent faults (see Section 3.4 for more details). Prior work has analyzed the effects of transient faults [25, 36, 73] and permanent faults [33, 38] on programs. These analyses played a vital role in designing fault-tolerance techniques that mitigate the effects of errors and at the same time minimize performance and area overhead. To the best of our knowledge, ours is the first comprehensive study of intermittent faults using fault injections and a realistic set of benchmarks. Motivation A hardware fault can affect a program in different ways. The fault may be benign (changes on program state are benign), cause silent-data corruption (changes on program state are erroneous) or lead to a hardware trap (i.e., program crashes). Even if an error does lead to a program crash, the crash may not occur  22  immediately after the onset of the error, but only after some amount of time later. In the meantime, the error may propagate and corrupt the program state even more. 
It is important, in this context, to ask: (1) what is the fraction of intermittent faults that lead to program crashes? (2) For the faults that do lead to crashes, how much do they propagate within their programs before the crashes? (3) How useful is the program state at failure time for the software-based fault diagnosis techniques? We focus on errors that lead to program crashes in this study because our experiments indicate that 80% of non-benign intermittent errors result in program crashes. Further, we focus on intermittent error diagnosis rather than detection and recovery because most diagnosis techniques that already exist in literature for hardware errors assume that the error is reproducible which does not apply to intermittent errors. In this chapter, we answer the questions asked above by subjecting a subset of the SPEC2006 benchmark programs to a fault-injection campaign at the microarchitecture level (Chapter 2)1 . The contributions of this chapter are as follows: (1) Characterizing the impact of intermittent faults on programs through faultinjection experiments and SPEC2006 benchmarks. (2) Evaluating the potential of using software-based fault diagnosis mechanisms for intermittent faults. (3) Studying the relationships between intermittent-faults parameters such as location and model on the severity of the fault consequences. (4) Projecting the results of our study on intermittent-errors detection, diagnosis and recovery. The main results of the study are as follows: • Between 41% and 63% (varied by benchmark) of the faults we injected led to program crashes. Of the remaining fault injections, between 22% and 43% were benign, and only about 14% of injected faults resulted in silent-data corruption, i.e., incorrect output. • 96% of the crash-causing errors lead to crash within one hundred thousand dynamic instructions. 1 A version of this work was published in the Pacific Rim International Symposium on Dependable  Computing [61].  23  • 87% of the crash-causing errors corrupt less than 500 data values before program crash. • 42% of the corrupted data are not masked by other correct data at program crash. We discuss the impact of these results on software-based fault-tolerance techniques (i.e., detection, diagnosis and recovery) in Section 3.4.  3.1  Methodology  In this section, we describe the metrics we use to quantify the actual and masked impact of intermittent faults in programs (Section 3.1.1). Then we demonstrate our evaluation criteria through an example (Section 3.1.2).  3.1.1  Terms and Definitions  We define the metrics and terms used in our experiments: Node is a value produced by a dynamic instruction during program execution. A node can be a data value or a memory address. Further, a node can be read multiple times but is written only once during execution. Faulty run is a program execution during which an intermittent fault is injected. Replay is a re-run of the instructions that executed during the faulty run. No faults are injected during the replay. Crash Node is the node at which a program crashes due to an intermittent error. Crash Distance or CD is the number of nodes that are generated by a program from the start of an intermittent fault until a crash node is reached. Intermittent Propagation Set or IPS is the set of nodes to which an intermittent error propagates until a crash node is reached. Masked Nodes Set or MNS is the set of nodes to which an intermittent error propagates until a crash node is reached. 
However, these nodes contain correct values because of error masking. Error masking can happen due to (1) logical instructions, such as AND/OR instructions, (2) test and set instructions, such as branch if less than zero instruction (BLTZ) or (3) the fault model does not change the bit value, e.g. a stuck-at-zero in a bit whose value is zero. 24  MNS can be found by comparing the node values in the IPS in the faulty run with the corresponding values in the replay. If an erroneous node has the same value in faulty run as that in the replay, then the erroneous value has been masked. Faulty Registers is the set of registers that have erroneous values when a crash node is reached. Re-used Registers or RR is the number of registers that had erroneous values but were overwritten by correct values before a crash node is reached. RR can be found by RR = |Registers(IPS) − Registers(MNS) − Registers(FaultyRegisters)|, where Registers(s) means the set of destination registers used to store s set of nodes.  3.1.2  Example  We demonstrate how to compute the CD, IPS, MNS and RR of an intermittent error using an example code fragment shown in Table 3.1. The second column of the table shows the node values that result from executing the corresponding instruction in a faulty run (the run affected by the error). The third column shows the node values that result from executing the corresponding instruction during the replay run (error-free run). The last column shows the mapping from instructions to nodes. Note that, for this example only, we assume a loop-free program and hence there is a one-to-one mapping between nodes and instructions. In the example code fragment, elements at indices 15 and 45 of an array (starting at Array Addr) are loaded into registers R4 and R5, respectively (node 1, 2, 4, 5). Then, these two registers are added and the result is stored in R6 (node 6). An immediate value of 77 is stored in R3 (node 3). The result in R6 is then logically ANDed with R3 (node 7). Next, an immediate value of 23 is stored in R1 (node 8). Finally, R6, which is an index in the same array, is used to load the corresponding array-item to R2. Assume that an intermittent error of type stuck-at-zero affects the third bit in the destination register of the first three instructions in Table 3.1. This error results in a crash at node 9 due to invalid memory address. We want to compute CD, IPS, MNS and RR of this error. Recall that IPS is the set of nodes to which an intermittent error propagates.  25  Table 3.1: Code fragment to illustrate IPS, CD, MNS and RR computation. Code Fragment  Node Value at Faulty Run  Node Value at Replay  Node No.  mov R1, #15 mov R2, #45 mov R3, #77 ld R4, R1, Array Addr ld R5, R2, Array Addr add R6, R5, R4 and R7, R6, R3 mov R1, #23 ld R2, R6, Array Addr  11 41 73 681 504 1185 1 23 CRASH  15 45 77 10 9 19 1 23 950  1 2 3 4 5 6 7 8 9  Next, we will follow the error propagation for each of the first three erroneous instructions and add the erroneous nodes that result from the error propagation to a set that will form IPS. As for the error that affects node 1, it includes at least the node at which the error occurs, namely node 1. Node 1 has a successor node 4, node 4 is used to compute node 6. Node 6 is, in turn, used to compute nodes 7 and 9. Hence, IPS so far is {1, 4, 6, 7, 9}. Similarly, we can follow the propagation of node 2 and find that it manifests the error to nodes {2, 5, 6, 7, 9}. While node 3 propagates to nodes {3, 7}. 
Therefore, IPS = {1, 2, 3, 4, 5, 6, 7, 9} As for CD, it is the number of nodes generated from the start of the intermittent fault at node 1 until the crash node 9, i.e., CD = 8. To find MNS, we compare the values of all nodes that appear in IPS with the corresponding nodes in the replay. We find that all nodes have different values except for node 7, which has the same value in the faulty run and in the replay run. Hence, the error has been masked, by logical AND instruction, and MNS = {7}. To find RR, we check the destination registers for nodes that appear in IPS. We find that according to IPS, R1-7 should include erroneous data. Next, we check MNS and notice that R7 (node 7) has a masked error, therefore, R1-6 should include erroneous data. However, R1 is not in the faulty registers (registers whose values are erroneous at crash time). Formally, RR = |{1, 2, 3, 4, 5, 6, 7} − {7} − {2, 3, 4, 5, 6}| = 1. Therefore, RR = 1.  26  Figure 3.1: Fault model used in characterizing faults.  3.2  System Description  We use fault injections at the micro-architecture level to characterize intermittent faults. In this section, we describe our fault model and our experimental setup. Fault Model  We follow the fault model described in the previous chapter (Chap-  ter 2). However, at the micro-architecture level, we cannot model intermittent faults that recur in hours or days because it is prohibitively expensive. We rather focus on one fault burst, i.e., one set of active and inactive durations that occur within seconds (circled and in red in Figure 3.1). Each fault-injection experiment involves choosing the following fault parameters randomly (Table 3.2): • Fault location: We have injected faults into all micro-architectural units that are modeled in our simulator. We do not consider faults that occur in memory and caches as we assume that these are ECC protected. • Fault start: a single intermittent fault is started at a random cycle within 1 million dynamic instructions. • Fault length: we choose fault lengths that range from 5 to 1,000,000 cycles. • Fault active and inactive durations: due to the limited information known about intermittent errors characteristics, we were not able to find exact numbers about this parameter. Therefore, we experiment with many active/inactive durations that range from 5 cycles to 20,000 cycles. We choose these numbers because voltage fluctuations last from 5 to 30 cycles [31], while temperature fluctuations may last hundreds of thousands of cycles [72](voltage and temperature fluctuations are leading causes of intermittent faults). 27  • Micro-architectural fault model: our models are described in detail in Section 2.1.  Experimental Setup We use our fault injection tool (Section 2.3) to conduct our experiments. Our procedure to perform fault characterization has two steps: (1) running programs with and without injections for each experiment (generate a faulty run and the replay), and (2) analyzing the two runs together to learn about the effects of intermittent faults. We now describe the procedure in more detail. (1) Running benchmarks: For each experiment we run a benchmark twice. In the first run we inject an intermittent fault (faulty run), while in the second run we do not inject faults but rather replay the first run with the same instruction stream (replay run). For both runs, the tool collects program traces and the most recent contents of the register file and memory footprint. For each benchmark program we injected 3000 faults. 
We compute 95% confidence bounds on the results. Only one fault is injected in each execution to ensure controllability. We use the standard SimpleScalar simulator's parameters for the processor and memory configuration (Table 3.3).
(2) Analyzing runs: Our tool compares and analyzes the trace files, register files and memory footprints collected in the first step. The purpose of this analysis is to find how the run terminates (SDC, benign or crash). In the case of a crash, the tool finds CD, IPS, MNS and RR.
We conducted our experiments on an Intel Xeon E5345 Quad Core 2.33GHz system with 8GB of main memory and 4MB of L2 cache. We use 7 integer and 4 FP benchmarks from the SPEC CPU 2006 suite for our evaluation. We could not compile the rest of the SPEC2006 benchmarks to run in our Alpha simulator. Each benchmark was forwarded for 2 billion instructions to remove initialization effects. Then, an intermittent fault is injected. After the injection, the tool runs the benchmark for 1 million instructions². We do not run benchmarks to completion (as this is a very time-consuming process in our tool). We rather rely on analyzing the trace files of the faulty run and the corresponding replay to find if a fault results in SDC.

² This number is reasonable because we find that only 2% of the crash-causing faults have a crash distance of more than one million dynamic instructions (Section 3.3).

Table 3.2: Fault-injection parameters.
Fault Parameter | Value/Range
Location-bit | A bit chosen randomly from 0 to 63 in a micro-architectural unit.
Location-unit | Fetch Unit (instruction, PC), integer ALU, multiplier, divider, LSU (data, read address, write address), FPU and LSQ (PC, address).
Start cycle | A cycle chosen randomly from 1 to 1,000,000 cycles
Length | 5, 50, 1000, 50,000, 100,000, 500,000 or 1,000,000 cycles
Activity duration | 5, 50, 100, 500, 10,000 or 20,000 cycles
Inactivity duration | 5, 50, 100, 500, 10,000 or 20,000 cycles
Micro-architectural Model | Stuck-at-one/zero/last-value and Dominant-0/1

Table 3.3: Simulator configuration parameters.
Configuration Parameter | Value
ALUs/Multipliers/Dividers/FPUs | 1 from each type
Fetch/decode/execute/commit rate | 1 per cycle
Branch prediction type | perfect
Register update unit size | 16
Load-store queue size | 8
Register file | 32 integer regs, 32 FP regs
Instruction/Data L1 | 16KB each
L1 hit latency | 1 clock cycle
L2 (Unified) | 256KB
L2 hit/miss latency | 6/18 clock cycles

3.3 Results

In this section, we present the results of characterizing intermittent faults on programs by fault injection. We first study how a program that is affected by an intermittent fault terminates; we do this by classifying the terminations into crashes, SDCs and benign terminations. Since program crashes are the most common effect of an error, we dig more into this category by finding the CD and IPS for the crash-causing errors (Section 3.3.1). Next, we attempt to answer the second research question by analyzing whether the erroneous data in a program are partially erased by correct data at the time of the crash (note that if the erroneous data were completely erased then there would be no crash). This information is critical for any software technique that diagnoses an error by analyzing the program crash dump file. This factor is characterized by MNS and RR (Section 3.3.2). Finally, we study the effect of the various intermittent fault parameters on the severity of the error consequences. These parameters are fault length, fault model and fault location (Section 2.1).
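Before presenting the results, the core of the run-analysis step (2) above can be summarized in a short sketch. It is only an illustration of the definitions in Section 3.1.1, assuming a simplified trace record; the structure and function names are ours, not the tool's, and the __builtin_popcountll call assumes a GCC-compatible compiler.

```c
#include <stdint.h>
#include <stddef.h>

/* One record per node (dynamic value), as produced by the tracing described
 * in Section 2.3. Field names are illustrative, not the tool's. */
typedef struct {
    uint64_t dest_value;        /* value written by this node                */
    int      dest_reg;          /* destination register index (0-63), or -1  */
    int      uses_faulty_input; /* 1 if any input value belongs to the IPS   */
} node_t;

/* Compare the faulty run against the replay, starting at the node where the
 * fault became active, and compute CD, |IPS|, |MNS| and RR (Section 3.1.1).
 * faulty_regs_at_crash is a bitmask of the registers that hold erroneous
 * values at crash time. */
static void analyze_crash(const node_t *faulty, const node_t *replay,
                          size_t n_nodes, size_t fault_start,
                          uint64_t faulty_regs_at_crash,
                          size_t *cd, size_t *ips, size_t *mns, int *rr)
{
    uint64_t ips_regs = 0, mns_regs = 0;
    *ips = *mns = 0;
    *cd = n_nodes - fault_start;        /* nodes from fault start to crash   */

    for (size_t i = fault_start; i < n_nodes; i++) {
        int differs = faulty[i].dest_value != replay[i].dest_value;
        if (!differs && !faulty[i].uses_faulty_input)
            continue;                   /* the error did not reach this node */
        (*ips)++;                       /* node belongs to the IPS           */
        if (faulty[i].dest_reg >= 0)
            ips_regs |= 1ULL << faulty[i].dest_reg;
        if (!differs) {                 /* reached but masked                */
            (*mns)++;
            if (faulty[i].dest_reg >= 0)
                mns_regs |= 1ULL << faulty[i].dest_reg;
        }
    }
    /* RR = |Registers(IPS) - Registers(MNS) - Registers(FaultyRegisters)|   */
    *rr = __builtin_popcountll(ips_regs & ~mns_regs & ~faulty_regs_at_crash);
}
```

A run that never reaches a crash node is classified instead by comparing the final register file and memory footprint of the two runs: identical final state means a benign run, a different final state means an SDC.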
3.3.1  Impact of Intermittent Faults on How Programs Terminate, CDs and IPSs  In this experiment we inject 3000 faults in each benchmark, one at a time, using the parameters in Table 3.2. Program terminations: We find that 73% of the injected faults are activated3 (not shown in figures). Inactivated faults happen due to faults injected in bits that are not used or faults that do not change the bit’s value. Inactivated faults are disregarded in computing the percentages below - this is standard practice in fault injection studies. Out of the activated faults (Figure 3.2), an average of 53% of the errors lead to program crash, while 14% lead to SDC and 33% of the faults were benign (i.e., had no effect on the program). Thus, of the intermittent faults that are nonbenign (i.e., 67% of total faults), 79% result in a program crash. We now attempt to understand those crashes more by studying the values of the CD and IPS in the following experiment. 3 Activated  fault means that the injected bit was actually used by the program.  30  Figure 3.2: Intermittent fault impact on how programs terminate. The results are obtained with confidence interval of 95% and error margin of ±4% for crashes, ±2% for SDCs and ±4%for benign runs. Crash distances: Our results show (Figure 3.3) that an average of 50% of the injected faults crash within 100 dynamic instructions from the start of the error, 22% crash between 100 and 1,000 dynamic instructions, 24% between 1,000 and 100,000 dynamic instructions, 2% crash between 100,000 and 1,000,000 dynamic instructions and the remaining 2% crash after 1,000,000 dynamic instructions. These results show that the dominant effect of an intermittent error is to cause a program crash soon after the error occurrence. This motivates us to focus on crash-causing faults for the rest of this section. Intermittent propagation set: We now plot the cardinalities of the IPS in Figure 3.4. The IPS cardinality corresponds to the number of potentially corrupted nodes in the program, which in turn points to the severity of the failure. We find that an average of 68% of the crashes have IPS cardinalities of less than 100 nodes, 19% have IPS cardinalities between 100-500 nodes, 5% have IPS cardinalities between 500-1,000 nodes and the remaining 8% crash after corrupting more than 1,000 nodes. Therefore, the majority of the crash-causing intermittent errors cause limited change to the internal state of program. Summary: We conclude that the majority of intermittent errors have severe impact on programs (cause program crashes). However, most program crashes take place soon after the error occurrence and there is limited error propagation before 31  Figure 3.3: Crash-distance ranges for crash-causing intermittent faults.  Figure 3.4: IPS-cardinality ranges for crash-causing intermittent faults. the crash. The impact of this behavior on error diagnosis techniques in discussed in Section 3.4.  3.3.2  Impact of Error Masking  Software-based diagnosis techniques are likely to rely (completely or partially) on failure dump or the state of the program at the time of crash/error detection to learn about what caused the error and to find the best way to recover from that error4 . 4 Software-based diagnosis techniques can also collect data about program state during program execution, we do not cover dynamically collected data as it is technique dependent.  
32  Software-based error diagnosis techniques analyze clues that an error leaves on the state of a program at the time of failure/error detection. In this subsection, we quantify how many clues are erased prior to the program crash due to masking (MNS) and re-using registers (RR). Re-used registers: We study the number of re-used registers at a crash time, regardless of whether they affected other registers before they were overwritten. For example, if a faulty register #3 is used to update register #12 before it is overwritten with correct data, then we count it as re-used register. The results are shown in Figure 3.5. We find that, on average, 14% of registers  Figure 3.5: Distribution of number of correct, faulty and re-used registers at crash time. The results are obtained with confidence interval of 95% and error margin of ±3% for correct registers, ±2% for re-used registers and ±2% for faulty registers. are re-used, 12% of the registers are faulty because of error propagation, and the remaining 74% of the registers are not modified by the error. Therefore, out of the registers that store erroneous data at some point during program execution, 54% are overwritten with correct data, and the remaining 46% have erroneous data at crash time. These results can be explained by the short crash distances observed in Figure 3.3, which implies that programs affected by intermittent errors do not execute long enough after the error occurrence to overwrite large number of registers. Masked nodes set5 : We find that, on average, 58% of erroneous nodes are 5 Note  that RR and MNS are not inclusive, refer to Section 3.1.1 for more details.  33  masked and the remaining 42% of nodes will contain erroneous data at crash time (Figure 3.6). As we mentioned in Section 3.1.1, nodes masking occurs due to (1) logical instructions, such as AND/OR instructions, (2) test and set instructions, such as branch if less than zero instruction (BLTZ) or (3) the fault model does not change the bit value, e.g. a stuck-at-zero in a bit whose value is zero. Although the majority of the erroneous nodes are masked by the time the program crashes, the absolute number of nodes that are not masked in any run is 60, on average. Therefore, with careful design, software-based diagnosis technique may still be able to use erroneous data to tolerate faults with reasonable accuracy.  Figure 3.6: Distribution of number of masked and non-masked nodes at crash time. The results are obtained with confidence interval of 95% and error margin of ±3%. Summary: We find that 46% of the erroneous registers are not overwritten with correct data, and 42% of the erroneous nodes are not masked. Therefore, much of the data corrupted in a crashed program is intact and software-based diagnosis approaches are possible. Nevertheless, these approaches should be designed carefully to accommodate for the lost “error clues” on the program final state before crash (Chapter 5).  34  3.3.3  Impact of Intermittent Fault Properties on the Severity of the Faults  In this subsection, we study the relationships between the intermittent fault’s length, model and location on the way a program terminates. Since the observations are similar across all benchmarks, in this section we focus only on one SPEC2006 benchmark, namely, astar. Fault length: As for the fault’s length impact (Figure 3.7), we plot the same data that we collected in the previous subsection but we classify it according to the fault length. 
We note that short faults of 50 cycles lead to small percentage of crashes (34%). The percentage of crashes increases with longer errors until it reaches a threshold of 60% at error length of 50,000 and does not increase afterwards. This implies that the longer the error, the more nodes that are affected by it and the sooner the program crashes until the error length reaches a threshold. After the threshold an intermittent fault behaves more like a permanent fault irrespective of how much longer the fault lasts. We find that the increase in crashes saturated beyond a certain error length due to (1) errors injected into less critical locations, such as infrequently used entries in the load-store queue and (2) error models that are less likely to change the injected bit (e.g., stuck-at-zero). Fault model: In Figure 3.8, we show the same data we collected in the previous  Figure 3.7: Effect of intermittent fault’s length on how programs terminate. subsection, but this time we classify it according to the fault’s model. Our data shows how the percentages of crashes, SDCs and benign runs vary with the fault’s 35  model. We do not see a significant difference in number of crashes across different models. Stuck-at-one/last-value have relatively higher percentage of crashes ( 67%). This is because these two models are more likely to change the injected bit. Note that unused bits have zero by default. Therefore, stuck-at-one fault model will flip the unused bit, while stuck-at-last value will prevent updates to the bit if it is either zero or one. Fault location: We now show the same data we collected in the previous sub-  Figure 3.8: Effect of intermittent fault’s model on how programs terminate. section, but we classify it according to the fault’s location. In general, the location that is used more frequently in a benchmark will be the most vulnerable unit for that particular benchmark. For example, if the benchmark is memory-intensive, the LSU will be one of the critical units for that benchmark. However, we find that some locations are vulnerable for all benchmarks. We report results for one integer benchmark ( astar, Figure 3.9) and one FP benchmark ( dealII, Figure 3.10). Other integer and FP benchmarks show similar behavior to the one described above. In Figures 3.9 and 3.10, we show the activated faults for the locations that result in program crashes. Locations that are not depicted (e.g., FP-to-Integer converter) did not lead to significant number of crashes because these locations generate values that are not critical in programs or they did not lead to activated faults due to very infrequent use. Our data shows that the Fetch-Inst location results in relatively more crashes than other locations for both integer and FP benchmarks (82%, on average). More-  36  over, although FP benchmarks use the FPU unit heavily, most faults injected into the FP units result in benign faults. This is due to application masking which includes faults injected into bits that are not used by the application, especially the higher bits of a 64-bit registers and faults injected into registers used in evaluating logical operations. This result is not surprizing, e.g., Li et al. [38] reported 44% benign faults when permanent faults are injected into FP units.  Figure 3.9: Effect of intermittent fault’s location on how integer programs terminate ( astar in this figure).  Figure 3.10: Effect of intermittent fault’s location on how FP programs terminate ( dealII in this figure). 
Summary: Increasing intermittent-fault length causes an increase in the percentage of crashes until certain threshold fault length, beyond which the percentage 37  of crashes saturates. Increasing the fault length beyond the threshold will not cause any more crashes. The impact of the fault’s model and the fault’s location is largely determined by how much corruption the corresponding model causes (stuck-at-one causes more corruption than stuck-at-zero, for example) and the criticality of the location at which the fault occurs (IntALU causes more crashes than LSQ-PC for integer benchmarks, for example).  3.4  Discussion  Below, we discuss in more detail the impact of our characterizations on intermittenterror detection, diagnosis and recovery. Detection Our findings suggest that software-based detection of intermittent errors can be efficient because only 4% of intermittent errors propagate extensively (beyond one hundred thousand instructions) and hence require specialized detection techniques. However, since IPS cardinality is less than 500 data values for 87% of the crash-causing errors, one should carefully place error detectors to cover the critical error propagation paths [54]. Moreover, software-anomaly based detection mechanisms that monitor hardware traps as low-cost indications of errors would also be suitable for intermittent errors [38]. Diagnosis Isolating the error-prone micro-architectural components is possible using software-based techniques because (1) the propagation sets for intermittent errors are limited to a few hundreds of dynamic instructions, (2) about 42% of the erroneous data is not masked, and about 46% of the erroneous registers are not overwritten by correct data. This means that many error “clues” that can be found on the program state are intact and can be used in isolating the defective microarchitectural part. However, more work is needed to learn more about how much of the microprocessor state is observable at the software-level and which parts of the processor can be diagnosed using these high-level software-based techniques. We will investigate one software-based diagnosis technique in Chapter 5.  38  Recovery Checkpointing techniques on the order of a few hundreds of thousands of instructions will be effective in recovering from intermittent errors, because the crash distances of these errors are less than a few hundreds of thousands of instructions and hence the error is unlikely to corrupt a checkpoint. While restoring the program to the last checkpoint can be effective, it needs to be accompanied by hardware reconfiguration around the fault-prone component for some types of intermittent errors. We study different recovery scenarios for intermittent faults in the next chapter (Chapter 4) to understand when reconfiguration is required and at which level. Sensitivity In the sensitivity study, we found that both the intermittent fault length and location affect the percentage of faults that result in crashes. In particular, the processor’s front-end is the most vulnerable part of the processor for both integer and FP benchmarks. Therefore, the processor designer can reduce the number of crashes that are caused by intermittent faults by hardening the front-end components (using larger transistors, for example). Further, errors in the LSU/LSQ affect the memory addresses used to read/write data, and hence this unit is vulnerable to intermittent faults and should be protected as well. 
Further, we found that short faults lead to fewer crashes than longer ones. However, short faults may become longer if the extreme operating condition persist. Therefore, robust error detection techniques for short faults are necessary to avoid data corruption that would happen when the error becomes longer. On the other hand, if the processor is used for non-critical tasks and the application can tolerate limited data corruption, a short intermittent fault can be ignored (when a crash occurs, the program restored is to the last checkpoint) until the error progresses to permanent one.  3.5  Conclusions  In this chapter, we evaluated the impact of intermittent hardware faults on programs through fault-injection experiments. We monitored how the injected programs terminate and measured the crash distance and the error propagation for the failure (i.e., crash) causing errors. These factors are important because the further the 39  point of failure is from the error’s origin and the more the error propagates, the more difficult it is to tolerate this error. We find that the majority of the non-benign intermittent faults cause programs to crash. Further, the crash occurs within a hundred thousand dynamic instructions of the fault’s start for 96% of the crash-causing faults; hence large crash distances are infrequent. Finally, the dynamic data values corrupted by intermittent faults number less than 500 data values for about 87% of the crash-causing faults. In Chapters 4 and 5, we evaluate various recovery and diagnosis schemes based on these results.  40  Chapter 4  Recovery from Intermittent Hardware Faults: Modeling and Evaluation In this chapter, we focus on intermittent error recovery1 , i.e., the action that should be taken after detecting an intermittent error to remedy the effects of the error and prevent its occurrence in the future. Since repairing (or fixing) the defective circuit in commodity processors is prohibitively expensive, recovery from hardware errors primarily consists of error mitigation in software or hardware. Recovery from transient errors consists of rolling back to the last checkpoint and re-executing the program because the error is unlikely to recur. On the other hand, recovery from permanent errors consists of disabling the defective part of the processor since the error persists in that part forever. Intermittent errors exhibit characteristics between these two extremes (appear non-deterministically at the same location). However, recovery from intermittent errors cannot simply be similar to transient errors (for low intermittent fault rates) or permanent errors (for high intermittent fault rates), as we show later in this chapter. The research question we address is: What is the recovery action that maximizes the performance of an intermittent-error prone processor? Towards answer1A  version of this chapter has been published in the International Conference on Quantitative Evaluation of SysTems in 2012 [40].  41  ing this question, we make the following contributions: (1) Build a model of a chip multiprocessor running a parallel application. The model is built using Stochastic Activity Networks [17], and includes error detection, discrimination (from other types of hardware errors), diagnosis and recovery. We model the entire system because we find that configuration parameters in checkpointing, for example, can affect the choice of the recovery option. 
(2) Simulate the system (which consists of the processor model described in this chapter and fault models described in Chapter 2) to evaluate the performance of a processor after applying different recovery options that include permanent/temporary disablement of cores/micro-architectural units of a processor. Moreover, we study the sensitivity of our results to the models’ parameters in order to understand their effects. Our salient findings are as follows: • The frequency of an intermittent error and the relative importance of the defective part in the processor play an important role in finding the recovery action that maximizes the processor’s performance. Therefore, intermittent faults cannot simply be treated as transient faults or permanent faults without regard to their parameters. • Unlike permanent errors, we find that if a core is exhibiting high intermittent failure rates, then the best recovery action depends on the relative importance of the defective part of the processor. Further, we find that low intermittent failure rates can be tolerated by a program rollback to the last checkpoint. • Contrary to what other researchers have suggested [81], we find that a permanent shutdown of the intermittent error-prone part of the processor results in a slight improvement of the processor’s performance compared to the temporary shutdown of the same part.  4.1  System Description  Our target system is an error tolerant chip multiprocessor. A study of the efficiency of a recovery technique of such a system should not only include a model of the recovery technique, but also a model of all other fault-tolerance techniques at 42  different levels of granularity. This is because these tolerance techniques include design parameters that may affect the choice of the appropriate recovery action. Our model represents the actions of fault detection, discrimination, diagnosis and recovery. We start this section by listing our definitions and assumptions (Section 4.1.1). Then we present an overview of the fault-tolerance techniques that are common to all recovery scenarios (Section 4.1.2). After that we detail the different recovery models that we consider: (1) rollback-only recovery: where no reconfiguration action is conducted. A simple rollback to the last checkpoint is applied upon each error detection. (2) Core-level reconfiguration: where the defective core is disabled upon error detection and discrimination. (3) Unit-level reconfiguration: where a fine-grained reconfiguration at the level of the defective micro-architectural unit is conducted upon error detection, discrimination and diagnosis. Finally, we describe our Stochastic Activity Network (SAN) models in Section 4.1.4 (we use SANs to build the model of the system). SANs have been used since mid 1980s for performance and dependability evaluations. They are stochastic extensions to Petri nets [66]. We choose a SAN model because of its ability to represent both fault effects and performance overhead in the same model.  4.1.1  Definitions and Assumptions  To evaluate the performance of a processor under different recovery scenarios, we use the term useful work [77] and availability. Useful work is the fraction of effective work a processor accomplishes in a certain time duration. It does not include work repeated because of a failure that occurs before saving work to a checkpoint, nor does it include the work done while saving checkpoints, recovering from errors and diagnosing and reconfiguring the processor. 
More details about how we measure the useful work are available in Section 4.2. Availability is the fraction of time the system is functioning and ready for access, whether it is doing useful or redundant work. A system running a program or storing a checkpoint is considered to be available. We make the following assumptions in the model: • Errors covered are due to intermittent hardware faults only. However, we model a fault discrimination technique to distinguish these failures. 43  • Intermittent failures can occur during checkpoint recording and during processor reconfiguration. • Intermittent failures cannot occur during program rollback to the last checkpoint. They also cannot occur during error diagnosis process. This assumption is reasonable because most diagnosis techniques can be performed using other healthy cores or specialized hardware [62, 64]. We discuss the impact of these assumptions on the applicability of our model to real processors in Section 4.4.  4.1.2  Overview of an Error-Tolerant Core  We use state transition diagrams to describe system evolution. Figure 4.1 describes the lifecycle of a core. Note that this diagram is shown for exposition only, and the actual model is built using SANs. This part of the model is common to all three recovery scenarios. In this system, a multithreaded program runs in an on-chip multiprocessor (Box 1). A coordinated checkpoint is recorded periodically (Box 2). All cores use coordinated checkpointing, i.e., a checkpoint is taken for all cores at the same time [77]. The details of the coordinated checkpointing implementation are out of this chapter’s scope. Coordinated checkpointing implies that all cores on a chip are running one parallel program. If a core fails, the program will be restored to the previous checkpoint (hence, all cores perform recovery) and any work that has not been checkpointed is lost. This model is representative of modern workloads where massively parallel programs utilize all cores concurrently. However, our model is valid even if the application is not parallel, as long as the operating system performs coordinated checkpointing. While the system is running (either executing a program or storing a checkpoint), an intermittent fault can occur in any unit of the processor in any of the cores. If the intermittent fault is activated, i.e., if this fault manifests itself to the program, it can cause the following outcomes (Chapter 3): (1) benign, which means that the program affected by this error will continue running and generating correct results, (2) silent-data corruption, which means that the program will continue running but its output is corrupted by the error. At a later stage this error might be 44  Figure 4.1: An overview of an error-tolerant core with rollback-only recovery. detected by software/hardware detector (Box 3), or (3) the fault activates a hardware trap and the program crashes (e.g., program accesses memory using invalid address) (Box 3). In our model, if an error occurs, is not a benign error and is detected then there is a transition from Box 1 or 2 to Box 3. Otherwise, the program can continue running as usual. When an error is detected or a program crashes then the program rolls back to the last checkpoint to recover from the effects of the error (Box 4).  4.1.3  Recovery Scenarios  We now describe different scenarios for intermittent-error recovery. As mentioned in the previous section, all the models below share the same mechanisms for error detection and checkpointing. 
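Independently of the scenario, the two metrics defined in Section 4.1.1 (useful work and availability) are accumulated in the same way. The sketch below is a minimal illustration of that accounting under the bookkeeping fields named in the comments; in the actual model these quantities are tracked by the SAN's output gates and reward variables rather than by explicit code.

```c
/* Minimal accounting of useful work and availability (Section 4.1.1).
 * All names are illustrative; times are in seconds over one observation window. */
typedef struct {
    double run_time;        /* time spent executing the program, including
                               work that is later redone after a rollback    */
    double checkpoint_time; /* time spent storing coordinated checkpoints    */
    double recovery_time;   /* rollback, diagnosis and reconfiguration time  */
    double lost_work;       /* part of run_time redone because it was not
                               saved in a checkpoint before a failure        */
} tally_t;

static double useful_work(const tally_t *t)
{
    double total = t->run_time + t->checkpoint_time + t->recovery_time;
    return (t->run_time - t->lost_work) / total;
}

static double availability(const tally_t *t)
{
    /* Running the program or storing a checkpoint both count as available. */
    double total = t->run_time + t->checkpoint_time + t->recovery_time;
    return (t->run_time + t->checkpoint_time) / total;
}
```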
Rollback-only recovery: In this basic model (Figure 4.1), no recovery technique is applied other than rolling the program to a checkpoint. Therefore, no error discrimination technique is required. Upon error detection, the system is recovered to the last stored checkpoint (Box 4) and then the system continues running as usual (Box 1). Core-level reconfiguration: In this model (Figure 4.2), upon error detection a fault-discrimination technique is applied. We use the Alpha count-and-threshold error discrimination mechanism proposed by Bondavalli et al. [11] for fault discrimination (Box 5). The recovery action for this particular scenario is to disable  45  Figure 4.2: An overview of an error-tolerant core with core reconfiguration. the core (Box 6). A diagnosis technique is not required in this model since each core has its own error-distinguishing technique which identifies the intermittenterror prone core. The duration of the core disablement is either permanent [53], represented by an infinite loop between Boxes 7 and 8, or temporary [81] where a core enters the same loop for a short period of time. Unit-level reconfiguration: In this model (Figure 4.3), upon error detection, an error-discrimination technique similar to the previous scenario is applied. When an error is identified as intermittent, a fine-grained diagnosis technique is used to isolate the error-prone micro-architectural unit (Box 6). Once the defective unit is identified, a decision is made as to whether the unit must be disabled. If a core can operate without this unit (e.g., there exists a replica of this unit, or the program can be detoured to avoid using it [45]) then the unit is disabled and the core will continue running programs possibly with degraded throughput. In this case, the error discrimination variable for this core is reset. Similar to core-level reconfiguration, the duration of the unit disablement is either permanent, represented by an infinite loop between Boxes 8 and 9, or temporary where a unit enters the same loop for a short period of time. Otherwise, the unit is not disabled and the core continues running the program despite the defective unit.  46  Figure 4.3: An overview of an error-tolerant core with fine-grained reconfiguration.  4.1.4  SAN Models  We build the models discussed in the previous section using Stochastic Activity Networks (SAN) [66]. SANs are a convenient, high-level, graphic abstraction for modeling stochastic systems. We use a single SAN in all our experiments. Different fault models and recovery scenarios are represented by changing parameters in this model. Chip multiprocessor with unit-level reconfiguration and base fault model Our model of a chip multiprocessor is depicted in Figure 4.4. When an application starts execution, it enters the Run place. As the execution progresses, a checkpoint is taken periodically whenever ChkptFreq timed activity fires and then the program is transferred to the Checkpoint place where it stays there for a ChkptDuration duration. At every checkpoint the useful work is incremented at OG8 output gate. The increment depends on how many cores and units are functioning in the processor at the checkpoint time. At the same time, a token in the Error place models the errors that might occur because of the base fault model (fault arrivals follow a Poisson process with a nominal MTTF). When the ErrorFreq timed activity fires, one of two possible outcomes 47  Figure 4.4: Our SAN model for a fault-tolerant core. (or cases) can take place. 
The top case models the probability of activated errors that either cause crashes (hardware trap or OS exception) or are detected through a software/hardware detector. The bottom case models: (1) inactivated faults, (2) activated faults that do not change the state of the program (benign faults) or (3) activated faults that change the state of the program but are not detected. When the top case executes, a token is moved from the Error place to the Recovery place. Both input gates IG and IG2 ensure that whether the application (represented by a token) is in the Run or in the Checkpoint place, it will be transferred immediately to the Recovery place when ErrorFreq fires. Therefore, there are two tokens in the Recovery place at this point, one token from the Error place and another from either Run or Checkpoint place. The alpha extended place is an array of alpha variables such that each core in the processor has its own alpha variable. A core’s alpha is updated both on every checkpoint and on every crash/error detection to model the intermittent error  48  discrimination mechanism in that core. One of the two tokens in the Recovery place can take one of the following directions: (1) If the error is not identified as intermittent, then input gate IG4 ensures that a token from the Recovery place is forwarded to RecDuration1 timed activity which represents the time needed for an application to roll back to a checkpoint. The second token in Recovery place is not needed and will therefore be deleted. Then the output gate OG2 places a token in the Run place which resembles the application resuming execution from the last checkpoint and losing all the work that has not been checkpointed. (2) If the error is identified as intermittent, then a rollback to the last checkpoint is conducted through input gate IG3 which ensures that a token from the Recovery place is forwarded to RecDuration2 timed activity and then forwarded to the Diagnosis place. The second token in Recovery place is not needed and will therefore be deleted. The timed activity RecDuration2 represents the time needed for an application to roll back to a checkpoint. The DiagDuration timed activity represents the time overhead of a diagnosis technique. Once this activity fires it can have one of two possible outcomes: (1) Top case: the defective unit is diagnosed accurately. If the unit disabled is temporary, then the disabled unit is removed from the set of available resources. In addition, a token is created in TemporaryRemoval place. This token stays there until RemovalDuration timed activity fires. This activity represents the duration of the disablement (either temporary disablement or permanent one, in which case the duration will be infinite). Once the timed activity RemovalDuration fires, the unit is put back into the pool of available resources. In addition, a token is put back in Error1 place (Error1 usage is explained in exponential and Weibull fault models) indicating that an error can happen in the future depending on the error distribution. The time required by the fine-reconfiguration technique is not modeled since this recovery action is invoked infrequently. Last, the alpha variable that corresponds to the reconfigured core is cleared upon diagnosis. (2) Bottom case: the diagnosis process identifies a healthy unit as defective (inaccurate diagnosis). Therefore, a healthy unit is disabled although the intermittent error will still be active. 
In this case, the disabled unit is removed from the set of available resources until another diagnosis action is conducted. In addition, 49  a token is created in RemovalDuration similar to the top case, but a token is put in the Error1 place immediately. In both cases, once the DiagDuration timed activity fires, a token is created in the Run place (by input gates OG and OG1), which resembles an application that resumes execution after a diagnosis and a potential reconfiguration. Further, a token is put back in Error which models the base fault model. Rollback-only model:  To build this model, input gate IG3 is always disabled.  Therefore, no fault discrimination, diagnosis or reconfiguration action is performed in this model. Core-level reconfiguration model:  To build this model, we do not need a diag-  nosis action since the array of alpha variables serves as a fault discrimination and diagnosis technique (remember that each core has its own alpha entry in the array). Therefore, the timed activity DiagDuration has a zero duration. Moreover, the bottom case in the same activity is always disabled. Exponential and Weibull fault models:  Exponential and Weibull fault models are  built on the top of the model described earlier in this section. The exponential fault model has fault arrivals that follow Poisson process (Figure 4.4 (b)). When the fault is inactive, a token is stored in place ErrorExpInactive, the input gate IG1 ensures that the fault will not be activated if the defective core/unit is disabled. The timed activity ErrorExpFreq fires based on the corresponding MTTF, after which the output gate OG4 creates a token in Error1 place which represents the existence of a defective core/unit. In addition, a token is transferred from place ErrorExpInactive to place ErrorExpActive. This token remains in the ErrorExpActive for the activity duration of the fault which is enforced by timed activity ErrorExpDuration. Once this duration has passed, the output gate OG5 deletes the token in Error1 place and the token from ErrorExpActive place is transferred back to place ErrorExpInactive. The Weibull fault model is very similar to exponential fault model (Figure 4.4 (c)). The main difference between the two models is that the activation times of  50  Figure 4.5: Fault model used in studying error recovery. intermittent faults in Weibull fault model are distributed according to Weibull distribution in timed activity ErrorWeibFreq.  4.2  Experiment Setup  In this section, we describe the intermittent faults that we cover in our simulations. Then we list the parameters we use in our models. Finally, we layout the research questions that we address in the results section. Fault model  Since our simulations are at the system level with little details about  processor components and workloads, we are able to model multiple fault bursts. The faults we model may last for hours and days (the intermittent fault part that we cover is circled and in red in Figure 4.5). Simulations We use the Mobius modeling framework to construct and simulate our SAN models [17]. We evaluate the useful work and availability of the model described in the previous section after 48 hours from the intermittent fault occurrence using Mobius simulations with a confidence level of 95%. The parameters we use are shown in Table 4.1. 
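Before listing the parameters (Table 4.1 below), we note that the only difference between the exponential and Weibull burst models is the distribution of the fault activation times. The sketch below shows one way such inter-arrival times can be sampled outside of Mobius; it is an illustration only, the function names are ours, and the shape parameter k is a placeholder rather than a value used in our experiments.

```c
#include <math.h>
#include <stdlib.h>

/* Inverse-transform sampling of fault activation times for the two burst
 * models. Illustrative only: the actual model uses Mobius timed activities
 * (ErrorExpFreq, ErrorWeibFreq). */

static double uniform01(void)
{
    return (rand() + 1.0) / ((double)RAND_MAX + 2.0);   /* in (0, 1) */
}

/* Exponential inter-arrival time with the given mean time to failure. */
static double next_arrival_exponential(double mttf)
{
    return -mttf * log(uniform01());
}

/* Weibull inter-arrival time with shape k; the scale is chosen so that the
 * mean equals the given MTTF (mean = scale * Gamma(1 + 1/k)). */
static double next_arrival_weibull(double mttf, double k)
{
    double scale = mttf / tgamma(1.0 + 1.0 / k);
    return scale * pow(-log(uniform01()), 1.0 / k);
}
```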
Table 4.1: SAN model parameters.
Parameter | Value/Range (CPU time) | Comments
Base fault rate | Exponentially distributed. MTTF is 6.56 years | Found by Nightingale et al. [48].
Exponential fault model | Exponentially distributed. MTTF is 2 hours | -
Weibull fault model | Weibull distribution. MTTF is 1-40 hours | We study the sensitivity of our results to this parameter.
Fault rate during bursts | Exponentially distributed. MTTF is 1 second | -
Diagnosis duration | 2 sec | Conservative duration. DIEBA's overhead is tens of milliseconds.
Recovery duration | 0-60 sec | Conservative duration. Wang et al. reported 10 min [77].
Component rank | 0%-35% | We study the sensitivity of our results to this parameter.
Mean time to checkpoint (MTTC) | 5-60 min | Found experimentally by Plank et al. [58].
Checkpoint duration | 30 sec | Same as previous.
Probability that fault is activated | 0.75 | Found using fault injections (Chapter 3).
Probability that error is detected | 0.70 | Conservative number for detection coverage [55].
Probability of crash/benign run/SDC | 0.53/0.33/0.14 | Found using fault injections (Chapter 3).

Research questions: In our experiments, we focus on finding answers to the following questions:
(1) When should we recover from an intermittent error by a simple program rollback to a checkpoint, and when should we disable the defective component? Typically, to recover from a hardware error, the first step is to identify the type of the error (permanent, transient or intermittent). In our work, we show that to recover from an intermittent error we also need to identify the error rate and the importance of the defective part. An error that occurs frequently in a critical processor component (for example) might be tolerated by a simple program rollback. On the other hand, an error that occurs frequently in a part that has a replica should be mitigated by disabling that faulty part.
(2) For the errors that are tolerated by disabling the defective component, should the disablement be permanent or temporary? Another interesting error-mitigation approach that targets intermittent-error-prone processors is to temporarily disable the defective parts of the processor, to relieve the defective part from the stressful operating conditions [81]. For example, if the processor is experiencing an NBTI-related failure, pausing or migrating any workload that is currently running on the processor might reduce the processor's temperature and hence the processor can "partially"² self-recover from the failure. However, it is unclear which types of error should be tolerated by temporarily disabling the defective components.
(3) What is the granularity of the disabled component that maximizes the processor's performance? Traditionally, a processor that is diagnosed with a hardware error would be replaced with another hot or cold processor (hot/cold swapping) [53, 81]. Another recovery approach is to reconfigure the defective processor and facilitate "graceful degradation" where only the defective core or micro-architectural structure is shut down [26, 45, 59, 64, 67]. In the latter approach, the remaining operating parts of the processor can be used after the shutdown of the defective part, so a faulty processor continues to be used with degraded performance. We show that the granularity of the error location (in addition to the error severity) is an important factor in choosing the right recovery action.

² See Section 2.1 for more details.
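Answering question (1) presupposes that an intermittent error can be told apart from a transient one at run time. The sketch below illustrates a count-and-threshold heuristic in the spirit of the alpha-count mechanism of Bondavalli et al. [11] used in our model (Section 4.1.3); the decay factor and threshold values are illustrative assumptions, not the settings used in our experiments.

```c
/* Count-and-threshold discrimination in the spirit of alpha-count [11].
 * One counter per core; K and ALPHA_T are illustrative values. */
#define NUM_CORES 4
#define K       0.5   /* decay applied on every error-free checkpoint         */
#define ALPHA_T 3.0   /* score above which the error is treated as intermittent */

static double alpha[NUM_CORES];

/* Called when a checkpoint completes with no error detected on this core. */
void on_clean_checkpoint(int core) { alpha[core] *= K; }

/* Called on every detected error or crash attributed to this core.
 * Returns 1 if the error should be treated as intermittent (trigger
 * diagnosis and reconfiguration), 0 if a simple rollback suffices. */
int on_error(int core)
{
    alpha[core] += 1.0;
    return alpha[core] >= ALPHA_T;
}

/* After a reconfiguration, the core's counter is cleared (Section 4.1.4). */
void on_reconfigure(int core) { alpha[core] = 0.0; }
```

In the SAN model, the same bookkeeping is performed by the alpha extended place, with one entry per core.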
To evaluate the relative importance of the disabled part of the processor, we propose a new metric that we call component rank. We use the term component to refer to a micro-architectural unit or a core in a processor. Component rank is the maximum percentage of processor useful work that is lost when the corresponding component is disabled upon error recovery. 2 See  Section 2.1 for more details.  53  To determine a component rank for a micro-architectural unit or a core, one should consider the following factors: 1. Components are assigned a rank of 0% if they (a) have a replication component which is rarely used or (b) can be replaced or repaired using low-cost techniques. 2. Components are assigned a rank of 100% if they are critical for the workload executed by the processor such that the workload cannot be executed without the corresponding component. 3. The rest of the components are assigned ranks that depend on how much useful work will be lost if the component is disabled. To estimate the useful work lost, one can use Amdahl’s law3 , which states that the speedup (or slowdown) of a program depends on how much of this program is improved (or downgraded).  4.3  Results  In our first set of experiments, we consider two recovery scenarios (rollback-only, unit-level reconfiguration with both temporary permanent disablement) for the three fault models described in Section 2.2.2. In this experiment, we set the MTTC to 20 minutes, recovery duration to 30 seconds, MTTF for exponential fault model and Weibull fault model to 2 hours, component rank to 5% and the diagnosis duration to 2 seconds. We assume that the recovery technique can disable the defective component upon successful diagnosis. We plot the useful work in Figure 4.6 and availability in Figure 4.7. In the following discussion, we explain the results obtained from this figure in the form of answers to the first two research questions described in Section 4.2. (1) When should we recover from an intermittent error by a simple program rollback to a checkpoint and when should we disable the defective component? 3 Quantitatively, Amdahl’s law states that if a part  overall speed up is 1/((1 − p) + p/s).  54  p of a program is sped up by s then the program  Figure 4.6: Useful work for the different recovery scenarios using three fault models.  Figure 4.7: Availability for the different recovery scenarios using three fault models. The rollback-only recovery performs well for the base fault model in which there are no bursts of extreme failure rates. The lost useful work (about 4%) is due to time spent saving checkpoints. However, rollback-only recovery performs poorly for exponential fault model and Weibull fault model in which there are bursts of extreme failure rates. The useful work drops to about 63% for both fault models (on average) while the availability is 85% and 82% for the exponential and Weibull fault models, respectively. On the other hand, temporary or perma-  55  Figure 4.8: Useful work for different ranks when permanently reconfiguring a defective component. nent reconfiguration of the defective component produces more useful work than rollback-only recovery by 25% for exponential fault model and 30% for Weibull fault model. (2) For errors that are tolerated by disabling the defective component, should the disablement be permanent or temporary? Temporary and permanent reconfigurations have close gains in terms of useful work and availability. 
However, permanent reconfiguration of the defective component achieves 2% more useful work than temporary reconfiguration. The availability is 99.9% for temporary reconfiguration and 100% for permanent one. The loss in useful work for the temporary reconfiguration scenario is the result of the overhead of the error distinguishing and restore-to-last-checkpoint mechanisms that are continuously encountered for each error burst. On the other hand, this overhead is encountered only once by the permanent reconfiguration scenario. In the next experiment, we evaluate the impact of the relative importance of the disabled component on recovery by varying the rank of the defective component (see Section 4.2 for more details about component rank). A low component rank may represent a micro-architectural unit, while a high component rank may represent a processor core or a critical micro-architectural unit. We use the permanent reconfiguration recovery scenario since it generates the highest percentage of useful work in the previous experiment. The rest of the 56  experimental setup is similar to the previous experiment. By analyzing the useful work generated by a processor after permanently disabling a defective component with different ranks, we answer the following research question: (3) What is the granularity of the disabled component that maximizes the processor’s performance? The smaller the granularity of the disabled component, the more useful work is achieved after processor reconfiguration. This can be done by disabling the defective micro-architectural unit rather than disabling the entire defective core. For this particular experiment, we find that disabling a component with a rank of 35% or more in the permanent reconfiguration recovery scenario has similar effects on useful work to the rollback-only recovery technique in which the defective component is used together with functioning components (Figure 4.8). We illustrate this result with some hypothetical examples. Example 1: A 4-core processor is used to run floating point-intensive application. This processor has one floating point unit (FPU). Assume that an intermittent error (that follows Weibull fault model) affects the FPU. The FPU has a rank of 100% as the processor is deemed unusable upon FPU disablement. Therefore, a rollback-only recovery scenario is the most beneficial recovery option. Example 2: A 4-core processor in which each core has two load-store units (LSUs) is running a highly parallel memory-intensive program that is using all the 8 LSUs for 60% of the time. An intermittent error that follows Weibull fault model with an MTTF of 2 hours affects one of the LSUs. According to Amdahl’s law, the useful work will be degraded by 8% or 1 − (1/(0.40 + (0.60/0.88))) upon the defective LSU disablement. Hence, the component rank in this case is 7%. Based on Figure 4.8, since 7% is less than the threshold of 35% for this intermittent fault rate, the best recovery option for this case is a permanent reconfiguration. Example 3: Assume that the same processor described in the previous example is used to run a memory-intensive application that uses 4 LSUs for 60% of the time. Moreover, suppose that the processor experiences an intermittent error that affects one of its LSUs. It is highly unlikely that disabling unit will cause any loss of the useful work. Therefore, the component rank in this case is 0% and a permanent reconfiguration recovery scenario is recommended. 
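The arithmetic behind Examples 2 and 3 follows directly from Amdahl's law as stated in the footnote in Section 4.2. The helper below reproduces it; treating the slowdown of the affected fraction as (remaining units / total units), and assigning a rank of 0% when a spare unit can absorb the loss, are our simplifying assumptions for these examples.

```c
#include <stdio.h>

/* Component rank estimate via Amdahl's law: overall speedup = 1/((1-p) + p/s).
 * p is the fraction of time the affected units are used; s < 1 is the slowdown
 * of that fraction when one unit is disabled, approximated here as
 * (remaining units / total units). */
static double rank_after_disable(double p, int total_units, int busy_units)
{
    if (busy_units <= total_units - 1)   /* a spare unit absorbs the loss    */
        return 0.0;                      /* Example 3: rank is 0%            */
    double s = (double)(total_units - 1) / total_units;
    double speedup = 1.0 / ((1.0 - p) + p / s);
    return 1.0 - speedup;                /* fraction of useful work lost     */
}

int main(void)
{
    /* Example 2: 8 LSUs busy 60% of the time, one disabled -> about 8%. */
    printf("Example 2 rank: %.1f%%\n", 100.0 * rank_after_disable(0.60, 8, 8));
    /* Example 3: only 4 of the 8 LSUs are needed -> 0%. */
    printf("Example 3 rank: %.1f%%\n", 100.0 * rank_after_disable(0.60, 8, 4));
    return 0;
}
```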
Figure 4.9: Useful work for the rollback-only recovery using Weibull fault model and variable MTTF.

Figure 4.10: Availability for the rollback-only recovery using Weibull fault model and variable MTTF.

4.3.1 Sensitivity Analysis

In this section, we study the sensitivity of our results to the MTTF, the MTTC and the recovery duration for the rollback-only recovery. Unless otherwise mentioned, the experimental settings are as follows: the Weibull fault model is used with an MTTF of 2 hours, the MTTC is 20 minutes and the recovery duration is 30 seconds.

Mean Time To Failure: In this analysis, we evaluate the useful work and the availability of the rollback-only recovery. The MTTF in this experiment is varied from 1 to 40 hours (Figures 4.9 and 4.10). Intuitively, the more frequent the error, the less useful work is accomplished and the less available the system is. This is due to the time needed by the recovery and the loss of computations that are not checkpointed. In this experiment, we find that if the fault occurs more than about once a day, then the best recovery action is to reconfigure the core. Otherwise, rollback-only is the most efficient recovery choice. As a general rule, if the intermittent error is frequent enough that its effect on the useful work outweighs the rank of the defective component, then the component should be disabled. Otherwise, the system can continue using the checkpointing strategy to recover from the infrequent failures.

Figure 4.11: Useful work for the rollback-only recovery using Weibull fault model and variable MTTC.

Figure 4.12: Availability for the rollback-only recovery using Weibull fault model and variable MTTC.

Mean Time To Checkpoint: We now plot the useful work and the availability of the rollback-only recovery with a variable mean time to checkpoint (Figures 4.11 and 4.12). Although storing a checkpoint only every hour reduces the overhead of checkpointing, it degrades the accomplished useful work, because more work is lost when a failure occurs. Moreover, although reducing the MTTC decreases the overhead of each checkpoint due to the smaller amount of data being stored, there is a fixed overhead for each checkpoint that results from the context switches between the running program and the checkpoint mechanism. In our experiments, we rely on previous work (Libckpt [58]) to choose an MTTC of 20 minutes, which serves as a balance between the checkpoint frequency and the overhead associated with the mechanism. Therefore, if a processor is encountering frequent bursts of intermittent errors and reconfiguration is not possible, then one should reduce the mean time to checkpoint to increase the processor's useful work. Availability is almost the same for all MTTC values, because we consider the system to be available if it is running a program or storing a checkpoint.

Figure 4.13: Useful work for the rollback-only recovery using Weibull fault model and variable recovery duration.

Figure 4.14: Availability for the rollback-only recovery using Weibull fault model and variable recovery duration.

Recovery duration: In this experiment, we evaluate the rollback-only model with a variable recovery duration. We measure the useful work and the availability of the processor (Figures 4.13 and 4.14). We find that the useful work is only marginally affected by the recovery duration, while the availability degrades with a larger recovery duration. This is because the processor cannot run any programs while it is rolling back to the last checkpoint.
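The qualitative trends in this sensitivity analysis can be reproduced with a model far simpler than our simulator. The sketch below is a toy Monte Carlo of the rollback-only scenario only; the Weibull shape, checkpoint cost and other parameter values are illustrative assumptions rather than the settings used in our experiments, and error bursts are not modeled, so the absolute numbers will differ from the figures above. Each failure is charged the uncheckpointed work plus the recovery time, and each checkpoint is charged a fixed cost:

import random
from math import gamma

def rollback_only_useful_work(total_h=10_000, mttf_h=2.0, shape=0.7,
                              mttc_min=20.0, ckpt_s=45.0, recovery_s=30.0):
    """Toy estimate of useful work under rollback-only recovery."""
    total_s = total_h * 3600.0
    mttc_s = mttc_min * 60.0
    # Choose the Weibull scale so the mean inter-failure time equals the MTTF.
    scale = (mttf_h * 3600.0) / gamma(1.0 + 1.0 / shape)
    lost_s = 0.0
    t = random.weibullvariate(scale, shape)
    while t < total_s:
        # Work since the last checkpoint is discarded, then we pay the rollback.
        lost_s += (t % mttc_s) + recovery_s
        t += random.weibullvariate(scale, shape)
    checkpoint_cost = (total_s // mttc_s) * ckpt_s
    return (total_s - lost_s - checkpoint_cost) / total_s

print(f"useful work: {rollback_only_useful_work():.1%}")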
4.4  Discussion  To shed some light on the applicability of our model to real processors, we discuss the assumptions we made in our model and the impact of these assumptions on the results we obtain from our study. The assumptions are as follows: (1) The model does not take into account correlated faults, or faults that appear in units that form a “hot spot” in the processor, for example. Note that hot spots that consist of multiple spatially close micro-architectural units usually occur much faster than chip-wide heating [72]. There are multiple factors that result in an intermittent fault triggering diagnosis and recovery techniques: (a) the error has to occur despite the power and temperature management techniques (e.g. HotSpot [72]), (b) the error must manifest itself to the program and bypass the architectural masking, such as two timing errors cancelling each other, (c) the error must propagate to  61  the program, cause erroneous behavior and bypass the instruction-level masking, (d) the error must happen frequently enough to be distinguished as intermittent, and (e) the error must happen frequently enough that it triggers a reconfiguration technique. In our analysis, we model factors b, c, d and e, but we do not model factor a (as we assume that only one micro-architectural unit is affected by an error). If multiple units expose an intermittent error to the program, then this means that more advanced error discrimination and diagnosis techniques should be applied in our model. These techniques may impose more performance overhead and degrade the useful work before reconfiguring the core. Moreover, disabling two or more micro-architectural units may result in the core being unable to function properly so core shutdown may be more useful than unit shutdown in this case. (2) We do not model errors that occur during program rollback to the last checkpoint. Errors that occur during this stage can be detected through error detectors; when an error is detected the program rollback can start over (assuming that the error will eventually stop and the program will not enter into endless loop of error detection and restarting). The effect of this process will be a longer rollback. (3) We model the effects of inaccurate error diagnosis/reconfiguration by modeling the inaccuracies in diagnosis results (this can represent imperfect diagnosis technique or errors that affect a diagnosis/reconfiguration technique). In our model, we assume an error will likely recur after an incorrect diagnosis or reconfiguration. Therefore, the disabled unit will be re-enabled and error tolerance techniques will be applied again.  4.5  Conclusions  In this chapter, we investigate the design of optimal recovery actions for intermittent errors. We model a fault tolerant chip multiprocessor that experiences intermittent hardware faults during its lifetime. We use intermittent fault models at the system level from Chapter 2. We evaluate the chip performance for different recovery scenarios under different intermittent fault models and rates. Although it suffices to know the error type to choose the recovery action for transient or permanent hardware faults, we find that the failure rate and error location are important factors in choosing the appropriate recovery action for intermittent errors. We also  62  show that reconfiguration around components with small ranks (such as microarchitectural units) grant high performance and availability. 
In the next chapter, we propose error diagnosis technique at the micro-architecture level to facilitate fine-grained reconfiguration around the fault-prone micro-architectural units.  63  Chapter 5  DIEBA: Diagnosing Intermittent Errors by Backtracing Application Failures In the previous chapter, we have shown that to maximize a processor’s throughput, one must choose a recovery strategy based on the relative importance of the microarchitectural component that experienced the error. Further, many methods for hardware reconfiguration have been proposed at the level of micro-architectural units or pipeline stages [45] [59] [13]. Therefore, accurate diagnosis of intermittent errors to the granularity of micro-architectural components is vital for efficient recovery. Transient errors are one-time events, and are unlikely to be encountered again in the same location. Therefore, no diagnosis is required for these errors. Permanent errors, on the other hand, persist in the same location indefinitely, and can be diagnosed by running tests. Intermittent errors lie between the two extremes of transient and permanent errors. Intermittent errors (unlike permanent errors) are non-deterministic and may not appear during testing. This makes them challenging to diagnose. Diagnosis may be performed in software or in hardware. The main advantage of hardware-based techniques is that they can be applied with relatively little effort as they are hidden from the software. However, hardware-based techniques have 64  the following shortcomings. • Incur considerable performance or power overheads even during fault-free operation. This is because hardware techniques require the processor to be continuously monitored or tested as they cannot know apriori when an intermitent fault will occur. For example, periodic testing techniques [16] can incur about 30% performance overhead for running tests. • May initiate recovery actions for faults that do not impact the application. However, faults may affect hardware locations that are not used by software and hence the fault will not affect the program’s state. Therefore, recovery from these faults result in unnecessary performance overhead. • They are often coupled with the design of the processor, and can be difficult to upgrade or port to different processors. Therefore, we choose to focus on software techniques for diagnosing intermittent hardware errors. The research question we address in this work is: “Can one develop softwarebased diagnosis methods for intermittent errors to identify the faulty part in a processor?”. To answer this question, we do the following: • Introduce a software-based diagnosis technique, DIEBA1 , which starts from the failure dump of a program (due to an intermittent error) and identifies the faulty micro-architectural unit. DIEBA requires no hardware support, and can be executed on a different core than the one that experienced the error (thus diagnosis does not pause the core’s execution after the failure) (Section 5.2). • Evaluate the effectiveness of DIEBA in diagnosing intermittent errors through fault-injection experiments. We find that DIEBA can accurately diagnose 70% of errors in three functional units of the processor. These units comprise 57.1% of the total area of OpenSPARC-T1 core [1] [57] and 51.4% of POWER4-like processor [30]. 1 A version of this chapter has appeared in Silicon Errors in Logic - System Effects Workshop in 2012 [62].  
65  • Estimate the performance overheads of DIEBA based on the behaviors exhibited by applications when they encounter an intermittent fault. We find that DIEBA slows down software by 18%. Our implementation of DIEBA, covers functional units with accuracy of 70%. DIEBA uses information in the application’s binary code to reconstruct its execution prior to the failure, and attempts to isolate the faulty unit based on error propagation paths in the application. DIEBA does not require any information from the hardware (except for standard information available in a failure dump), nor does it need to perform any tests on the faulty core. To the best of our knowledge, DIEBA is the first approach to diagnose intermittent hardware errors without running additional tests or requiring hardware support. DIEBA, being a software-only approach, however, has some limitations. First, it cannot diagnose errors in the processor’s front-end units such as the instruction scheduler. Second, DIEBA can only resolve the error to the type of the functional unit - it cannot further resolve the error to the specific instance of the unit when the processor has redundant units. This limitation may not be significant because studies have speculated that future processors will have no redundant units for reasons of power efficiency [26]. If the processor has replicated units, then DIEBA needs to be complemented with hardware-based techniques to resolve the erroneous unit.  5.1  The Role of Diagnosis in Error Tolerance  Diagnosis finds the defective component in the processor. This should facilitate the following two goals: (1) Find the recovery action that guarantees maximum processor throughput: In the previous chapter (Chapter 4), we have identified two parameters that are needed to determine if processor reconfiguration is necessary to achieve maximum performance: (a) the relative importance of the defective unit and (b) the rate of the intermittent error. Once we diagnose the defective unit, we can identify its relative importance using a pre-computed table. Therefore, diagnosis results are required to determine the right recovery action. (2) Facilitate the reconfiguration technique: Multiple proposals have been introduced as to how to function a core after disabling the defective part of the 66  processor. All of these proposals assume that there is a technique to identify the malfunctioning part of the processor. If the error is permanent, then many hardware testing techniques can be used for diagnosis (e.g., [70] [37]). However, since intermittent errors are non-deterministic, these hardware-based techniques need to be turned on for a long time to diagnose the error. DIEBA, on the other hand, is a software-based approach that can diagnose an error in the processor backend without any hardware support. Different reconfiguration techniques are designed for different types of architectures. Here we summarize the main reconfiguration proposals: (a) Architectures with unit replications or redundant pipeline stages: if the defective part of the core has a replica, then the core can be reconfigured around the defective part using a network of pipeline stages, for example. Under this scenario the core will continue running with potentially degraded performance [26] [30]. Identifying the defective unit or pipeline stage is essential before performing this reconfiguration. In addition, the diagnosis technique should be able to distinguish the defective replica. 
DIEBA in its current implementation cannot do this differentiation because of software's limited visibility into the hardware state. (b) Architectures without unit replications: this reconfiguration scenario can be applied to cores with no unit/pipeline-stage redundancies. Multiple options are available to use a core that is missing a unit or a pipeline stage: (i) the functioning parts of the core can be borrowed by other cores, (ii) program threads that use the defective units can be migrated to other cores [59] [64] or (iii) programs running on this core can be detoured, i.e., the software running on the defective core is altered (using binary translation) so as not to use the defective unit, if possible [45]. For architectures that do not have replications, DIEBA in its current implementation can be used to identify the faulty functional unit, as this information should be sufficient to perform reconfiguration.

5.2 Approach

We start with an overview of DIEBA (Section 5.2.1), describe the construction of the Dynamic Dependence Graph (DDG), which is an essential structure used by DIEBA (Section 5.2.2), and then discuss the design choices made in DIEBA (Section 5.2.3).

Figure 5.1: Overview of our diagnosis technique.

5.2.1 Overview

Figure 5.1 provides an overview of DIEBA. DIEBA is triggered by an intermittent-error detection or a program failure (Figure 5.1, Box 2); the technique uses program information at the time of error detection or failure, along with some additional information, to identify faulty functional units. DIEBA's Inputs (Figure 5.1, Box 3) are: (1) the program binary, (2) the inputs that were provided to the program, including any sources of non-determinism introduced by the operating system, (3) the crash dump file of the program, consisting of the register file contents, the memory dump and the program and instruction counters2 (these are available in standard core-dump files), and (4) the program signature (see Section 5.3.1). DIEBA's Output (Figure 5.1, Box 5) is the functional unit where the intermittent error occurred. In the current version of DIEBA we support arithmetic and logical units (both integer and floating-point), in addition to the load-store unit. For most fine-grained reconfiguration techniques (e.g., Stagenet [26], Core Cannibalization [64]) it is sufficient to resolve the fault to a functional unit/pipeline stage.

2 The instruction counter keeps track of the number of dynamic instructions executed in the program, and is available as a performance counter in x86 architectures.

Technique At a high level, DIEBA attempts to reconstruct the execution of the program prior to the error detection or failure. Because we have found experimentally that most intermittent errors lead to program crashes within a few thousand instructions from the start of an error (Section 5.5), it suffices to reconstruct the last few thousand instructions of the program prior to the failure (the number of instructions is a configurable parameter of the technique). Based on the reconstruction of the program execution, DIEBA can infer the portions of a program's state that were corrupted by the error (by computing the difference of the two states). DIEBA then backtraces the corrupted data by following the program's dynamic dependence graph (DDG), which is a representation of the dependencies among the program's dynamic instructions (Section 5.2.2), to identify the error-propagation paths in the program. The construction of the DDG is illustrated in Section 5.2.2.
To build a DDG of a program, DIEBA needs two aspects of a program’s execution to be recorded. First, DIEBA needs a log of all sources of non-determinism in the program (including its inputs) in order to faithfully reproduce its behavior. Second, DIEBA needs to log the control flow of the program prior to its failure, because an intermittent error can modify the program’s control flow. This is achieved by maintaining a program signature. Based on these two elements, DIEBA reconstructs the program’s execution on a different (reliable) core to generate what we call a fault-free run (the means of obtaining a reliable core are explained in Section 5.2.3). Finally, DIEBA attempts to identify the faulty functional unit based on data propagation paths that were affected by the error. This is based on mapping instructions to the corresponding functional units in the processor.  5.2.2  Background on Dynamic Dependency Graphs  A dynamic dependency graph (DDG) is a representation of data flow in a program. It is a directed acyclic graph where graph nodes or vertices represent values produced by dynamic instructions during program execution. In effect, each node corresponds to a dynamic instance of a value-producing program instruction. Dependencies among nodes result in edges in the DDG. A code fragment (Table 5.1) and its corresponding DDG (Figure 5.3) are provided as an example. We will use the same example later to illustrate DIEBA’s operation. DDGs were originally proposed by Agrawal and Horgan for understanding and analyzing program behavior [5]. We have used DDGs to model intermittent  69  error propagation in programs in this dissertation (Chapter 3). We found that most intermittent errors result in crashes a few thousands of instructions between the start of the error and the program crash. We also found that only a few hundreds of data values in the program were affected by a crash-causing fault (before it crashes). Therefore, the crash state is only partially corrupted by the error and, hence, one can reconstruct the state of the program from the crash state. Note that no earlier study has used DDGs in error diagnosis.  5.2.3  Design Choices and Assumptions  Design choices DIEBA is executed on a different core than the one that experienced failure. This is because we do not want the diagnosis process to be affected by the intermittent fault that could potentially recur. Further, because the fault is intermittent, the fault-prone core can continue to be used while the diagnosis is taking place on another core. Therefore, DIEBA does not perturb the fault-prone core. We assume the availability of a non-faulty core to run DIEBA. One way to achieve this is to run two cores (dual modular redundancy, DMR), so that an error in one core will be caught by the other, as long as both cores do not experience the same error. This does not incur additional performance and power overheads in the fault-free case. We assume, as Li et al. [37] have done, that only in the infrequent event of a crash or error detection is the DMR mode invoked3 . Assumptions In addition to assuming the availability of non-faulty cores, the main assumptions in DIEBA are: Low error detection latency In designing DIEBA, we assume the presence of error-detection mechanisms that can detect a fault within a few thousands of instructions before it causes a failure. This is important to bound the sizes of the memory-buffers DIEBA uses to collect control-flow signatures of the program’s execution as we will explain later (Section 5.5). 
3 It is also possible to run DIEBA at a completely different location by sending the system state at failure over a network connection.  70  No simultaneous faults We assume that at most one functional unit of the processor is fault-prone during a given execution of the program. This assumption is justified because intermittent faults are relatively infrequent compared to the typical execution time of many applications, and hence are unlikely to affect multiple functional units. Fault locations We only consider faults in the processor’s load-store unit, the integer arithmetic and logic unit (ALU) and the floating point unit (FPU) and not in the units comprising the processor’s front-end such as the fetch and decode units. As a software-based technique, DIEBA does not have enough observability to diagnose errors in such units. Moreover, we assume that all the memory is protected using parity or ECC, and hence does not experience software-visible errors.  5.3  DIEBA Implementation  We start this section by explaining how we reconstruct the program’s execution prior to the failure. This is a crucial step for the DIEBA technique. Next, we discuss the details of DIEBA’s implementation (Section 5.3.2). Then we show an example to illustrate DIEBA’s operation (Section 5.3.3).  5.3.1  Reconstruction of Program Execution  As mentioned earlier, reconstructing the execution of a program prior to its failure imposes two monitoring requirements: first, we need to identify and reproduce faithfully all sources of non-determinism (e.g., user inputs and interrupts). Second, we need to capture the control flow prior to the failure.We address these two aspects below. Non-determinism To enable execution replay of a program, all sources of non-determinism in the program should be recorded by logging the associated events during the original execution (i.e., record phase). These events include inputs, interrupts, messages exchanged with other processors and inter-leavings among threads (for multi71  threaded programs). There has been significant work on performing deterministic replay in multi-processor and multi-core systems to support software debugging and fault-tolerance [19] [82]. Replay techniques may be implemented in software or hardware. Software-based systems such as ReVirt require operating system or virtual machine support [19], while hardware based systems such as Flight Data Recorder (FDR) [82] piggyback on the hardware cache-coherence protocol to replay software running on multiple processors. FDR logs old memory state when memory is updated, records only subset of races between threads and logs interrupt content and the instruction count at which the interrupt occur. As a result, FDR can replay programs deterministically in the full-system environment with low performance overhead of less than 2%. Such low overhead allows FDR to be enabled all the time. DIEBA relies on replay techniques such as FDR to reconstruct program execution and facilitate diagnosis. Control flow Intermittent faults can affect branch or jump instructions in a program, in which case the control flow of the affected program will deviate from the correct one. However, we need to capture the control flow of the program prior to its failure in order to reconstruct its execution after the failure. We therefore instrument the program’s binary to record a log of the last n basic blocks (BBs)4 . This instrumentation is added to the program’s executable, and does not require its source code. 
The idea of program signatures is not new. For example, signatures have been used by Alkhalifa et al. to detect control-flow errors [6]. We use signatures not to detect errors but to replay a program's control flow. We partition a program into a set of BBs, and add a store instruction at the end of each BB to save the BB identifier in a buffer. The BB identifier is a unique identifier for each BB in the program. BB identifiers are stored in a ring buffer in memory. In case of a failure, this buffer is written to the crash dump file. The buffer is circular since only the most recent identifiers are needed for the diagnosis technique. This is because most intermittent faults cause programs to crash soon after they occur, so only the last few thousand instructions and, hence, the last few thousand control-flow instructions (CFIs) will be analyzed. We discuss the performance and memory overheads incurred to store the signature in Section 5.5.3.

4 The parameter n depends on the available memory space. More details are available in Section 5.5.

While the above instrumentation will record the control flow of the program in the fault-free case, it does not work in the presence of faults. This is because the fault may cause the program to elide the store instructions that record BB identifiers, which may lead to an incorrect/incomplete signature. However, in most cases it is possible to reconstruct a program's control flow a posteriori, even with control-flow errors5.

5 We assume that the BB identifiers are protected with error-correcting codes, so that they are not corrupted by the error.

Figure 5.2: An abstract example to show how to record signatures in different erroneous cases.

Control flow in the presence of errors Consider the abstract program in Figure 5.2, where a circle represents a BB, and an arrow represents a transition between two BBs. In an error-free environment (case A), there is a jump from the end of BB #1 to the beginning of BB #2, and the identifiers of both BBs are stored in the buffer using a store instruction at the end of each BB. In the following, we explain how DIEBA keeps track of which BBs have executed for three different erroneous cases (B, C and D):

Case B An erroneous jump occurs to the start of BB #3. The store instruction at the end of BB #3 saves the BB identifier. Therefore, we know that the end of BB #1 is followed by an instruction in BB #3.

Case C An erroneous jump occurs to the middle of BB #3. The store instruction at the end of BB #3 saves the BB identifier, which is an indicator that this BB has executed.

Case D An erroneous jump occurs to the end of BB #3. If the destination address leads to the BB-identifier store instruction in BB #3, then the BB identifier will be saved to the buffer as in the previous cases. However, if the jump leads to the CFI after the store instruction, then the BB identifier will not be saved to the buffer. This case results in missing the faulty jump in BB #1 that led to the end of BB #3. However, missing this BB does not affect the accuracy of our technique since BB #3 does not modify the program state (because the erroneous jump is immediately followed by a CFI).

Thus, we see that in all cases, we can reconstruct the control flow of the program using the signatures. Although we have explored the basic algorithm where each BB is instrumented with a store instruction, more efficient tracing techniques are available.
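As a concrete illustration of the recording scheme, the following sketch models the circular signature buffer in software (a simplification for exposition only; in the actual mechanism the recording is a store instruction appended to every BB, and the buffer size here is just an example value):

class SignatureBuffer:
    """Toy model of the circular basic-block signature buffer."""

    def __init__(self, n=3333):
        self.n = n
        self.slots = [None] * n   # holds the n most recent BB identifiers
        self.count = 0

    def record(self, bb_id):
        # Emulates the store instruction appended to the end of every BB.
        self.slots[self.count % self.n] = bb_id
        self.count += 1

    def recent(self):
        """Recorded control flow, oldest to newest, as read from a crash dump."""
        if self.count <= self.n:
            return self.slots[:self.count]
        start = self.count % self.n
        return self.slots[start:] + self.slots[:start]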
The algorithm attempts to reconstruct the execution of the program based on the information gathered in Section 5.3.1. We call this the fault-free run of the program, as it is executed on a non-faulty core (see Section 5.2.3). DIEBA then compares the state of the faultfree run at the failing instruction with the state in the failing run. All differences in state are marked as strong clues; we know for certain that these elements were corrupted by the error. The algorithm then attempts to trace back from the strong clues using the program’s data dependencies (see Section 5.2.2) to identify the set of all program elements that may have been corrupted by the error. We call these weak clues, as these are based on heuristics, and so we cannot be certain that they were corrupted by the error. Finally, the algorithm maps the corrupted propagation paths to the functional units they used in their execution in order to identify the unit that was likely responsible for the corruption. 74  The detailed algorithm is as follows: (1) Create a fault-free run of the failed program: In this step, we replay the execution of the failed program on a non-faulty core (using the information gathered in Section 5.3.1) as follows: (1) When a non-deterministic data item is read by the program, this data is substituted with the value recorded by the replay tool, and (2) when the program executes a control-flow instruction (CFI), the target address of this instruction is compared with the value recorded in the program’s signature (if an entry exists for the CFI). If the address does not match, then the CFI’s target in the fault-free run is substituted with the address stored in the signature. This makes the fault-free run mimic the control flow of the failing run. In addition, the mismatched CFI is added to the set of strong clues considered by the technique (as it was corrupted by the fault). The fault-free run is terminated when the instruction address and the dynamic count of instructions match that of the failing run, i.e., at the dynamic instruction that crashed the program. (2) Capture the trace of the fault-free run and construct the DDG: The PC, instruction type (e.g., branch, add) and operands are logged to a trace file for each instruction. Note that this recording is done during diagnosis time only, and does not affect the running program in any way. The Dynamic Dependence Graph (DDG) is constructed from the trace file. (3) Find the erroneous registers and memory data in the program: Compare the final register file and the memory state of the fault-free run with the corresponding register file and memory state of the failure run to find the set of mismatched registers and memory locations. These locations are also added to the set of strong clues. For convenience, we use the term strong clue to refer to both the members of the set and to the last dynamic instructions (or nodes) that modify them in the program. (4) Compute backward propagations sets: Use the DDG constructed in Step 2 to find all error propagation paths6 that result in changes to strong clues. This can be done by traversing the predecessors of the instructions that directly affect the erroneous registers, memory locations and CFIs. Such precise backtracing is possible because we use the dynamic trace (Step 2) to construct the DDG, and 6A  propagation path is a set of instructions that have data dependencies among them.  75  hence have complete information about the execution. 
(5) Prune the propagations paths: In the previous step, some of the weak clues found in propagation paths are marked as erroneous just because they have been used as operands in instructions that generated incorrect results, although these operands were not affected by the fault. DIEBA attempts to find these correct operands by checking if they have contributed to some other correct computation(s) in the program, and if so, eliminates them from the corresponding propagation path. This is a heuristic as it is possible that the error is masked by other computations, and hence we may miss some erroneous data. Nonetheless, we find that this heuristic works well in practice. (6) Identify the defective unit: In this step, we identify the functional unit that likely experienced the defect as being faulty, by analyzing the error propagation paths and identifying the functional units used by the instructions in the paths. We consider arithmetic/logic instructions (both integer and floating point) and loadstore instructions. We count the number of paths in which a particular unit is used, by associating a counter with the unit. The unit with the highest usage count across all propagation paths is then labeled as defective. In a few cases, the highest count is associated with more than one unit, we report these cases in the results section. The idea of using a counter to infer the defective unit is similar to that of Bower et al. [13].  5.3.3  Example  We illustrate the operation of the DIEBA algorithm with the example introduced earlier (in Section 5.2.2). Table 5.1 shows the assembly code fragment corresponding to the example. Consider an intermittent fault that affects the Integer ALU, such that three instructions that use the ALU are erroneous; 0x40d628, 0x40d638 and 0x40d648 (nodes 5, 7 and 9). This error causes the jump instruction at 0x40d678 to branch to an invalid code address, which leads to a crash. For this example, we assume that the program is loop-free, and hence there is a one-to-one correspondence between node numbers and addresses in the DDG. Figure 5.3 shows the DDG corresponding to the code fragment. For simplicity, we have numbered the nodes starting from  76  Figure 5.3: The DDG constructed using code fragment in Table 5.1. Light grey nodes represent weak clues, while dark grey nodes represent strong clues. 1. Also, the DDG only includes the instructions that appear in this example. Table 5.1: Code fragment to illustrate the diagnosis algorithm. Node 1 2 3 4 5 6 7 8 9 10 11 12 13  Address 0x40d608 0x40d610 0x40d618 0x40d620 0x40d628 0x40d630 0x40d638 0x40d640 0x40d648 0x40d650 0x40d658 0x40d660 0x40d668 0x40d670 0x40d678  Instruction addiu $gp[28],$gp[28],-23296 lw $s0[16],0($sp[29]) addiu $s1[17],$gp[28],42 addiu $sp[29],$sp[29],-40 addu $s2[18],$zero[0],$v1[3] sw $s2[18],-32396($gp[28]) addiu $a1[5],$t4[12],-24 sw $a0[4],-32362($gp[28]) addu $a2[6],$a1[5],$v1[3] jal 401900 addu $a1[5],$s1[17],$s0[16] sw $a1[5],-32400($gp[28]) addu $a2[6],$a2[6],$s2[18] addiu $sp[29],$sp[29],-40 jr $a2[6]  We illustrate the algorithm step-by-step on the example. In Step 1, the failed program is re-executed on a fault-free processor, and the fault-free run is constructed. The fault-free run is terminated at PC 0x40d678 as the program crashes at this instruction. For simplicity, we assume that there are no faulty branches or jumps in the replay. 77  During the re-execution, a trace file is gathered and its DDG is constructed (Step 2). 
The registers contents and the memory state of the fault-free run is compared with that of the failing run (Step 3). The technique finds that register #18 is erroneous, because it is updated at PC 0x40d628, which is directly affected by the error. Moreover, register #6 is erroneous because it is also affected directly by the error at PC 0x40d648. Therefore, the set of erroneous registers for this example are #18 and #6. Although register #5, on the other hand, is directly affected by the error, it cannot be found by DIEBA at crash time because it is overwritten with correct data in instruction 0x40d658. By comparing the memory states of the failure run and the fault-free run, the technique finds that the data that has been stored in memory address -32396($gp[28]) is erroneous (the error propagates to -32396($gp[28]) through register #18 which was affected directly by the error at PC 0x40d628). Thus, the strong clues identified in this example constitute registers #18 and #6 and memory address -32396 ($gp[28]). These correspond to the DDG nodes: 5, 12 and 6, in dark grey Figure 5.3. In Step 4, the technique traverses the predecessors of the strong clues (i.e., nodes 5, 12 and 6) and backtraces them until the start of the code fragment (the backtracing window is limited by a few thousands of instructions before the failure location or the start of the program, whichever comes first). The data produced by these instructions are the weak clues. The traversed nodes appear in light grey colour in Figure 5.3. Table 5.2 shows the erroneous propagation paths identified by the technique. For example, nodes 5 and 9 are used to compute node 12. There are two propagation paths corresponding to this node (rows 2 and 3 of the table). Node 5 is considered a path on its own because it is where register #18, which is a strong clue, was last modified. Note that immediate values are not represented by nodes in DDG, and hence do not appear in the table. Due to the simplicity of the example, we cannot show the pruning of the propagation paths (Step 5). The final step in the diagnosis (Step 6) is to identify the defective unit. We map each propagation path that appears in Table 5.2 to the functional units used by the path’s instructions. DIEBA maintains a counter for each unit based on the number of paths in which it is used. The Integer ALU’s counter will be 3 (used in 3 paths) and the Load-Store Unit’s (LSU’s) counter will be 1 (used in one path). Therefore, 78  Table 5.2: Backtracking paths for the example code. Propagation Path in Nodes 5 1 → 6 and 5 → 6 5 → 12, 9 → 12 and 7 → 9  Units Used Integer ALU Integer ALU and LSU Integer ALU  DIEBA correctly concludes that Integer ALU is the defective unit in this example.  5.4  System Description  We use fault injection campaign at the micro-architecture level to evaluate DIEBA’s accuracy. We now describe our fault model and the experiment setup of our simulations. Fault model In our simulations, we emulate one active duration of an intermittent fault (circled and in red in Figure 5.4). This is because we assume that an active intermittent fault is likely to cause a failure and hence the diagnosis technique is invoked within this active duration.  Figure 5.4: Fault model used in studying error diagnosis. Each fault injection experiment involves choosing the following fault parameters randomly (Table 5.3): • Fault location: We inject the destination register in the integer ALU, multiplier, divider and the floating point (FP) units. 
Also, we inject the data stored/ loaded and the memory address (each of these locations represents a different injection) in the Load-Store Unit (LSU) . • Fault start: a single intermittent fault is started at a random cycle within 1 million dynamic instructions. 79  • Fault active duration: We choose active durations that range from 5 to 20,000 cycles because voltage fluctuations last from 5 to 30 cycles [31], while temperature fluctuations may last for many thousands of cycles [72]. • Micro-architectural fault model: our models are described in detail in Section 2.1.  Experiment setup We perform intermittent fault injections using a micro-processor simulator running a benchmark program, applied DIEBA to all the failing runs of the program, and compared the diagnosed unit with the fault injected unit. Our fault injection tool is described in Section 2.3. This modified simulator injects faults into specific micro-architectural units, collects crash dump files and performs program replays. We did not implement an independent deterministic replay technique; we rely on the simulator to record and replay sources of nondeterminism. We use seven integer and four floating point (FP) benchmarks from the SPEC 2006 suite for our evaluation. Each benchmark is forwarded for 2 billion instructions to remove initialization effects, and then a single intermittent fault (which may span multiple cycles) is injected at a random cycle within the next 1 million instructions7 . After the injection, we monitor the benchmark for another million instructions for the following events, (1) the benchmark terminates with a hardware trap, thus leading to a crash, or (2) the number of data values corrupted by the fault crosses a pre-determined threshold (500 in our experiments). The first event signifies a program failure, while the second event signifies a likely detection of the error by a software-based error detection mechanism. This is based on prior work that has shown that errors that have a high fanout in a program often lead to application crashes [54]. The experiment consisted of the following two phases. Injection phase For each benchmark program we inject a total of 3500 faults. For each fault, the parameters in column 1 of Table 5.3 are chosen based on their value ranges in column 2. Only one fault is injected in each execution to ensure controllability (we assume that no more than one unit is defective in each injection). 7 We  do not use SimPoint as its value in reliability based studies is unclear.  80  If the fault results in a program crash or an error is detected (based on the threshold value), then the failure dump consisting of the register file, memory footprint and the program counter is captured at the time of the crash, in addition to the number of instructions that were executed by the program. Table 5.3: Fault injection parameters. Fault Parameter Location bit Location unit  Start cycle Duration Micro-architectural Model  Value/Range A bit chosen randomly from 0 to 63 in a micro-architectural unit. Integer ALU, multiplier, divider, LSU (data, read address, write address), FPU A cycle chosen randomly from 1 to 1,000,000 5, 50, 100, 500, 10,000 or 20,000 cycle Stuck-at-one/zero/last-value and Dominant-0/1  Diagnosis phase We extract the failure dumps of all injected faults that result in a failure, and run DIEBA on the extracted dumps. We log the control flow of the program internally within the simulator to emulate the recording of program signatures. 
DIEBA’s output falls into one of the following categories: (1) Load Store Unit (LSU), (2) Integer ALU and (3) FPU. LSU includes three injection locations: load-store data, load read address and store write address. We combine these locations into one because of the following: (1) instructions that change memory addresses reside in the same propagation paths as instructions that change the corresponding data, (2) most propagation paths that process memory contain both load and store instructions in the same path and (3) most reconfiguration techniques at the micro-architecture level will replace the entire LSU unit (rather than parts of LSU) if they find it defective. The integer ALU includes integer adder, logical operations, multiplier and divider. Finally, FPU includes FP adder, multiplier and divider, as most of the crashes that occurred in the FPU were due to injections in the FP adder.  81  5.5  Results  We first discuss the consequences of injecting intermittent faults into our benchmarks (Section 5.5.1). Then we present the accuracy of DIEBA (Section 5.5.2). Next, we estimate the online and offline performance and memory overheads of DIEBA based on characteristics of the applications (Section 5.5.3).  5.5.1  Fault Injection Outcomes  We start by quantifying the impact of intermittent faults on how programs terminate. This is important because we use program crashes and error detections to trigger the diagnosis technique in our experiments. We inject faults into each benchmark program and monitor it for the presence of either crashes or error detections (Table 5.4). We find that about 73% of the injected errors are activated (errors manifest themselves to the program). Out of the activated faults, 33% are benign (do not affect the program’s state), 53% cause program crashes, and the remaining 14% lead to Silent-Data Corruptions (SDC). Using our detectors, we are able to detect 7% of the activated faults that cause SDCs. Therefore, of the activated faults, we find that about 60% lead to crashes or detections. These are the cases to which DIEBA is applied in Section 5.5.2.  5.5.2  DIEBA Accuracy  To evaluate DIEBA’s accuracy, we run DIEBA when an injected fault causes program crash or error detection. DIEBA attempts to identify the functional unit into which the fault was injected. We classify DIEBA’s output into three categories: (1) “Correct” where DIEBA uniquely identifies the injected unit, (2) “Correct with false positives” where DIEBA identifies the injected unit along with some other units as possibly defective (recall that we inject only one unit in each run), and (3) “Incorrect” where DIEBA identifies a different unit(s) than the one that is injected.  82  Table 5.4: Number/percentage of program crashes or error detections for each benchmark. Benchmark  mcf  gcc  bzip2  astar  perlbench  dealII  soplex  sjeng  hmmer  milc  lbm  Number  1274  1586  1570  1532  1696  1202  1475  1283  1538  14429  1630  Percentage  48%  68%  60%  59%  67%  55%  56%  55%  70%  60%  59%  83  Figure 5.5: DIEBA accuracy for different defective units. Figure 5.5 shows the results of the diagnosis for Integer ALU and LSU for the integer benchmarks, and for the Integer ALU, LSU and FPU for FP benchmarks. The results are expressed as a percentage of the total number of crashes/detections for each benchmark. 
From Figure 5.5, we find that DIEBA successfully diagnoses 87% of the crashes/detected errors, on average, across all benchmarks - this includes the correct (70%) outcome and correct with false-positives outcomes (17%). There are two reasons for the false positives: (1) overwritten evidence: program state corrupted by an intermittent error might be overwritten with correct data, which means that some “evidence” an error leaves on a program’s state are erased, and (2) masked evidence: erroneous data might be masked at the program level by instructions such as AND and OR. Further, our simulation infrastructure is unable to ensure 100% replay determinism, and hence there are small offsets in the stack pointer for some of the runs. This also accounts for some of the missed diagnoses. We note that our technique is able to achieve 70% accuracy despite the lack of deterministic replay. We examine the sensitivity of DIEBA’s accuracy to the crash distance, error active duration (Figure 5.6) and fault model (Figure 5.7) below. Crash Distance (CD) effect We study the correlation between the CD and the accuracy of DIEBA (figure not shown). Recall that CD is the number of dynamic instructions that execute from the start of the error until the program crashes or 84  an error is detected. Unlike the fault active duration and fault model, CD is not an independent parameter that we can control. We find that shorter CDs correlate weakly with increases in the diagnosis accuracy. For example, DIEBA has an accuracy of 60% for the gcc benchmark, which has a CD of 6365 instructions, compared to an accuracy of 72% for perlbench, which has a CD of 371 instructions. Error active duration effect: In general, very small error durations of 5 cycles or less result in less accurate diagnosis (62% accuracy, on average) compared to longer errors (71% accuracy, on average)(Figure 5.6). This is because errors that last only a few cycles result in fewer error clues (erroneous registers, memory locations and branches). Because DIEBA relies on such clues for backtracing, its accuracy is lower for shorter errors. Fault model effect: Figure 5.7 shows DIEBA’s results classified based on the error micro-architectural model. We find that, for the programs considered, the fault model does not affect DIEBA’s accuracy considerably.  5.5.3  DIEBA Overhead  We estimate DIEBA’s online and offline performance overheads, and the memory overheads required by the program signatures. Online performance overhead: The runtime overhead of DIEBA depends on two factors: (1) data collected during the record phase, which is required to facilitate a deterministic replay, and (2) store instructions required to save program signature and hence ensure that the control flow of a fault-free run matches that for a failure run. Different replay techniques incur different amounts of online performance overhead. Assuming that Flight Data Recorder replay tool is used, then the overhead of the record phase is about 2% or less [82]. As for the second factor, we estimate the online time overhead by finding the number of executed BBs in each benchmark, assuming that there is one store instruction for each BB and computing the percentage of additional instructions that execute after adding these instructions. We find that, on average, 16% more instructions will need to be executed to store the program signature. 
Later in this section, we show that the store instructions are highly likely to hit in the L1 cache, and are hence unlikely to incur additional cache miss latencies.  85  Therefore, the overall online performance overhead of DIEBA is 18%. Later in the discussion section (Section 5.6), we propose to turn on the signatures of a program and the replay recorders only after error detection, this way this overhead is encountered only when diagnosis is needed. Offline performance overhead: The offline time requirements of the diagnosis procedure depend on three factors: (1) replaying the execution to construct a fault-free run, (2) building the DDG, and (3) finding erroneous data and performing the backtracing analysis. We do not consider the overhead of the replay technique. The overhead associated with the latter two factors depends on the size of the backtracing window (increases with increasing window size), but does not depend on the size of the program. Therefore, the offline performance overhead of the technique can be configured based on the size of the backtracing window. In our experiments, we assume a window size of 20,000 instructions (see below) and measure the time taken to construct the DDG and execute the DIEBA algorithm. We find that the time taken is at most 70 milliseconds for all the benchmarks (on an Intel Xeon E5345 Quad Core 2.33GHz system with 8GB of main memory) which is negligible. Thus, the offline overhead is dominated by the replay overhead. Online memory overhead: To estimate the online memory overhead we experimentally evaluate the average number of instructions executed by the application after an intermittent error, but before a crash or detection. The goal is to provide an estimate of the memory size of the ring buffer required to store a program signature in memory. We use this buffer to maintain n recent Basic Block (BB) identifiers, where n is an integer that represents number of entries in the buffer. We measure the crash distance (CD) of the activated intermittent errors, or the number of dynamic instructions that execute from the injection of the fault to the final crash, for those injections that lead to a crash (Table 5.5). We find that the maximum average CD is 10235 (for mcf), and the average CD across all benchmarks is 3688. Based on these results, we conservatively choose the size of the backtracking window to be 20,000 instructions. Since the average BB size measured in our experiments is 6 instructions, n will be around 20, 000/6 = 3333.  86  87  Figure 5.6: DIEBA accuracy for different error active durations.  88  Figure 5.7: DIEBA accuracy for different fault models.  Table 5.5: Crash distance in dynamic instructions for each benchmark. Benchmark  mcf  gcc  bzip2  astar  perlbench  dealII  soplex  sjeng  hmmer  milc  lbm  CD  10235  6365  879  492  371  4512  2319  2206  960  558  1172  89  To estimate the memory overhead of the ring buffer, we assume that each BB identifier occupies 32 bits, or one machine word. This is sufficient to uniquely identify more than a billion BBs in a program (because we want the BB IDs to be protected, we set aside two bits for Hamming codes). The memory used by the signature buffer is n × 32 bits. Since we chose n = 3333, a 3333-entry buffer will consume about 3333 × 32 bits, which is 10.4KB or 1/6th of a typical 64KB L1 data cache. Therefore, we could reserve some cache lines in L1 for this buffer, avoiding misses.  
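The sizing argument above can be summarized in a few lines (a sketch under the stated assumptions: a 20,000-instruction backtracing window, an average BB size of 6 instructions and one 32-bit identifier per buffer entry):

def signature_buffer_entries(window_instructions=20_000, avg_bb_size=6):
    """Ring-buffer entries needed to cover the backtracing window."""
    return window_instructions // avg_bb_size   # 3333 entries

def signature_buffer_bits(entries, bits_per_entry=32):
    """Total buffer size in bits; each entry holds one BB identifier."""
    return entries * bits_per_entry

n = signature_buffer_entries()
print(n, signature_buffer_bits(n))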
5.6  Discussion  Intuitively, a software-based approach can only use the register-file contents, the memory footprint and the program counter at the time of the crash to find which instructions have been affected by the error (either directly or by error propagation from other instructions). Once these instructions are identified (by backtracing erroneous data, for example) different approaches can be used to diagnose the error. In Chapter 3, we find that about 42-46% of the erroneous state is available in the crash dump at the time of the failure, suggesting that software-based techniques have substantial potential. For the rest of this section, we discuss the potential and limitations of software techniques. Coverage Software techniques are often limited to errors in the processor backend, i.e., functional units and load-store units, as errors in the processor’s front-end are pervasive in the software, and cannot be easily localized. Functional units and load-store units comprise 57.1% of the total area of OpenSPARC-T1 core [1] [57] and 51.4% of POWER4-like processor [30] excluding caches. Second, software-based approaches cannot distinguish the erroneous unit among a set of replicated units, as the information about which unit has been used during program execution is not available to the software. Therefore, software-based approaches need to be complemented with hardware-based approaches that expose this information to the software, or else they can only be applied to simple cores which do not have replicated units. However, if the processor consists of simple 90  cores then identifying the unit type should suffice. Overhead  Software-based techniques that require program instrumentation or re-  play techniques, such as DIEBA, can eliminate the overhead that is encountered before the occurrence of errors by following these suggestions: • Overhead of the replay tool: a replay tool does not need to be running all time to replay a program. It is rather enabled on demand, such as when a crash occurs, such tool is called “an offline replayer” [82]. Triggering a replay tool by a crash means that DIEBA can run starting from the next crash. A replay tool is also responsible of re-constructing the architecture state to the “initial replay state”. This includes all registers, IO state, memory footprint, TLBs. • Overhead of executing additional instructions: these instructions can be instrumented in the program when an error is detected using binary translation. For example, fastBT [56] reported an overhead of 6% for SPEC2006 benchmarks when the code is translated. Therefore, DIEBA will not impose overhead on fault-free software because of signature instructions. Rather, the online overhead is estimated to be 24%, on average (16% signature instructions + 2% replay recorders + 6% binary translated code) and will only be encountered when an intermittent error is detected until the error is diagnosed  Simultaneous errors  Software techniques cannot distinguish between multiple  intermittent errors that occur simultaneously. However, intermittent errors happen relatively infrequently compared to the typical execution time of many applications, and hence intermittent errors are unlikely to affect multiple functional units. Benign faults  A software-based diagnosis technique cannot diagnose faults that  do not impact the application. In most cases, it may not be necessary to react to faults that do not impact the application. 
However, in some cases, such faults may be indicators of future problems or they may adversely impact performance (e.g., temperature hotspots). 91  Despite these limitations, we believe software-based diagnosis approaches such as DIEBA have significant advantages over hardware-based approaches.  5.7  Conclusions  In this chapter, we study whether software-based techniques, represented by DIEBA in this work, can diagnose intermittent errors. Starting from the failure dump, DIEBA can isolate the faulty functional unit in the processor that was likely responsible for the failure. DIEBA can run without using any hardware support and does not run any additional tests on the core. We experimentally evaluate the accuracy of DIEBA and find that it can accurately diagnose 70% of the injected faults. We estimate that DIEBA has 18% performance overhead (during diagnosis) and 10.4KB online memory overhead. In the next chapter, we study how different phases of error tolerance techniques work together to mitigate intermittent errors through an end-to-end case study.  92  Chapter 6  End-to-End Case Study In the previous two chapters we studied two phases of error tolerance: recovery and diagnosis. Other phases include error detection, discrimination and reconfiguration. In this chapter, we discuss how error tolerance techniques that we proposed in this dissertation together with existing error tolerance techniques can improve the overall system reliability. The primary goal is to give the reader a sense of the overheads of different error tolerance phases. We achieve this through a thought experiment using a set of a real-world processor and two workloads. Any quantitative conclusions drawn in this chapter are obtained from simplified modeling and were not gathered empirically. We use a state-of-the-art processor, ARM Cortex-A15 MPCore [7], which is based on ARMv7-A architecture. This processor was first released in 2012 and is projected to be used in smartphones, mobile computing, low-power servers and more. We assume that this processor is equipped with techniques for intermittent error detection, discrimination, diagnosis and recovery. We also assume that an intermittent error occurs in the load-store unit in this processor while it is running a parallelized memory-intensive workload. We then explain how different phases of error tolerance techniques operate together. Finally, we estimate the performance overhead for each phase.  93  Figure 6.1: Overview of ARM Cortex-A15 MPCore processor with two cores.  6.1  System Description  We describe the processor, the workloads and the fault model we use in our example. Processor The ARM Cortex-A15 MPCore processor can have between one and four cores. In this example, we assume a dual-core processor (Figure 6.1). Each core has limited redundancy, with one micro-architectural unit of each type except for two integer ALU pipelines and three decoders. This processor is fabricated using 32nm/28nm technology and functions with clock rate up to 2.5GHz. It has 2MB L2 cache and 64KB L1 cache per core. Workload  The workload in this example consists of two benchmarks from the  PARSEC suite [9]. The first is Freqmine, which is a data mining benchmark that has 50% of its instructions as reads/writes. This benchmark mines a set of transactions. The second is Vips, which is an image processing benchmark that has 27% of the instructions as reads/writes. This benchmark processes an 2662x5500 pixel image. 
Figure 6.2: End-to-end example of an error-tolerant core.

6.2  Overhead of Error Tolerance Techniques

We start by discussing how the different phases of fault tolerance operate in general (Figure 6.2); then we quantify the performance overhead for the processor running the workloads. In this section, we rely on other works for error detection, discrimination and recovery. We use DIEBA for error diagnosis (Chapter 5), and results from our recovery analysis to estimate checkpointing and program-restoration overheads (Chapter 4).

Figure 6.2 shows an overview of how the fault tolerance phases operate. In this diagram, a program that runs error free is represented by Box 1. When an error occurs, it might trigger a detector [41] [44] (Box 2). Usually, error detection techniques are independent of the error type. Upon error detection, the program is rolled back to the last checkpoint, and all the work that was not checkpointed is lost. Since our workloads are parallelized, we consider only coordinated checkpoints in this diagram. Next, an error discrimination technique is used to identify the error type in order to run the suitable error tolerance techniques [11] (Box 3). Once an error is identified as intermittent¹, a diagnosis technique, such as DIEBA, is run (Box 4)². When the defective component of the processor is diagnosed, a decision is made about whether to disable this component (Box 5). If a decision is made to disable the defective component, then the processor is reconfigured (Box 6).

1 If the error is transient, then we do not need to diagnose the faulty component or reconfigure around it, because the error is unlikely to recur. If the error is permanent, then the same fault tolerance phases can be used.
2 In the diagram, we assume that a diagnosis technique needs to be enabled for some time in order to accurately identify the faulty component.
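The flow in Figure 6.2 can be restated compactly as follows. This is only a runnable paraphrase of Boxes 1-6 with stub handlers; it does not implement the cited detection [41] [44], discrimination [11], diagnosis or reconfiguration techniques, and all probabilities and names are illustrative.

```python
# Compact restatement of the Figure 6.2 flow with stub handlers (Boxes 1-6).
import random

def run_with_error_tolerance(total_slices=20, seed=1):
    random.seed(seed)
    reconfigured, done = False, 0
    while done < total_slices:
        checkpoint = done                           # Box 1: coordinated checkpoint
        done += 1                                   # run one slice of the program
        error = (not reconfigured) and random.random() < 0.3  # Box 2: detector fires (stub)
        if not error:
            continue
        done = checkpoint                           # roll back; uncheckpointed work is lost
        kind = random.choice(["transient", "intermittent"])    # Box 3: discrimination (stub)
        if kind == "transient":
            continue                                # transient: unlikely to recur, just re-execute
        faulty_unit = "LSU"                         # Box 4: diagnosis, e.g., DIEBA (stub)
        if faulty_unit is not None:                 # Box 5: decide whether to disable it
            reconfigured = True                     # Box 6: reconfigure around the unit
            print("reconfigured around the", faulty_unit)
    print("finished", total_slices, "slices")

run_with_error_tolerance()
```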
Figure 6.3: Performance overhead of different error tolerance phases when running Freqmine.

Assume that the load-store unit (LSU) in Core 1 is affected by an intermittent error represented by the fault model described above. Core 1 has only one LSU; after successful error diagnosis, both cores share the functional LSU in Core 2³. Figures 6.3 and 6.4 estimate the performance overhead of error tolerance techniques running concurrently with a workload. We show six different bars in each figure, each of which represents the overhead of one or more fault tolerance techniques, depending on whether an error has been detected, differentiated as intermittent and diagnosed correctly. Table 6.1 shows the techniques that are activated during each tolerance phase and the breakdown of each estimate. In the following, we explain how we obtain the overhead estimates; a short sketch after this list reproduces the component ranks below and sums the per-phase estimates row by row as in Table 6.1.

3 We assume that the cores can be reconfigured and that there is an efficient way to share the LSU between the two cores.

Figure 6.4: Performance overhead of different error-tolerance phases when running Vips.

• Error Detection  We assume the use of the work by Sahoo et al. [65] on utilizing program invariants as error detectors to estimate the detection overhead (2%).

• Checkpointing and Recovery  We assume the use of the work by Plank et al. [58], together with our modeling of recovery scenarios in Chapter 4, to estimate the overhead of storing a checkpoint (4%) and of restoring the program state when an error is detected (34%).

• Error Discrimination  We assume the use of the alpha count-and-threshold error discrimination mechanism proposed by Bondavalli et al. [11]. This method executes a few mathematical equations every time an error is detected; we estimate the overhead of executing these equations to be negligible (0%).

• Error Diagnosis  The overhead of the diagnosis technique is estimated using DIEBA's overhead (Chapter 5), in addition to the 100% overhead of the replay on Core 2 required by DIEBA.

• Healthy-unit Disablement  We assume that the overhead of disabling a healthy unit is 10%. This number is an example of the overhead encountered when a non-critical micro-architectural unit is disabled.

• Faulty-unit Disablement  The overhead of disabling the LSU for each workload is found using the "component rank" (Chapter 4) as follows:

– The component rank for Freqmine is 33%, or 1 − (1/(0.50 + (0.50/0.50))).

– The component rank for Vips is 21%, or 1 − (1/(0.73 + (0.27/0.50))).
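To make the arithmetic explicit, the short sketch below reproduces the component ranks above, assuming the rank formula generalizes to 1 − 1/((1 − m) + m/0.5) for a workload whose fraction of memory instructions is m (our reading of the two instances above), and sums the per-phase estimates of each row of Table 6.1. The constant and identifier names are ours.

```python
# Reproduces the component ranks above and the per-row sums of Table 6.1.

def component_rank(mem_fraction, shared_slowdown=0.5):
    """Relative slowdown of a workload when its only LSU is disabled and
    memory operations run at `shared_slowdown` of their normal rate."""
    return 1 - 1 / ((1 - mem_fraction) + mem_fraction / shared_slowdown)

print(f"Freqmine rank: {component_rank(0.50):.0%}")   # ~33%
print(f"Vips rank:     {component_rank(0.27):.0%}")   # ~21%

# Per-phase overhead estimates from the list above; DIAGNOSE is the 124%
# diagnosis term of Table 6.1 (DIEBA plus the 100% replay on Core 2), and
# UNIT_OFF is the 10% unit-disablement term as it appears in the table.
DETECT, CKPT, RESTORE, DISCRIM, DIAGNOSE, UNIT_OFF = 0.02, 0.04, 0.34, 0.0, 1.24, 0.10

table_6_1 = {
    "Fault free":                                   DETECT + CKPT,
    "Error not distinguished":                      DETECT + RESTORE + DISCRIM,
    "Error distinguished":                          DETECT + RESTORE + DIAGNOSE,
    "Correct reconfiguration":                      DETECT + CKPT + UNIT_OFF,
    "Incorrect reconfig., error not distinguished": DETECT + RESTORE + DISCRIM + UNIT_OFF,
    "Incorrect reconfig., error distinguished":     DETECT + RESTORE + DIAGNOSE + UNIT_OFF,
}
for phase, overhead in table_6_1.items():
    print(f"{phase:<46} {overhead:.0%}")
```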
The overhead when an intermittent error occurs but is not diagnosed (Bar 2) represents the overhead of the detection and recovery techniques and is the same for both Freqmine and Vips. This bar corresponds to the rollback-only recovery scenario in Chapter 4. However, Freqmine uses the error-prone LSU more than Vips does, and hence disabling this LSU has more impact on Freqmine than on Vips (Bar 4). As a result, reconfiguring the processor around the faulty LSU is feasible for Vips only, while Freqmine relies on restoring its state to the last checkpoint every time an intermittent error is detected. To summarize, error detection, error discrimination, diagnosis at the micro-architecture level and choosing an efficient recovery strategy are all necessary to ensure that reliable processors maintain high throughput.

6.3  Conclusions

In this chapter, we discuss an end-to-end scenario of an intermittent-error-prone processor that is equipped with all phases of error tolerance. We show the performance overhead of running the different phases of error tolerance while the processor is running realistic workloads. Depending on how heavily the defective component of the processor is used by the workload, the most effective recovery action may involve hardware reconfiguration. We show that all the error tolerance actions, detecting an error, diagnosing its source, computing the rank of the faulty component and performing the appropriate recovery actions, work together to ensure that the processor maintains high throughput while generating correct data.

Table 6.1: Error-tolerance techniques activated during different tolerance phases.

Tolerance Phase | Techniques Activated | Overhead Breakdown
Fault free | Detection and checkpointing | 2% + 4%
Error not distinguished | Detection, recovery and distinguishing | 2% + 34% + 0%
Error distinguished | Detection, recovery and diagnosis | 2% + 34% + 124%
Correct reconfiguration | Detection, checkpointing and faulty-unit disablement | 2% + 4% + 10%
Incorrect reconfiguration, error not distinguished | Detection, recovery, distinguishing and healthy-unit disablement | 2% + 34% + 0% + 10%
Incorrect reconfiguration, error distinguished | Detection, recovery, diagnosis and healthy-unit disablement | 2% + 34% + 124% + 10%

Chapter 7  Related Work

In this chapter, we survey related work on characterizing, diagnosing and recovering from hardware errors. We show that intermittent fault tolerance is a relatively new field with many challenges. These challenges arise mainly from the non-deterministic nature of intermittent faults and from the lack of fault models that approximate the faults' behavior at high levels and hence facilitate studies and evaluations. We also highlight our contributions relative to related work.

7.1  Characterization Studies

This section surveys previous work on hardware fault characterization. We show that research on intermittent-error characterization is still in its infancy.

1. Intermittent faults: Pan et al. [49] proposed the Intermittent Vulnerability Factor (IVF) to characterize the sensitivity of different micro-architectural units to intermittent faults. While analytical models for vulnerability estimation are useful, they grossly overestimate the impact of faults on programs because they do not consider the end-to-end behavior of the program, unlike fault injection, on which our technique is based. For example, Wang et al. [79] have shown that analytical models overestimate soft-error vulnerability by 2.6x compared to fault injection. Gracia et al. [24] studied the behavior of intermittent faults in a VHDL model of a commercial micro-controller using toy benchmarks. They found that the intermittent fault's length (the full duration of the fault, or t_L in Figure 2.1) is the most influential variable in error propagation. However, they did not consider the impact of intermittent faults on programs executing on the processor, which is important for developing software-based fault tolerance mechanisms.

2. Permanent faults: Li et al. [38] performed a fault injection study similar to ours for permanent faults. They found that 95% of the activated faults can be detected by monitoring for a hardware trap, a hang or excess operating-system calls. The remaining 5% are mainly benign faults that do not affect the program state. They also found that 86% of the detected faults can be detected within 100K instructions. Karimi et al. [33] injected transient and permanent faults into the control logic of an RTL simulator augmented with a functional simulator, but they did not consider intermittent faults.

3. Transient faults: The effect of transient faults on programs is a well-studied topic [25, 36, 73]. For example, Gu et al. [25] conducted massive fault-injection campaigns into the Linux kernel to study the impact of transient faults. They found that programs do not necessarily crash immediately but often continue executing as the error propagates. They also showed that 40% of crash-causing transient errors crash within 10 cycles.
Summary  Unlike permanent faults, intermittent faults do not persist indefinitely, but rather occur non-deterministically. Moreover, unlike transient faults, intermittent faults tend to recur in the same location. Therefore, the results of permanent- or transient-fault characterization cannot be generalized to intermittent faults. For example, in our characterization study we find that 53% of the activated intermittent faults lead to crashes, compared to 95% of activated permanent faults [38]. This is because permanent faults persist forever and are more likely to change the program state.

7.2  Recovery Techniques

The fault recovery scenarios we study in Chapter 4 are inspired by recovery proposals in the literature. For example, Schuchman et al. [67] (Rescue) and Shivakumar et al. [69] proposed to remedy permanent faults by exploiting redundancy at the micro-architecture level to compensate for the disabled error-prone units. They both targeted hard faults. Romanescu et al. [64] proposed core cannibalization, in which cores that are identified as permanently defective are cannibalized into pipeline-stage parts. The functional parts of a defective core then serve as a "supply of replacements" for other fault-free cores. They do not consider the impact of error detection, diagnosis and checkpointing on the performance and availability of the chip. Powell et al. [59] proposed core salvaging, in which the defective units in a core are disabled and execution falls back on other healthy cores whenever the disabled units are needed. Meixner et al. [45] proposed to disable the defective units and to use application detouring. Detouring converts instructions that use the defective units into equivalent instructions that use only healthy units, where possible; this conversion is done using binary translation layers. Both core salvaging [59] and program detouring [45] target permanent faults. Wells et al. [81] proposed to recover from intermittent errors by disabling the core and letting it cool down for some time. Their rationale is that some intermittent faults might be induced by fluctuating temperature and voltage; hence, if a core is disabled for a while, its temperature and voltage would stabilize.

Srinivasan et al. [30] used RAMP [75] to model two recovery scenarios for aging-related errors. The first scenario uses spare micro-architectural redundancy that is only enabled when other units fail. The second scenario uses the existing (not spare) micro-architectural redundancy within a core and degrades performance gracefully as units fail. They also consider hybrid models of the two scenarios. They found that spare units suit systems where performance can be sacrificed to enhance reliability, while the second scenario suits systems where reliability can be sacrificed to enhance performance; a hybrid of both scenarios can yield the best reliability enhancements. Our work is different in that (1) we build an end-to-end model that includes error discrimination, detection, diagnosis and recovery, while they focused on error recovery only, and (2) we add a model that suspends the offending part of the processor temporarily or permanently, while they focused on permanent shutdown of processor units only.

Bondavalli et al. [11] proposed a count-and-threshold based model to discriminate between intermittent and transient faults. Their model assumed Weibull-distributed intermittent faults and bursty transient fault intervals. Later, the authors extended their fault-discrimination model by adding models for fault diagnosis and recovery for nodes (groups of processors) in distributed systems [68]. They assumed one recovery scenario for intermittent faults, which is to remove the defective node. In our work, we build on [11] by exploiting their count-and-threshold model; however, we consider different models for diagnosis and recovery at the processor level.
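To illustrate the count-and-threshold idea, the sketch below shows a simplified alpha-count-style discriminator in the spirit of [11]: a score grows on every detected error and decays while no errors are observed, and the fault is treated as non-transient once the score crosses a threshold. The decay factor and threshold are illustrative values, not the tuned parameters of [11], and the code is our own sketch rather than the published formulation.

```python
# Simplified count-and-threshold (alpha-count style) discriminator: isolated
# transient errors decay away, while recurring intermittent errors push the
# score over the threshold. DECAY and THRESHOLD are illustrative values.
DECAY, THRESHOLD = 0.9, 3.0

class AlphaCount:
    def __init__(self):
        self.score = 0.0

    def observe(self, error_detected):
        """Update the score for one observation interval; return True once
        the fault should be treated as intermittent (or permanent)."""
        if error_detected:
            self.score += 1.0
        else:
            self.score *= DECAY
        return self.score >= THRESHOLD

# A lone transient error never reaches the threshold...
transient = AlphaCount()
print(any(transient.observe(err) for err in [1, 0, 0, 0, 0, 0, 0, 0]))  # False
# ...but a burst of recurring errors does.
intermittent = AlphaCount()
print(any(intermittent.observe(err) for err in [1, 0, 1, 1, 0, 1, 1]))  # True
```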
Summary  Error discrimination and detection have been studied by Bondavalli et al. [11], while unit reconfiguration has been studied by Srinivasan et al. [30] and Aggarwal et al. [4]. Also, error recovery techniques have been studied for permanent [45, 59, 64] and intermittent faults [81]. Ours is the first study that combines all these models at the chip-multiprocessor level and evaluates the overall impact of the recovery choice on the performance of the system. This is important for improving the performance of real-world processors.

7.3  Diagnosis Techniques

Diagnosis has been well explored in distributed systems. However, there has not been much work on diagnosis at the micro-architecture level, which is DIEBA's goal. Diagnosis at this level is challenging because low-level information about the processor needs to be analyzed. We discuss prior work on diagnosis and testing at the micro-architecture level, and classify it into four broad areas.

Hardware trace buffers  A number of techniques have proposed the use of hardware buffers to capture the state of the processor as it executes [51, 52]. These techniques have been deployed during post-silicon debugging of processors. They record detailed information about the instructions executed by the processor. In the event of a failure or error detection, the recorded information is scanned out and analyzed to isolate the root cause of the error. The main issue with these techniques is that they require hardware-based recorders to be implemented in the processor, which imposes area and energy overheads.

Continuous monitoring and testing  Built-in self-test (BIST) techniques attempt to test the processor's functional units when they are not being used [70]. Because the tests run in the background, this approach incurs very low latency and achieves high coverage. Bower et al. used checker cores and collected fine-grained information about the flow of instructions through a processor's pipeline to isolate the functional unit that caused the error [13]. They used heuristics to determine when a component is faulty, based on the number of times the component is used by faulty instructions. The main problem with continuous monitoring techniques is that they add hardware to check the processor's functionality, which increases design complexity and processor area. Further, these techniques are specific to the micro-architecture they are designed for, and often require considerable effort to port to a different micro-architecture.

Periodic online testing  Periodic testing approaches execute hardware tests periodically to find errors during the processor's operation [16] [39]. While these tests run, the application executing on the processor is suspended and the tests execute for a short period of time. As a result, these techniques can incur performance overheads even during fault-free operation. An issue with all testing techniques (both continuous and periodic) is that they assume that the fault appears during at least one of the testing phases. This assumption may not hold for intermittent faults, as they are non-deterministic in nature and may be triggered under circumstances that occur only during the application's execution.
Application-specific diagnosis  Application-specific techniques aim to invoke diagnosis only for faults that have the potential to affect the application [37, 57]. Pellegrini and Bertacco [57] monitor the application's usage of the processor's functional units and run tests only for the units that are used. Because they focus the testing effort on specific units, they incur lower overheads than generic techniques. Like all testing techniques, however, they assume that the fault occurs during the testing period. Li et al. [37] diagnose permanent faults by monitoring the application for abnormal events (e.g., too many TLB misses) and replaying it from a checkpoint if these events occur. During replay, their technique activates hardware structures to record processor-level information and gather a trace of the faulty execution. The trace is then shipped to another core, and the diagnosis is carried out there by comparing the trace to a fault-free execution trace. The authors also assume that the fault is reproducible during diagnosis, which is true for permanent faults but not for intermittent faults.

Summary  We see that existing techniques either require the fault to be reproducible or need additional hardware support. DIEBA is an application-specific technique in that it only considers errors that affect the application. However, it differs from the above techniques in that it does not require the fault to be reproducible, nor does it require hardware support to gather additional information about the processor.

7.4  Conclusions

In this chapter, we show that there is a need for intermittent fault characterization studies to guide the design of intermittent fault tolerance techniques (as has been done for transient and permanent faults). For example, Li et al. carried out a characterization study of permanent faults at the micro-architecture level [38] before they built a diagnosis technique for permanent errors [37]. We performed such a study in Chapter 3, focusing on the implications for error diagnosis. Further, we explored related work on error recovery approaches. These works suggest interesting directions to remedy the impact of a defective core in a processor [30, 45, 59, 64]; however, none of them provides an end-to-end study of different recovery strategies. We performed such a study in Chapter 4 to choose the most effective recovery action for intermittent faults. Finally, we surveyed a range of error diagnosis approaches, including adding hardware buffers to the processor [51], performing periodic/continuous testing [16, 39, 70] and performing application-specific testing [37, 57]. We showed that it is not straightforward to apply these techniques to intermittent faults (which are not deterministically reproducible) while maintaining low performance and area overheads during fault-free execution. Our diagnosis technique, DIEBA, uses little to no additional hardware and collects information during fault-free operation to diagnose the faulty functional unit.

Chapter 8  Conclusions and Future Work

The continued scaling of CMOS devices has led to transistor sizes that are in the nanometer regime. This size reduction suggests that individual transistors will soon consist of a small number of atoms, making it difficult to ensure their reliability¹.

1 Experiments on single-atom transistors have already been reported in research labs [22].
Future processors will therefore be susceptible to high rates of hardware faults due to process variations, device wearout and manufacturing defects [43]. Hardware faults are classified according to their frequency of occurrence into transient, intermittent and permanent faults. Prior work has mainly focused on building fault tolerance and avoidance techniques for transient and permanent faults. However, recent studies have shown that intermittent faults cause up to 40% of hardware failures in real-world machines. Further, intermittent faults differ from transient faults in that they recur in the same location, and they differ from permanent faults in that they occur non-deterministically. Therefore, it is not straightforward to apply studies and techniques designed to address permanent or transient faults to intermittent ones. In this dissertation, we focus on understanding intermittent faults and building intermittent fault tolerance approaches. We summarize our findings in Section 8.1 and lay out our future-work directions in Section 8.2.

8.1  Conclusions

We first introduced intermittent faults, their causes and their rates in Chapter 2. Since little is known about the exact characteristics of these faults, we built approximate fault models at the micro-architecture and system levels. Our fault models are inspired by related work on permanent fault models and on logic-level models of intermittent faults. We then incorporated these models into our fault injection tool (Chapter 2). To understand the impact of intermittent faults on programs, we performed fault injections at the micro-architecture level using a realistic set of benchmarks (Chapter 3) and analyzed the results. We found that the predominant impact of an intermittent fault on software is a program crash. We also found that crash-causing errors corrupt fewer than a few hundred data values before the program crashes. Moreover, we found that about 40% of these corrupted data values are not masked by correct data at the time of the crash. This suggests that part of the corrupted program state is intact at crash time and can be used to learn about the causes of the failure.

As intermittent faults predominantly have non-benign effects on software and are likely to recur, intermittent fault recovery techniques are necessary. In Chapter 4, we evaluated the impact of different intermittent error recovery scenarios on the processor's performance and availability. To achieve this, we modeled a system consisting of a fault-tolerant processor using a Stochastic Activity Network, together with the system-level intermittent fault models we built in Chapter 2. We simulated our processor and fault models using the Mobius tool [17]. We found that the frequency of the intermittent error and the relative importance of the error location play an important role in choosing the recovery action that maximizes the processor's performance and availability. We also found that reconfiguration around small processor components, such as micro-architectural units, yields high performance and availability.

In Chapter 3, we found that some of the corrupted program state can be used to build a software-based technique that learns about the source of the failure. Also, in Chapter 4, we found that disabling only the faulty micro-architectural component of the processor preserves high throughput. Using these two findings, we proposed DIEBA, a diagnosis technique at the micro-architecture level. DIEBA is a software-based technique designed to Diagnose Intermittent hardware Errors in microprocessors by Backtracing Application state at the time of a failure (Chapter 5). DIEBA analyzes the program state after a program crash or error detection to identify the fault-prone functional unit. We evaluated DIEBA using our micro-architectural simulator and found that it can successfully diagnose 70% of the errors. However, DIEBA cannot diagnose intermittent faults that occur in the front-end of the processor or in replicated functional units, because software has only limited visibility into the hardware state and DIEBA is a software-based technique.

8.2  Future Work

In the future, the following directions are possible extensions of this dissertation:

• Intermittent faults are non-deterministic in nature, which is why most diagnosis techniques for permanent errors cannot be applied to intermittent ones. One may then wonder whether we can create a technique to predict when an intermittent error will occur, and hence eliminate this non-determinism. If we can predict the occurrence of intermittent errors, then existing techniques used to diagnose permanent errors can be applied to intermittent ones, and such techniques can provide high-accuracy, high-coverage diagnosis. Intermittent errors may be predicted by recording some information about the processor the first time an intermittent error occurs. This information may include the current temperature, voltage and processor state, and/or basic information about the characteristics of the running program. By re-creating these conditions, we can preemptively "reproduce" an intermittent error and diagnose it even before the program fails.

• To recover from intermittent errors or failures, the program state is restored to the last checkpoint and the hardware is possibly reconfigured. In our characterization study (Chapter 3), we found that when a program crashes, the error has a relatively short crash distance and a small intermittent propagation set. One may therefore ask whether we can "patch" the program state instead of losing all the work that has not been checkpointed. For example, by "correcting" the instruction that causes the failure using some heuristics, the program can continue execution with some data corruptions (DCs). If we use well-designed heuristics, these DCs can be minimized and the program would generate data with "enough" accuracy.

Both of these ideas assume that the user of the fault-prone processor has the flexibility to accept some unreliable data while the processor is mitigating the intermittent error. Hence, these solutions are acceptable when reliability can be sacrificed for better performance. This can eliminate the need for expensive fault avoidance and tolerance techniques (in terms of area and power) and still provide sufficient reliability to the many users who do not use their processors for critical computations.

Bibliography

[1] OpenSPARC T1 processor. http://www.opensparc.net/opensparc-t1/index.html. → pages 65, 90

[2] IEEE standard computer dictionary: A compilation of IEEE standard computer glossaries. Institute of Electrical and Electronics Engineers, 1990. → pages 1

[3] J. Abella and X. Vera. Electromigration for microarchitects. ACM Computer Survey, 42(2):9:1–9:18, 2010. → pages 11

[4] N. Aggarwal, P. Ranganathan, N. Jouppi, and J. Smith.
Configurable isolation: building high availability systems with commodity multi-core processors. SIGARCH Comput. Archit. News, 35:470–481, 2007. → pages 103 [5] H. Agrawal and J. Horgan. Dynamic program slicing. ACM SIGPLAN Notices, 25(6):246–256, 1990. → pages 69 [6] Z. Alkhalifa, V. Nair, N. Krishnamurthy, and J. Abraham. Design and evaluation of system-level checks for on-line control flow error detection. IEEE Transactions on Parallel and Distributed Systems, 10(6):627–641, 1999. → pages 72 [7] ARM. Cortex-a15 mpcore. technical reference manual. revision: r3p2. http://www.arm.com/products/processors/cortex-a/cortex-a15.php. → pages 93 [8] A. Avizienis, J. Laprie, and B. Randell. Dependability and its threats: A taxonomy. Int. Federation for Information Processing, 156:91–120, 2004. → pages 1 [9] C. Bienia. Benchmarking modern multiprocessors. PhD Dissertation-Princeton University. → pages 94 110  [10] J. Blome, S. Feng, S. Gupta, and S. Mahlke. Self-calibrating online wearout detection. Int. Symp. on Microarchitecture, pages 109 – 122, 2007. → pages 12 [11] A. Bondavalli, S. Chiaradonna, F. Giandomenico, and F. Grandoni. Threshold-based mechanisms to discriminate transient from intermittent faults. IEEE Transactions on Computers, pages 230–245, 2000. → pages 17, 45, 96, 97, 102, 103 [12] S. Borkar, T. Karnik, S. Narendra, J. Tschanz, and A. Keshavarzi. Parameter variations and impact on circuits and microarchitecture. Design Automation Conf., pages 338–342, 2003. → pages 2, 10, 11 [13] F. Bower, D. Sorin, and S. Ozev. Online diagnosis of hard faults in microprocessors. ACM Transactions on Architecture and Code Optimization, 4(2), 2007. → pages 64, 76, 104 [14] D. Burger and T. Austin. The SimpleScalar tool set, version 2.0. Computer Architecture News, 25(3):13–25, 1997. → pages 17, 18 [15] C. Constantinescu. Intermittent faults and effects on reliability of integrated circuits. Reliability and Maintainability Symp., pages 370–374, 2008. → pages 2, 9, 11, 12 [16] K. Constantinides, O. Mutlu, T. Austin, and V. Bertacco. Software-based online detection of hardware defects: Mechanisms and architectural support and evaluation. Intl. Symp. on Microarchitecture, pages 97–108, 2007. → pages 65, 104, 105 [17] T. Courtney, D. Daly, S. Derisavi, V. Lam, and W. Sanders. The mobius modeling environment. Intl. Multiconf. on Measurement, Modelling and Evaluation of Computer Communication Systems, (781), 2003. → pages 42, 51, 107 [18] M. Depas, M. Heyns, and P. Mertens. Soft breakdown of ultra-thin gate oxide layers. European Solid State Device Research Conf., 25:235–238, 1995. → pages 5, 11, 12, 17 [19] G. Dunlap, D. Lucchetti, M. Fetterman, and P. Chen. Execution replay for multiprocessor virtual machines. Intl. Conf. on Virtual Eexecution Environments, pages 121–130, 2008. → pages 72  111  [20] D. Ernst, N. S. Kim, S. Das, S. Pant, T. Pham, R. Rao, C. Ziesler, D. Blaauw, T. Austin, T. Mudge, and K. Flautner. Razor: A low-power pipeline based on circuit-level timing speculation. Intl. Symp. on Microarchitecture, pages 7–18, 2003. → pages 2 [21] M. Ershov, S. Saxena, H. Karbasi, S. Winters, and S. M. et al. Dynamic recovery of negative bias temperature instability in p-type metaloxidesemiconductor field-effect transistors. Appl. Phys. Lett, 83(8), 2003. → pages 12, 17 [22] M. Fuechsle, J. Miwa, S. Mahapatra, H. Ryu, S. Lee, O. Warschkow, L. L. Hollenberg, G. Klimeck, and M. Simmons. A single-atom transistor. Nature Nanotechnology, (7):242246, 2012. → pages 106 [23] J. Gracia, D. Gil-Tomas, L. 
Saiz-Adalid, J. Baraza, and P. Gil-Vicente. Experimental validation of a fault tolerant microcomputer system against intermittent faults. Intl. Conf. on Dependable Systems and Networks, pages 413 – 418, 2010. → pages 5 [24] J. Gracia, J.Baraza, L. Saiz-Adalid, and P. G. Vicente. Analyzing the impact of intermittent faults on microprocessors applying fault injection. IEEE Design and Test of Computers, 1:1–7, 2011. → pages 9, 14, 18, 100 [25] W. Gu, Z. Kalbarczyk, R. Iyer, and Z. Yang. Characterization of linux kernel behavior under errors. Intl. Conf. on Dependable Systems and Networks, pages 459–468, 2003. → pages 3, 22, 101 [26] S. Gupta, S. Feng, A. Ansari, J. Blome, and S. Mahlke. The stagenet fabric for constructing resilient multicore systems. Intl. Symp. on Microarchitecture, pages 141 – 151, 2008. → pages 4, 53, 66, 67, 68 [27] M. Hsueh, T. Tsai, and R. Iyer. Fault injection techniques and tools. IEEE Computer, 30(4):7582, 1997. → pages 18 [28] Y. Huang, T. Yew, W. Wang, Y.-H. Lee, R. Ranjan, N. Jha, P. Liao, J. Shih, and K. Wu. Re-investigation of gate oxide breakdown on logic circuit reliability. Reliability Physics Symp., pages 2A.4.1 – 2A.4.6, 2011. → pages 5, 12, 17 [29] A. Hwang, I. Stefanovici, and B. Schroeder. Cosmic rays dont strike twice: Understanding the nature of dram errors and the implications for system design. Int. Conf. on Architectural Support for Programming Languages and Operating Systems, pages 111–122, 2012. → pages 1 112  [30] P. B. J. Srinivasan, S.V. Adve and J. Rivers. Exploiting structural duplication for lifetime reliability enhancement. Intl. Symp. on Computer Architecture, 33, 2005. → pages 4, 12, 65, 67, 90, 102, 103, 105 [31] R. Joseph, D. Brooks, and M. Martonosi. Control techniques to eliminate voltage emergencies in high performance processors. Intl. Symp. on High-Performance Computer Architecture, page 79, 2003. → pages 27, 80 [32] G. Kanawati, N. Kanawati, and J. Abraham. Ferrari: a flexible software-based fault and error injection system. IEEE Transactions on Computers, 44(2):248 – 260, 1995. → pages 18 [33] N. Karimi, M. Maniatakos, A. Jas, and Y. Makris. On the correlation between controller faults and instruction-level errors in modern microprocessors. Intl. Test Conf., pages 1–10, 2008. → pages 3, 22, 101 [34] J. Karlsson, J. Arlat, , and G. Leber. Application of three physical fault injection techniques to the experimental assessment of the mars architecture. Intl Working Conf. Dependable Computing for Critical Applications, pages 150–161, 1995. → pages 18 [35] K. Kim, R. Jayabharathi, and C. Carstens. Speedgrade: an rtl path delay fault simulator. Asian Test Symp, pages 239 – 243, 2001. → pages 13 [36] S. Kim and A. Somani. Soft error sensitivity characterization formicroprocessor dependability enhancement strategy. Intl. Conf. on Dependable Systems and Networks, pages 416 – 425, 2002. → pages 3, 22, 101 [37] M. Li, P. Ramachandran, S. Sahoo, S. Adve, V. Adve, and Y. Zhou. Trace-based microarchitecture-level diagnosis of permanent hardware faults. Intl. Conf. on Dependable Systems and Networks, pages 22–31, 2008. → pages 67, 70, 104, 105 [38] M. Li, P. Ramchandran, S. Sahoo, S. Adve, V. Adve, and Y. Zhou. Understanding the propagation of hard errors to software and implications for resilient system design. Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, pages 265–276, 2008. → pages 3, 5, 22, 37, 38, 101, 105 [39] Y. Li, S. Makar, and S. Mitra. 
Casp: Concurrent autonomous chip self-test using stored test patterns. Conf. on Design and automation and test in Europe, pages 885–890, 2008. → pages 104, 105 113  [40] L.Rashid, K.Pattabiraman, and S.Gopalakrishnan. Intermittent hardware errors recovery: Modeling and evaluation. International Conference on Quantitative Evaluation of SysTems, 2012. → pages iii, 41 [41] G. Lyle, S. Chen, K. Pattabiraman, Z. Kalbarczyk, and R. Iyer. An end-to-end approach for the automatic derivation of application-aware error detectors. Intl. Conf. on Dependable Systems and Networks, pages 584–589, 2009. → pages 3, 95 [42] R. Lyons and W. Vanderkulk. The use of triple-modular redundancy to improve computer reliability. IBM J. Res. Dev., 6(2):200–209, 1962. → pages 2 [43] J. McPherson. Reliability challenges for 45nm and beyond. Design Automation Conf., pages 176–181, 2006. → pages 1, 2, 10, 106 [44] A. Meixner and D. Sorin. Error detection using dynamic dataflow verification. Intl. Conf. on Parallel Architecture and Compilation Techniques, pages 104 – 118, 2007. → pages 95 [45] A. Meixner and D. Sorin. Detouring: Translating software to circumvent hard faults in simple cores. Intl. Conf. on Dependable Systems and Networks, pages 80–89, 2008. → pages 46, 53, 64, 67, 102, 103, 105 [46] G. Moore. Cramming more components onto integrated circuits. Proceedings of the IEEE, 86(1), 1998. → pages 1 [47] H. Najaf-Abadi. Simplescalar ported to alpha/linux. http://hhnajafabadi.s3website-us-east-1.amazonaws.com/mase-alphalinux.htm. → pages 18 [48] E. Nightingale, J. Douceur, and V. Orgovan. Cycles, cells and platters: An empirical analysis of hardware failures on a million consumer pcs. European Conf. on Computer Systems, pages 343–356, 2011. → pages 1, 2, 12, 17, 52 [49] S. Pan, Y. Hu, and X. Li. Ivf: Characterizing the vulnerability of microprocessor structures to intermittent faults. Conf. on Design, Automation and Test in Europe, pages 238–244, 2010. → pages 100 [50] A. Papoulis. Probability, random variables, and stochastic processes. Fourth Edition, 2002. → pages 16, 17 [51] S. Park, T. Hong, and S. Mitra. Ifra: Instruction footprint recording and analysis for post-silicon bug localization in processors. IEEE Transactions 114  on Computer-Aided Design of Integrated Circuits and Systems, 28: 1545–1558, 2009. → pages 103, 105 [52] S. Park, A. Bracy, H. Wang, and S. Mitra. Blog: Post-silicon bug localization in processors using bug localization graphs. Design Automation Conf., pages 368–373, 2010. → pages 103 [53] I. Parulkar, T. Ziaja, R. Pendurkar, A. D’Souza, and A. Majumdar. A scalable, low cost design-for-test architecture for ultrasparc chip multi-processors. Int. Test Conf., pages 726 – 735, 2002. → pages 46, 53 [54] K. Pattabiraman, Z. Kalbarczyk, and R. Iyer. Application-based metrics for strategic placement of detectors. Pacific Rim Intl. Symp. on Dependable Computing, pages 75–82, 2005. → pages 38, 80 [55] K. Pattabiraman, G. Saggese, D. Chen, Z. Kalbarczyk, and R. Iyer. Automated derfivation of application-specific error detectors using dynamic analysis. IEEE Transactions on Dependable and Secure Computing, 8(5), 2011. → pages 52 [56] M. Payer and T. Gross. Generating low-overhead dynamic binary translators. Haifa Experimental Systems, (3), 2010. → pages 91 [57] A. Pellegrini and V. Bertacco. Application-aware diagnosis of runtime hardware faults. Intl. Conf. on Computer-Aided Design, pages 487–492, 2010. → pages 65, 90, 104, 105 [58] J. Plank, M. Beck, G. Kingsley, and K. Li. 
Libckpt: Transparent checkpointing under unix. Usenix Winter Technical Conf., pages 213–223, 1995. → pages 52, 59, 97 [59] M. Powell, S. Gupta, and S. Mukherjee. Architectural core salvaging in a multi-core processor for hard-error tolerance. Intl. Symp. on Computer Architecture, 37:93–104, 2009. → pages 4, 53, 64, 67, 102, 103, 105 [60] L. Rashid, K. Pattabiraman, and S. Gopalakrishnan. Towards understanding the effects of intermittent hardware faults on programs. Workshop on Dependable and Secure Nanocomputing, 2010. → pages iii [61] L. Rashid, K. Pattabiraman, and S. Gopalakrishnan. Formal diagnosis of hardware transient errors in programs. SELSE Workshop-Silicon Errors in Logic-System Effects, 2010. → pages iii, 23  115  [62] L. Rashid, K. Pattabiraman, and S. Gopalakrishnan. Dieba: Diagnosing intermittent errors by backtracing application failures. Silicon Errors in Logic - System Effects, 2012. → pages iii, 44, 65 [63] V. Reddy, A. Krishnan, A. Marshall, J. Rodriguesz, S. Natarajan, T. Rost, and S. Krishnan. Impact of negative bias temperature instability on digital circuit reliability. Reliability Physics Symp., pages 248 – 254, 2002. → pages 5, 17 [64] B. Romanescu and D. Sorin. Core cannibalization architecture: Improving lifetime chip performance for multicore processors in the presence of hard faults. Intl. Conf. on Parallel Architectures and Compiliation, pages 43–51, 2008. → pages 4, 12, 44, 53, 67, 68, 102, 103, 105 [65] S. Sahoo, M. Li, P. Ramachandran, S. Adve, V. Adve, and Y. Zhou. Using likely program invariants to detect hardware errors. Intl. Conf. on Dependable Systems and Networks, pages 70–79, 2008. → pages 97 [66] W. Sanders and J. Meyer. Stochastic activity networks: formal definitions and concepts. Lectures on formal methods and performance analysis, pages 315–343, 2002. → pages 7, 43, 47 [67] E. Schuchman and T. Vijaykumar. Rescue: A microarchitecture for testability and defect tolerance. Intl. Symp. on Computer Architecture, pages 160–171, 2005. → pages 53, 101 [68] M. Serafini, A. Bondavalli, and N. Suri. Online diagnosis and recovery: On the choice and impact of tuning parameters. IEEE Transactions on dependable and secure computing, 4(4):295 – 312, 2007. → pages 12, 103 [69] P. Shivakumar, S. W. Keckler, C. R. Moore, , and D. Burger. Exploiting microarchitectural redundancy for defect tolerance. Intl.Conf. on Computer Design, pages 481 – 488, 2003. → pages 102 [70] S. Shyam, K. Constantinides, S. Phadke, V. Bertacco, and T. Austin. Ultra low-cost defect protection for microprocessor piplines. Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, 41(11):73–82, 2006. → pages 67, 103, 105 [71] D. Siewiorek and R. Swarz. Reliable computer systems: Design and evaluation. A K Peters/CRC Press, 1998. → pages 2  116  [72] K. Skadron, M. Stan, K. Sankaranarayanan, W. Huang, S. Velusamy, and D. Tarjan. Temperature-aware microarchitecture: Modeling and implementation. ACM Transactions on Architecture and Code Optimization, 1(1):94–125, 2004. → pages 2, 27, 61, 80 [73] J. Smolens, B. Gold, J. Kim, B. Falsafi, J. Hoe, and A. Nowatzyk. Fingerprinting: Bounding soft-error detection latency and bandwidth. Symp. on Architectural Support for Programming Languages and Operating Systems, pages 224–234, 2004. → pages 3, 22, 101 [74] J. Smolens, B. Gold, J. Hoe, B. Falsafi, and K. Mai. Detecting emerging wearout faults. Workshop On Silicon Errors in Logic-System Effects, 2007. → pages 14, 18 [75] J. Srinivasan, S. Adve, P. Bose, and J. Rivers. 
The case for lifetime reliability-aware microprocessors. ACM SIGARCH Computer Architecture News, 32(2), 2004. → pages 102 [76] W. Tisdale, K. Williams, B. Timp, D. Norris, E. Aydil, and X. Zhu. Hot-electron transfer from semiconductor nanocrystals. Science Magazine, 328(5985):1543–1547, 2010. → pages 11 [77] L. Wang, K. Pattabiraman, Z. Kalbarczyk, R. Iyer., L. Votta, C. Vick, and A. Wood. Modeling coordinated checkpointing for large-scale supercomputers. Int. Conf. on Dependable Systems and Networks, pages 812 – 821, 2005. → pages 43, 44, 52 [78] N. Wang, A. Mahesri, and S. J. Patel. Material dependence of hydrogen diffusion: implications for nbti degradation. Int. Electron Devices Meeting. IEDM Technical Digest, page 691 695, 2005. → pages 11, 12 [79] N. Wang, A. Mahesri, and S. J. Patel. Examining ace analysis reliability estimates using fault-injection. Int. Symp. on Computer Architecture, page 460 469, 2007. → pages 18, 100 [80] C. Weaver and T. Austin. A fault tolerant approach to microprocessor design. Int. Conf. on Dependable Systems and Networks, pages 411 – 420, 2001. → pages 12 [81] P. Wells, K. Chakraborty, and G. Sohi. Adapting to intermittent faults in multicore systems. Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, pages 255–264, 2008. → pages 2, 9, 42, 46, 53, 102, 103 117  [82] M. Xu, R. Bodik, and M. D. Hill. A flight data recorder for enabling full-system multiprocessor deterministic replay. Int. symp. on Computer architecture, pages 122–135, 2003. → pages 72, 85, 91  118  
