Understanding and Improving the Error Resilience of Machine Learning Systems

by

Zitao Chen

B.Eng., China University of Geosciences (Wuhan), 2018

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Electrical and Computer Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

April 2020

© Zitao Chen, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled: "Understanding and Improving the Error Resilience of Machine Learning Systems", submitted by Zitao Chen in partial fulfillment of the requirements for the degree of Master of Applied Science in Electrical and Computer Engineering.

Examining Committee:

Karthik Pattabiraman, Electrical and Computer Engineering (Supervisor)
Prashant Nair, Electrical and Computer Engineering (Supervisory Committee Member)

Abstract

With the massive adoption of machine learning (ML) applications in HPC domains, the reliability of ML is also growing in importance. Specifically, ML systems are found to be vulnerable to hardware transient faults, which are growing in frequency and can result in critical failures (e.g., causing an autonomous vehicle to miss an obstacle in its path). Therefore, there is a compelling need to understand the error resilience of ML systems and protect them from transient faults.

In this thesis, we first aim to understand the error resilience of ML systems in the presence of transient faults. Traditional solutions use random fault injection (FI), which, however, is not suitable for pinpointing the vulnerable regions in the systems. Therefore, we propose BinFI, an efficient fault injector for finding the critical bits (where the occurrence of a fault would corrupt the output) in ML systems. We find that the widely used ML computations are often monotonic with respect to different faults.
Thus we can approximate the error propagation behavior of an ML application as a monotonic function. BinFI uses a binary-search-like FI strategy to pinpoint the critical bits. Our results show that BinFI significantly outperforms random FI in identifying the critical bits of the ML application, with much lower costs.

With BinFI able to characterize the critical faults in ML systems, we then study how to improve their error resilience. It is known that while the inherent resilience of ML can tolerate some transient faults (which would not affect the system's output), there are critical faults that cause output corruption in ML systems. In this work, we exploit the inherent resilience of ML to protect ML systems from critical faults. In particular, we propose Ranger, a technique to selectively restrict the ranges of values in particular network layers, which can dampen the large deviations typically caused by critical faults into smaller ones. Such reduced deviations can usually be tolerated by the inherent resilience of ML systems. Our evaluation demonstrates that Ranger achieves a significant resilience boost without degrading the accuracy of the model, and incurs negligible overheads.

Lay Summary

Soft errors are growing in prevalence in computing systems due to shrinking feature sizes. Machine learning (ML) systems are known to be vulnerable to soft errors, and can fail as a result. In this thesis, we start by analyzing the mathematical properties of common ML functions, and find that many of them exhibit monotonicity with respect to different faults. Thus the error propagation behavior can be approximated as a monotonic function, based on which we propose a binary-search-like fault injector to efficiently identify the critical bits in the systems (where the occurrence of a fault corrupts the output). Next, we propose a technique to selectively restrict the range of values in the ML systems.
This is to dampen the deviations caused by faults so that they can be tolerated by the ML system itself, which thus improves the error resilience of ML systems.

Preface

This thesis is the result of work carried out by myself, in collaboration with my supervisor, Dr. Karthik Pattabiraman, Dr. Guanpeng Li, and Dr. Nathan DeBardeleben. All chapters are based on work published in the 2019 International Conference for High Performance Computing, Networking, Storage, and Analysis, and another work in submission. I was responsible for conceiving the ideas, designing and conducting experiments, compiling the results, and writing the paper. My advisor was responsible for overseeing the project, providing feedback, and writing parts of the paper. Guanpeng and Nathan helped with the analysis and provided insight over the course of the project.

• Zitao Chen, Guanpeng Li, Karthik Pattabiraman, and Nathan DeBardeleben. 2019. "BinFI: An Efficient Fault Injector for Safety-Critical Machine Learning Systems." In The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '19), November 17-22, 2019. Acceptance rate: 20.9%.

• Zitao Chen, Guanpeng Li, Karthik Pattabiraman. "Ranger: Boosting Error-Resilience of Deep Neural Networks through Range Restriction." https://arxiv.org/abs/2003.13874

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Acknowledgments
1 Introduction
1.1 Soft Errors in Machine Learning Systems
1.2 Motivation and Approach
1.2.1 Understanding the Error Resilience of ML
1.2.2 Improving the Error Resilience of ML
1.3 Contributions and Summary
2 Background
2.1 Deep Learning
2.2 AVs Requirements
2.3 TensorFlow framework
2.4 Fault Model
2.5 Fault Injection Tool
3 Related Work
3.1 Assessing the Error Resilience of ML
3.2 Accelerating Fault Injection
3.3 Enhancing the Error Resilience of ML
3.4 Approximate Computing in ML
3.5 Value Truncation in ML
3.6 Summary
4 BinFI: An Efficient Fault Injector for Safety-Critical Machine Learning Systems
4.1 Error Propagation Example
4.2 Monotonicity and Approximate Monotonicity
4.3 Common ML Algorithms and Monotonicity
4.4 Error Propagation and (Approximate) Monotonicity
4.5 Binary Fault Injection - BinFI
4.6 Evaluation
4.6.1 Experimental Setup
4.6.2 RQ1: Identifying Critical Bits
4.6.3 RQ2: Overall SDC Evaluation
4.6.4 RQ3: Performance Overhead
4.7 Discussion
4.7.1 Inaccuracy of BinFI
4.7.2 Effect of Non-monotonicity
4.7.3 Discussion for Multiple Bit-flips Scenario
4.8 Summary
5 Ranger: Boosting Error Resilience of Deep Neural Networks through Range Restriction
5.1 Intuition behind Range Restriction
5.2 Selective Range Restriction
5.3 Evaluation for Ranger
5.3.1 RQ1: Effectiveness of Range Restriction
5.3.2 RQ2: Accuracy
5.3.3 RQ3: Overhead
5.3.4 RQ4: Effectiveness of Ranger under Reduced Precision Data Type
5.4 Discussion
5.4.1 Trade-off between Accuracy and Resilience
5.4.2 Design Alternative for Ranger
5.4.3 Limitations of Ranger
5.4.4 Discussion for Multiple Bit-flips Scenario
5.5 Summary
6 Conclusion and Future Work
6.1 Summary
6.2 Future Work
6.2.1 Evaluation on End-to-end Self-driving Platforms
6.2.2 Extend to Other Safety-critical ML Applications
6.2.3 Address the Remaining Critical Faults
Bibliography

List of Tables

Table 4.1: Major computations in state-of-the-art DNNs
Table 4.2: ML models and datasets for evaluation
Table 4.3: Number of critical bits in each benchmark
Table 4.4: Precision of BinFI in identifying critical bits
Table 4.5: Overall SDC deviation compared with ground truth. Deviation shown in percentage (%), and error bars shown at the 95% confidence intervals for random FI.
Table 5.1: DNN models and datasets used for the evaluation of Ranger
Table 5.2: Accuracy of the original DNN and the DNN protected with Ranger. + indicates accuracy improvement. Higher is better.
Table 5.3: Computation overhead (FLOPs) of Ranger. M stands for million; B stands for billion.
Table 5.4: Accuracy difference of the Dave model with different restriction bounds. Lower is better.

List of Figures

Figure 2.1: Example of a TensorFlow graph and how a fault is injected into the graph. The fault is injected by duplicating the TensorFlow graph, in which the customized operators support fault injection capability. The nodes in blue represent the original nodes in the graph, while the nodes in red are those added by TensorFI for fault injection [24].
Figure 4.1: An example of error propagation in a kNN model (k=1); the fault occurs at the tf.add operator (line 2).
Figure 4.2: An example of how an ML model exhibits monotonicity for different inputs. Assume the computation is a simple one-time convolution (inner product).
Figure 4.3: Illustration of binary fault injection. Critical bits are clustered around high-order bits.
Figure 4.4: Single-bit-flip outcomes on the self-driving steering systems. Blue arrows point to the expected steering angles; red arrows are the faulty outputs of the systems.
Figure 4.5: Single-bit-flip outcomes on the classifier models. Images in the first row are the input images; those in the second row are the faulty outputs.
Figure 4.6: Recall of critical bits by different FI techniques on four safety-critical ML systems. ranFI-1 is random FI with the same number of FI trials as allFI; ranFI-0.5 uses half the trials of allFI; ranFI-0.2 takes the same number of FI trials as BinFI.
Figure 4.7: FI trials (overhead) for different FI techniques to identify critical bits in AlexNet and VGG11.
Figure 4.8: Number of FI trials by BinFI and exhaustive FI to identify critical bits in VGG11 with different data types.
Figure 4.9: Recall and precision of BinFI on ML models with non-monotonic functions.
Figure 5.1: Example of a fault resulting in misclassification, and how Ranger enables fault correction by dampening the error so that it can be tolerated by the DNN. Darker colors represent larger activation values. Assume label A is the correct label and B is the incorrect one.
Figure 5.2: Workflow of Ranger.
Figure 5.3: Range of values observed in each ACT layer using different amounts of data on the VGG16 network (13 ACT layers in total). A total of 186,056 images (around 20% of the whole training set) were used.
Figure 5.4: SDC rates of the original classifier models and the models enhanced with Ranger.
For the models using the ImageNet dataset, we provide results for both top-1 and top-5 accuracy. Error bars range from ±0.04% to ±1.46% at the 95% confidence interval. Lower is better.
Figure 5.5: SDC rates of the original steering models and the models enhanced with Ranger. SDC is defined by thresholding different degrees of deviation from the correct steering angle (i.e., 15, 30, 60, and 120 degrees). Error bars range from ±0.03% to ±1.24% at the 95% confidence interval. Lower is better.
Figure 5.6: Relative SDC rate reduction in DNNs following the approach in Hong et al. [46] and Ranger. Error bars range from ±0.12% to ±1.38% at the 95% confidence interval. Higher is better.
Figure 5.7: SDC rates of DNNs using a 16-bit fixed-point data type. Error bars range from ±0.04% to ±1.33% at the 95% confidence interval. Lower is better.
Figure 5.8: SDC rates for the Dave model with different restriction bounds. Error bars range from ±0.14% to ±1.39% at the 95% confidence interval. Lower is better.

List of Abbreviations

DNN: Deep Neural Network
FI: Fault Injection
HPC: High Performance Computing
ML: Machine Learning
SDC: Silent Data Corruption

Acknowledgments

I would like to first thank my advisor, Dr. Karthik Pattabiraman. If it weren't for you, I would not have been able to have such a wonderful time at UBC, where I failed, I learned, and I grew up. It is your consistent support and valuable guidance throughout my Master's journey that has led me to where I am today. I shall never forget them, and will carry them forward to motivate myself to become a better researcher.

In addition to my advisor, I would like to thank my thesis examining committee, Dr. Sathish Gopalakrishnan and Dr. Prashant Nair, for their valuable feedback on this thesis.

I would like to thank Dr.
Guanpeng Li and other lab colleagues in the Dependable Systems Lab for their thoughtful discussions and help throughout my research endeavors.

Finally, I want to express my gratitude to my family, whose unconditional support is the reason I can be here today.

Chapter 1

Introduction

1.1 Soft Errors in Machine Learning Systems

In the recent past, there has been wide adoption of deep neural networks (DNNs) in the High-Performance Computing (HPC) domain [43, 63, 74, 94]. One of the defining characteristics of DNNs is the enormous amount of data they use, which requires large-scale HPC resources. Typical DNNs are composed of a large number of identical neurons that are highly parallel in nature [36], which maps naturally to parallel processors such as multi-core CPUs, GPUs, or even specialized hardware accelerators. These hardware accelerators are mini-supercomputers in their own right. For example, the Nvidia Orin system, a DNN accelerator, is able to deliver 200 TOPS (trillion operations per second) [15].

As DNNs are deployed on HPC systems, their reliability becomes important [27, 80]. In particular, HPC systems are vulnerable to hardware transient faults (i.e., soft errors), which are growing in frequency due to shrinking feature sizes (e.g., an HPC system could experience 10 to 20 soft errors weekly [62]). Transient faults typically manifest as bit-flips in the system and can be induced by high-energy particle strikes (e.g., alpha particles), transistor variability, thermal cycling, and malicious attacks [46]. This matters because HPC applications require not only high performance, but also high-fidelity results. For example, in the healthcare domain, DNNs have been trained to extract meaningful information from large amounts of historical data, and hence assist in precision medical diagnosis (e.g., detection of atrial fibrillation) [68, 93, 94].

Another emerging example is Autonomous Vehicles (AVs), in which DNNs are used to provide end-to-end autonomy.
An AV system can be considered as a moving data center that constantly aggregates sensory data from the surroundings, from which the system must produce prompt and reliable results to safely navigate the vehicle. In AVs, hardware transient faults could result in safety violations such as causing the AV system to miss obstacles in its path [57]. Further, many of these applications have to adhere to standards; e.g., the ISO 26262 standard [12] for AVs requires that there be no more than 10 FIT (Failures in Time), which translates to 10 failures in a billion hours of operation. Therefore, it is important to understand the error resilience of ML systems in the presence of soft errors, and to design efficient techniques to protect them from soft errors.

1.2 Motivation and Approach

This thesis focuses on two important research problems:

• Understanding the error resilience of ML - identifying the critical bits in ML systems. Critical bits are those bits in the systems where the occurrence of a transient fault could result in output corruption (e.g., misclassification).

• Improving the resilience of ML - protecting the ML systems from soft errors (particularly the critical faults). We refer to faults that lead to output corruption as critical faults.

1.2.1 Understanding the Error Resilience of ML

Challenge: A well-established approach to experimental resilience assessment is random fault injection (FI), which works by randomly sampling from a set of fault locations, and injecting the faults into the program to obtain a statistically significant estimate of its overall resilience. Random FI has been used on ML applications for overall resilience assessment [57, 70], but it is unfortunately not suitable for identifying the critical bits in the program, for two reasons. First, because random FI relies on statistical sampling, it is not guaranteed to cover all the critical
Second, the critical bits are often clustered in the state space, and randomsampling is unable to find them (e.g., Fig. 4.3).The only known approach to identify the critical bits of a ML program is ex-haustive FI, which involves flipping each bit of the program, and checking if itresulted in an output corruption. Unfortunately, exhaustive FI incurs huge perfor-mance overheads, as only one fault is typically injected in each trial (for control-lability). The time taken by exhaustive FI is therefore directly proportional to thenumber of bits in the program, which can be very large. Thus there is a need for anefficient approach to identity the critical bits in ML systems.Approach: To address the above challenge, we propose an efficient fault in-jection technique, which can identify the critical bits in ML applications, and alsomeasure the overall resilience, with reasonable overheads. The key insight of ourapproach is that the functions used in ML applications are often tailored for spe-cific purposes. For example, for a network to classify an image of a vehicle, theML computations would be designed in a way that they can produce larger re-sponse upon the detection of the vehicular feature in the image, while keepingthe response to irrelevant features small. This results in the functions exhibitingmonotonicity based on how compatible is the input with the target (e.g., vehicu-lar features), and the composition of these ML functions can be approximated asa single monotonic composite function. The monotonicity of the function helpsus prune the FI space and efficiently identify the critical bits. Analogous to howbinary search on a sorted array has an exponential time reduction compared tolinear search, our approach results in an exponential reduction in the FI space ofML applications to identify the critical bits, compared to exhaustive FI. 
Therefore, we call our approach Binary fault injection, or BinFI for short.

1.2.2 Improving the Error Resilience of ML

Challenge: Traditional methods to protect systems from soft errors use redundancy at the hardware level (e.g., Dual-Modular Redundancy). However, such techniques are prohibitively expensive, and are impractical to use in this domain. For example, the cost per unit and computation performance are of significant importance in the HPC domain; duplicating hardware components adds to the total cost of the system, and requires synchronization and voting. DNN-specific techniques have been proposed to enhance the error resilience of DNNs [57, 61, 79]. However, they either suffer from significant false positives (e.g., raising an alarm when there is no fault present), or require significant implementation effort (Chapter 3). Therefore, there is a compelling need for an efficient technique to boost the error resilience of DNNs.

Approach: To mitigate the critical faults in ML systems, we propose Ranger, a technique to selectively restrict the ranges of values in specific DNN layers, thereby reducing the large value deviations due to critical faults. Because DNNs are often not resilient to large value deviations, range restriction dampens them so that they can be tolerated by the inherent resilience of DNNs, i.e., the DNNs are still able to generate correct outputs. Ranger can mitigate critical faults based on two properties found in DNNs: 1) the monotonicity of operators in DNN applications (see Chapter 4); and 2) the inherent resilience of DNNs to insignificant errors [90] (see Chapter 5 for more details). Note that the values that exceed the restriction bound are not limited to those corrupted by critical faults. Values that are corrupted by faults are restricted by Ranger to prevent silent data corruptions (SDCs), while for normal values (uncorrupted by faults), the range restriction does not degrade the accuracy of the original model.
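The core of this idea can be sketched as a clamp on a layer's outputs. This is an illustrative sketch rather than Ranger's implementation; the layer name and bounds below are hypothetical values of the kind obtained by profiling the network on sample data:

```python
import numpy as np

# Hypothetical per-layer bounds, e.g. derived by profiling activations
# on a fraction of the training set.
RANGER_BOUNDS = {"relu_3": (0.0, 47.2)}

def ranger_restrict(activations, layer_name):
    """Clamp a layer's activations into the range observed during profiling.

    A bit-flip that inflates one activation to, say, 1e20 is dampened to
    the upper bound, a deviation small enough for the DNN's inherent
    resilience to absorb; in-range values pass through unchanged.
    """
    lo, hi = RANGER_BOUNDS[layer_name]
    return np.clip(activations, lo, hi)
```

For example, `ranger_restrict(np.array([1.0, 1e20, -3.0]), "relu_3")` maps the corrupted value 1e20 down to 47.2 while leaving the in-range 1.0 untouched.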
Ranger is an automated transformation applied to existing DNN models, and can significantly reduce the instances of SDCs. Thus, existing DNN applications (such as those used in AVs) can benefit from Ranger without major additional effort, such as retraining of the DNNs.

1.3 Contributions and Summary

The main contributions of this thesis are as follows:

• Present a fault injection technique to assess the fine-grained error resilience of ML systems. We analyze the common operations used in ML applications and identify that many of them exhibit monotonicity. Based on this analysis, we identify the clustering pattern of critical faults. We propose BinFI, a fault injection technique to find the critical bits and also measure the overall resilience of the program. BinFI identifies 99.56% of the critical bits with 99.63% precision in the evaluated systems, which significantly outperforms random FI with much lower costs.

• Introduce an efficient technique for improving the error resilience of ML systems. We propose Ranger, a technique to selectively restrict the ranges of values in DNNs and transform large values into smaller ones. Thus, large deviations caused by critical faults are restricted, and can be tolerated by the DNNs to generate correct outputs (due to the inherent resilience of DNNs). Ranger significantly enhances the resilience of the DNN models: it 1) reduces the SDC rates from 14.92% to 0.44% (in classifier DNNs) and from 37.24% to 2.49% (in the self-driving car DNNs); 2) does not degrade the accuracy of any of the evaluated models; and 3) incurs modest computation overheads.

Our findings thus provide not only an analytical understanding of the error resilience of ML applications (due to transient faults), but also a usable tool to efficiently identify the vulnerable regions in the systems (our artifact is available at https://github.com/DependableSystemsLab/TensorFI-BinaryFI). In addition, we design an efficient technique to protect the ML systems from soft errors.
This thus contributes to building error-resilient applications in the ML domain.

Chapter 2

Background

We start by providing a brief introduction to deep learning, then a brief overview of the ML framework we use. We finally present the fault model we assume and the fault injection tool we use.

2.1 Deep Learning

Deep learning is a field of artificial intelligence that typically leverages DNNs to address problems in both classification and regression. A typical DNN uses multiple layers to progressively extract high-level features from the raw input. Such a higher-level abstraction of the input data is called a feature map, and is extracted to preserve the meaningful information from each layer.

In this work, we primarily consider convolutional neural networks (CNNs), a class of DNNs that is widely used in HPC domains, many of which are safety-critical applications [21, 43, 53, 63, 74, 93, 94]. The primary computation usually occurs in the convolution (Conv) layers, which extract the underlying patterns. The convolution function applies a filter to a receptive field of the input feature map (one region of the input), and then slides the kernel to the next receptive field, extracting the underlying visual characteristics and generating a useful output feature map for the next layer.

The results from the convolution layer are then fed to the activation (ACT) layer, in which the ACT function determines how the neuron should be activated (e.g., by performing a non-linear transformation). Its output is then used as the input for the following layer. CNNs may also include local or global pooling layers to streamline the underlying computation. Pooling is typically used to reduce the dimensions of the data by combining the outputs of different neurons into a single neuron in the next layer. Max-pooling is a commonly used pooling function that extracts the largest neuron value from a cluster of neurons, while average-pooling reports the average neuron value.
A normalization layer is usually added to facilitate the training procedure and improve the network's accuracy [49]. Batch normalization is commonly used; it normalizes the activations in a network across a mini-batch [49]. A small number of fully-connected (FC) layers might also be stacked prior to the output layer.

A DNN model typically goes through two phases: 1) the training phase, in which the model is trained to learn a particular task; and 2) the inference phase, in which the model is used for actual deployment.

2.2 AVs Requirements

Autonomous vehicles (AVs) are complex systems that use ML to integrate data from various electronic components (e.g., LiDAR) and deliver real-time driving decisions. AVs entail several requirements: (1) high throughput (e.g., large amounts of data must be processed as they arrive from the sensors [1, 2]), and (2) low latency (e.g., applying the brakes upon the detection of a pedestrian in front of the vehicle within a few ms) [1, 5]. These requirements present significant challenges for reliability in AV applications.

As mentioned, AV reliability is mandated by stringent regulation - no more than 10 FIT, as governed by the ISO 26262 safety standard. There are two kinds of faults that can cause this standard to be violated [5]: (1) systematic faults, and (2) transient faults. The former are caused by design defects in hardware and software, the latter by cosmic rays and electromagnetic disturbances. Systematic faults in AVs can be mitigated at design time, and have been well explored in the literature [66, 76, 89]. Hardware transient faults, on the other hand, need runtime mitigation and are less studied in the context of AVs. The FIT rate due to soft errors is orders of magnitude higher than the 10 FIT requirement [57]. Therefore, it is important to study the effect of soft errors on AV reliability.

2.3 TensorFlow framework

In this thesis, we consider ML programs written in TensorFlow, which is among the most popular ML frameworks in use today [18].
The main advantage of TensorFlow is that it abstracts the operations in an ML program as a set of operators and a dataflow graph; this allows programmers to focus on the high-level programming logic. There are two main components in the TensorFlow dataflow graph: (1) the operator, which is the computational unit (e.g., matrix multiplication), and (2) the tensor, which is the data unit. Users can build ML models using the built-in operators or define their own customized operators.

2.4 Fault Model

In this study, we consider transient faults in the hardware (i.e., soft errors) that occur randomly during the inference phase of the DNNs. We consider the inference phase because DNNs are usually trained once, while the inference task is performed repeatedly in deployment, and hence inference is much more likely to experience faults during the system's lifetime. We assume faults arise in the processor's data path (ALUs and pipeline registers), and that faults in main memory, caches, and the register file are protected by ECC or parity [37]. This is in line with previous reliability studies [20, 34, 58]. In addition, we assume that faults do not arise in the control logic of the processor, as it constitutes only a small fraction of the total area of the processor. We only consider activated faults, as masked faults do not affect the program's execution.

We also assume that at most one fault occurs per program execution, because soft errors are relatively rare events given the typical inference time of DNNs. This also follows the fault models in prior studies [20, 25, 32, 58, 91].
Finally, we inject single-bit flips in the software implementation of the DNN, as a transient fault often manifests as a single bit flip at the software level [22], and studies have shown that multiple-bit-flip errors result in fault propagation patterns similar to those of single-bit-flip errors that cause SDCs [23, 77]. Though we primarily follow the single-bit-flip model, we also discuss our work in the context of the multiple-bit-flip scenario (in Section 4.7.3 for BinFI and Section 5.4.4 for Ranger, respectively). Our approach is also applicable to applications using other ML frameworks such as PyTorch, Keras, etc., as they are similar to TensorFlow in structure.

Figure 2.1: Example of a TensorFlow graph and how a fault is injected into the graph. The fault is injected by duplicating the TensorFlow graph, in which the customized operators support fault injection capability. The nodes in blue represent the original nodes in the graph, while the nodes in red are those added by TensorFI for fault injection [24].

2.5 Fault Injection Tool

In this thesis, we use TensorFI, an open-source FI tool for TensorFlow-based programs that allows faults to be injected directly into the TensorFlow graph [24]. The main advantage of TensorFI is that it is a generic FI tool that enables injection experiments on a wide range of TensorFlow-based ML programs. We show an example of TensorFI's operation in Fig. 2.1. The main idea of TensorFI is to duplicate the TensorFlow graph using a set of customized operators that perform the fault injection, so as not to interfere with the original graph.
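The duplicated-graph idea can be illustrated with a plain-Python wrapper. This is a hedged sketch of the concept only; TensorFI's real mechanism operates on TensorFlow graph nodes rather than Python functions, and the names below are hypothetical:

```python
import random

def make_injectable(op, corrupt, p=1.0):
    """Return an injection-capable duplicate of operator `op`.

    Mirrors TensorFI's approach: the original operator is left untouched,
    and a duplicate is created whose *output* may be corrupted (e.g., by
    a bit-flip) with probability `p`.
    """
    def wrapped(*args):
        out = op(*args)
        return corrupt(out) if random.random() < p else out
    return wrapped

# Usage sketch: an add operator whose output suffers a sign flip.
faulty_add = make_injectable(lambda a, b: a + b, corrupt=lambda x: -x)
```

Because the original operator is unchanged, the fault-free graph can still be run for the golden output, and the duplicate is run only for injection experiments.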
As the details of the operations are abstracted from the program in TensorFlow, and are also platform-specific, we inject faults directly into the output values of operators in the graph - this is in line with the FI performed in prior work [24, 25, 57, 79].

Chapter 3
Related Work

We classify related work into five broad areas.

3.1 Assessing the Error Resilience of ML

Li et al. [57] build a fault injector to randomly inject transient hardware faults into ML applications running on specialized hardware accelerators. Using the injector, they study the resilience of the program under different conditions by varying system parameters such as data types. A recent study designs a DNN-specific FI framework to inject faults into real hardware, and studies the trade-off between the model accuracy and the fault rate [70]. Santos et al. [29] investigate the resilience of ML under mixed-precision architectures by conducting neutron beam experiments. These papers measure the overall resilience of ML systems using random FI, while BinFI identifies the bits that can lead to safety violations in these systems. Unfortunately, random FI achieves very poor coverage in identifying the critical bits in ML systems.

3.2 Accelerating Fault Injection

Given the high overhead of FI experiments, numerous studies have proposed to accelerate FI experiments by pruning the FI space or even predicting the error resilience without performing FI [35, 41, 58, 78]. Hari et al. [41] propose Relyzer, a FI technique that exploits fault equivalence (the observation that faults that propagate along similar paths are likely to result in similar outputs) to selectively perform FI on pilot instructions that are representative of fault propagation. In follow-up work, the authors propose GangES, which finds that faults resulting in the same intermediate execution state will produce the same faulty output [78], thus pruning the FI space further. Li et al.
[58] propose Trident, a framework that can predict the SDC rate of instructions without performing any FI. The key insight is that the error propagation probability can be modeled jointly at the data-dependency, control-flow and memory levels to predict the SDC rate. However, this technique is not applicable to ML-based programs, because a numerical change in the final output does not always constitute an SDC in ML: the ML program does not rely on the exactness of the outcome (e.g., deviations can be tolerated as long as the prediction label remains correct), whereas Trident flags any deviation observed at the program's output as an SDC. In summary, none of these studies is tailored for ML programs, and their focus is on measuring the overall resilience of the system, unlike BinFI, which can identify the critical bits.

3.3 Enhancing the Error Resilience of ML

Several techniques have been proposed to enhance the error resilience of DNNs [46, 57, 71, 79]. For example, Li et al. [57] leverage value spikes in the neuronal responses as symptoms for fault detection. However, this technique has a false-positive rate of over 30% [57], and program re-execution is required upon the detection of a fault. This is not desirable for time-critical systems such as AVs, where the system needs to generate real-time predictions, and re-execution might delay the response. Similarly, Schorn et al. [79] build a supervised learning model to distinguish between benign and critical faults in DNNs, as well as to perform error correction. However, their technique requires extensive fault injection (FI), which is particularly time-consuming for large DNNs. For example, for a single input in the VGG16 model, there could be over 30 million values that can be corrupted by faults. Therefore, performing FIs to obtain a comprehensive training set is prohibitively expensive for large DNNs (such large DNNs are not studied in [79]). Zhao et al.
[95] propose a checksum technique to protect the ML model from faults in the convolution layers. However, it accounts neither for faults arising in other areas of the DNN, nor for a fault propagating into multiple neurons. Mahmoud et al. [61] use statistical fault injection to identify vulnerable regions in DNNs, and selectively duplicate the vulnerable computations for protection. However, their approach incurs high computational overheads (due to the use of duplication), and also provides only limited protection coverage. For example, they find that duplication with 30% overhead achieves only 40% ∼ 70% error coverage in most of the models (the best coverage is 90%, in the SqueezeNet model). In contrast, Ranger mitigates an average of 97% of the critical faults in all the classifier models (a 34X reduction in SDCs) with only 0.5% performance overhead and no accuracy loss.

3.4 Approximate Computing in ML

Many studies have used approximate computing techniques to lower the computation and energy demands of DNNs [40, 51, 90]. Leveraging the resilience of DNNs to inexactness in their computations, these studies propose to identify neurons that have low impact on the final output, and either replace them with an approximate low-precision design [90] or remove them altogether [40]. While Ranger also exploits the inherent resilience of DNNs, we restrict the ranges of values in selected DNN layers, rather than find sensitive neurons in the DNNs as prior studies do.

3.5 Value Truncation in ML

Value truncation has been widely used in the ML domain for various purposes (e.g., performance [36, 55], robustness to outliers [92], and privacy [50, 82]). Gradient clipping is used to address gradient explosion during the training of DNNs by rescaling the gradients to a certain range [36]. This is because an exploding gradient could result in an unstable network, and limiting the magnitude of the gradient can resolve this problem [36].
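Gradient clipping as used in [36] can be sketched as follows. This is a minimal NumPy version of norm-based clipping; the threshold value below is an arbitrary example, and ML frameworks provide this operation natively.

```python
import numpy as np

def clip_by_norm(grad, max_norm=5.0):
    """Rescale a gradient so that its L2 norm never exceeds max_norm,
    preserving its direction - this bounds the size of an update step."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad
```

Gradients whose norm is already below the threshold pass through unchanged, so clipping only intervenes in the "exploding" case.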
Truncated gradient is also used to induce sparsity in the learned weights of online learning algorithms [55]. In addition, truncation can be applied to the loss function of the model to improve the robustness of the learning algorithm [92]. Value truncation can also be applied to enable privacy-preserving ML, either in the input space during the pre-processing step [50] or on the gradient values [82] (usually coupled with adding some random noise).

Unlike the above papers, we use range restriction in Ranger to protect against transient faults, based on the observation that abnormally large intermediate values in the hidden layers often indicate potential output corruption due to a transient fault. This is based on the characteristics of critical faults induced by the monotone property observed in many DNNs [25]. In particular, we exploit the inherent resilience of DNNs to mitigate critical faults while also maintaining the accuracy of the original model, and we leverage the value dependency between layers for selective range restriction.

3.6 Summary

The growing frequency of transient faults, as well as the vulnerability of ML systems to them, has led to a plethora of studies that focus on understanding and improving the resilience of ML systems. Random fault injection, a widely used approach that is still adopted by many studies [29, 57, 70], is not well suited to identifying the critical faults in these systems. This thesis analyzes the unique fault propagation pattern in ML programs to characterize the critical faults, and proposes an efficient technique for identifying the critical bits in the systems.

To enhance the reliability of ML systems, prior efforts provide some useful insights, but either fall short of providing sufficient error coverage or require significant implementation effort.
This thesis leverages properties that are unique to ML models, and proposes an efficient fault mitigation technique that achieves high error coverage while incurring only modest overheads.

Chapter 4
BinFI: An Efficient Fault Injector for Safety-Critical Machine Learning Systems

In this chapter, we describe BinFI, an efficient fault injector for identifying the critical bits in ML systems. We consider a Silent Data Corruption (SDC) to be a mismatch between the output of a faulty program execution and that of a fault-free one. For example, in classifier models, an image misclassification due to soft errors would be an SDC. We leave the determination of whether an SDC is a safety violation to the application, as it depends on the application's context (see Section 4.6). Unlike traditional programs, where faults can lead to different control flows [41, 58, 65], faults in ML programs only result in numerical changes to the data within the ML models (though faults might also change the execution time of the model, this rarely happens in practice, as the control flow is not modified).

Critical bits are those bits in the program where the occurrence of a fault would lead to an SDC (e.g., unsafe scenarios in safety-critical applications). Our goal is to efficiently identify these critical bits in ML programs without resorting to exhaustive fault injection into every bit.

We first provide an example of how faults propagate in an ML program in Section 4.1. We then define the terms monotonicity and approximate monotonicity that we use throughout this chapter.
Next, we present our findings regarding the monotonicity of the functions used in common ML algorithms in Section 4.3, so that we can model the composition of all the monotonic functions involved in fault propagation as either a monotonic or an approximately monotonic function (Section 4.4). Finally, in Section 4.5, we show how we leverage the (approximate) monotonicity property to design a binary-search-like FI algorithm that efficiently identifies the critical bits.

4.1 Error Propagation Example

The principle of error propagation is similar across ML models: a transient fault corrupts a piece of data, and the erroneous data is then processed by all the subsequent computations up to the output layer of the model. In this section, we consider an example of error propagation in the k-nearest neighbor algorithm (kNN) with k=1, shown in Fig. 4.1. The program is written using TensorFlow, and each TensorFlow operator has a prefix of tf (e.g., tf.add). We use this code as a running example in this section.

We assume that the input to the algorithm is an image (testImg), and the output is a label for the image. Line 1 calculates the negative value of the raw pixels in testImg. The program also has a set of images called neighbors, whose labels are already known; the goal of the program is to assign the label of one of the neighbors (called the nearest neighbor) to testImg. Line 2 computes the relative distance of testImg to each of the neighbors. Line 3 generates the absolute distances, and line 4 sums the per-pixel distances into a total distance. Line 5 looks for the index of the nearest neighbor, whose label is the predicted label for testImg.

Figure 4.1: An example of error propagation in the kNN model (k=1); the fault occurs at the tf.add operator - line 2.

Assume a fault occurs at the add operator (line 2) and modifies its output, relativeDistance.
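The five lines of Fig. 4.1 can be reconstructed roughly as follows - a NumPy sketch of the TensorFlow code described above, not the figure's exact code (the choice of NumPy operators mirrors the description in the text):

```python
import numpy as np

def knn_predict(test_img, neighbors, labels):
    """1-NN prediction, mirroring the five lines described for Fig. 4.1."""
    neg = -test_img                                 # line 1: negate testImg
    relative_distance = neighbors + neg             # line 2: tf.add
    abs_distance = np.abs(relative_distance)        # line 3: absolute distances
    total_distance = np.sum(abs_distance, axis=1)   # line 4: per-pixel -> total
    nearest = int(np.argmin(total_distance))        # line 5: nearest neighbor
    return labels[nearest]
```

A fault injected into relative_distance in the nearest neighbor's row that enlarges its distance can flip the argmin, and hence the predicted label - the SDC case analyzed below.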
If each image has dimension (28,28), then the result of the tf.add operator is a matrix with shape (|N|,784), where |N| is the number of neighbors and each vector of 784 elements corresponds to one neighbor. If the ith image is the nearest neighbor, we have disi < disj, ∀j ∈ |N|, i ≠ j, where disi is the distance of the ith image to the test image. The fault might or might not lead to an SDC - we consider both cases below.

SDC: The fault occurs at (i,y), y ∈ [1,784] (i.e., corresponding to the nearest neighbor), and incurs a positive value deviation (e.g., flipping 0 to 1 in a positive value). The fault would increase disi, which indicates a potential SDC, as disi might no longer be the smallest among disj, j ∈ |N|. Similarly, a fault at (j,y), y ∈ [1,784] that incurs a negative deviation may result in an SDC.

No SDC: The fault at (i,y), y ∈ [1,784] incurs a negative value deviation, thus decreasing disi. This will not cause an SDC, since the ith image remains the nearest neighbor. Similarly, a fault incurring a positive deviation at (j,y), y ∈ [1,784] would always be masked.

4.2 Monotonicity and Approximate Monotonicity

We now define the terms monotonicity and approximate monotonicity as used in this thesis.

• Non-strictly monotonic function: A non-strictly monotonic function is either monotonically increasing: f(xi) ≥ f(xj), ∀xi > xj; or monotonically decreasing: f(xi) ≤ f(xj), ∀xi > xj. We say a function has monotonicity when it is non-strictly monotonic.

• Approximately monotonic: We call a function approximately monotonic when we approximate (assume) it to be a monotonic function. For example, we can approximate a piece-wise monotone function (one that exhibits different monotonicity in different intervals [59]) as a monotonically increasing or decreasing function. For instance, f(x) = 100∗max(x−1,0) − max(x,0) is monotonically increasing when x > 1, but not when x ∈ (0,1). Hence, we
Hence, weapproximate (assume) f (x) as a monotone function and thus we call it is ap-proximately monotonic.Error propagation (EP) function: We define EP function, which is a com-16posite function of those functions involved in propagating the fault from the faultsite to the model’s output. For example, there are MatMul (matrix multiplica-tion) and ReLu (rectified linear unit [64]) following the occurrence of the fault,in this case EP function is the composite function of both functions: EP(x) =ReLu(MatMul(xorg−xerr,w)), where w is the weight for the MatMul computation,xorg and xerr are the values before and after the presence of fault. Thus xorg−xerr isthe deviation caused by the fault at the fault site. Note that we are more interestedin the deviation caused by the fault rather than the value of the affected data (xerr).Thus, the input to EP is the bit-flip deviation at the fault site and the output of EPis the outcome deviation by the fault at the model’s output.The main observation we make is that most of the functions in ML model aremonotonic, as a result of which the EP function is either monotonic or approxi-mately monotonic. The main reason for the monotonicity of ML functions (andespecially DNNs) is that they are designed to recognize specific features in theinput. For example, the ML model in Fig. 4.2 is built for recognizing the imageof digit 1, so the ML computation is designed in a way that it will have strongerresponses to images with similar features as digit 1. Specifically, the ML com-putation would generate larger output, if the image has stronger features that areconsistent with the target. In Fig. 4.2, the output of the three inputs increases asthey exhibit higher consistency (values in the middle column for each input) withthe ML target. 
The final output of the model is usually determined by the numerical magnitude of these outputs (e.g., a larger output means higher confidence that the image is digit 1).

Figure 4.2: An example of how an ML model exhibits monotonicity for different inputs: three inputs with weak, medium and strong digit-1 features produce outputs of 8, 120 and 240, respectively. Assume the computation is a simple one-time convolution (inner-product).

The EP function can be monotonic or approximately monotonic, depending on the ML model. This (approximate) monotonicity can help us prune the fault injection space. For simplicity, let us assume that the EP function is monotonic. We can model it as: |EP(x)| ≥ |EP(y)|, (x > y ≫ 0) ∪ (x < y ≪ 0), where x, y are faults at bits of the same value (both 0 or both 1) in the same data (Section 4.4 has more details). We can therefore expect the magnitude of the deviation caused by larger faults (in absolute value) to be greater than that caused by smaller faults. Further, a larger deviation at the final output is more likely to cause an SDC. Based on the above, we can first inject x. If it does not lead to an SDC, we can reason that faults y at lower-order bits will not result in SDCs, without actually simulating them, as these faults would have smaller outcomes.

In practice, the EP function is often approximately monotonic (rather than monotonic), especially in real-world complex ML models such as DNNs. Our approach for pruning the fault injection space remains the same.
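The intuition behind Fig. 4.2 can be checked with a tiny inner-product computation. The kernel and inputs below are illustrative stand-ins for the figure's matrices, not its exact values:

```python
import numpy as np

# A 3x3 template for digit 1: large weights down the middle column.
kernel = np.array([[0, 4, 0],
                   [0, 4, 0],
                   [0, 4, 0]])

def convolve_once(image):
    """One-time convolution (inner-product) of an image with the kernel."""
    return int(np.sum(image * kernel))

weak   = np.array([[0, 1, 0], [0, 0, 0], [0, 0, 0]])  # faint middle column
medium = np.array([[0, 1, 0], [0, 1, 0], [0, 0, 0]])  # partial middle column
strong = np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]])  # full middle column
```

The more an input's middle column agrees with the template, the larger the response: the inner product is monotonic in each input element that carries a positive weight.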
This leads to some inaccuracy in our estimation, due to the approximation of monotonicity. Nonetheless, we show later in our evaluation that this inaccuracy is quite small.

4.3 Common ML Algorithms and Monotonicity

In this section, we first survey the major ML functions within state-of-the-art ML models in different domains, and then discuss their monotonicity. Most of these models are comprised of DNNs.

Image classification: LeNet [56], AlexNet [54], VGGNet [83], Inception [85–87], ResNet [44]. Some of these networks [44, 54, 83, 85] are the winning architectures that achieved the best performance in the ILSVRC challenges [28] from 2012 to 2017.

Object detection: Faster-RCNN [75], YoLo [73], DarkNet [72]. These networks produce outstanding performance in object detection (e.g., DarkNet can detect over 9000 categories [72]); since these networks include both object localization and classification, we are primarily interested in their classification thread.

Steering models: the Nvidia DAVE system [21], Comma.ai's steering model [8], Rambo [16], Epoch [11], Autumn [4]. These are popular steering models available in the open-source community 1 2 and are used as benchmarks in related studies [66, 89].

Health care and others: detection of atrial fibrillation [93]; arrhythmia detection [68]; skin cancer prediction [31]; cancer report analysis [94]; aircraft collision avoidance systems [52].

Many of these tasks are important in safety-critical domains (e.g., self-driving cars, medical diagnosis). We are interested in the computations that calculate the results (e.g., those in the hidden layers).

Table 4.1: Major computations in state-of-the-art DNNs
  Basic:               Conv; MatMul; Add (BiasAdd)
  Activation:          ReLu; ELu
  Pooling:             Max-pool; Average-pool
  Normalization:       Batch normalization (BN); Local Response Normalization (LRN)
  Data transformation: Reshape; Concatenate; Dropout
  Others:              SoftMax; Residual function
1 https://github.com/udacity/self-driving-car
2 https://github.com/commaai/research

Note that errors in the input layer, such as during the reading of the input image, are out of scope. The computations are summarized in Table 4.1.

We next discuss how most of the computations used in the above tasks are monotonic.

• Basic: The convolution computation (Conv) is mainly used in the kernels of DNNs to learn features. Conv is essentially an inner-product computation: X · W = Σ xi wi, xi ∈ X, wi ∈ W. Assume there are two faults at the same location, and x1, x2 (x1 > x2 > 0) are the deviations caused by the two faults. The monotonicity property is satisfied as |x1 wi| ≥ |x2 wi|. As mentioned, we are interested in the effect caused by the single bit-flip fault, so we do not consider data that are not affected by the fault. The multiply function is monotonic, and thus Conv is monotonic (similarly for matrix multiplication - MatMul).

• Activation: An activation (Act) function is often used to introduce non-linearity into the network, which is important for the network to learn non-linear mappings between the inputs and outputs [42]. A widely used activation function is the rectified linear unit (ReLu) [64], defined as f(x) = max(0,x), which is monotonic. The exponential linear unit (ELu) [60] is similar to ReLu and is also monotonic.

• Pooling: Pooling is usually used for non-linear down-sampling. Max-pooling and average-pooling are the two major pooling functions.
The max-pooling function extracts the maximum value from a set of data to pass downstream, and it is monotonic: max(xi, xk) ≥ max(xj, xk) if xi > xj, where xi, xj could be faults at different bits, and xk denotes all the data unaffected by the fault. Similarly, the average-pooling function calculates the average of a group of data, and it is also monotonic: avg(xi, xk) ≥ avg(xj, xk) if xi > xj.

• Normalization: Normalization (Norm) is used to facilitate the training of the network by dampening oscillations in the distribution of activations, which helps prevent problems like gradient vanishing [49]. Local response normalization (LRN) and batch normalization (BN) are the two Norm approaches used in the above networks. LRN implements a form of lateral inhibition by creating competition for big activities among neuron outputs from different kernels [54]. However, LRN does not satisfy the monotonicity property, as it normalizes the values across different neurons in a way that only needs to maintain competition (relative ordering) among the neighboring neurons. Thus a larger input might generate a smaller output, as long as the normalized value maintains the same relative ordering among the neighboring neurons (we also validated this experimentally). However, LRN was found to have significant limitations [83], and hence modern ML algorithms usually use BN instead of LRN [16, 21, 44, 72, 85–87].

BN normalizes the values in a mini-batch during the training phase to improve learning; normalization is neither necessary nor desirable during inference, as the output is expected to depend only on the input, deterministically [49]. Thus, in the inference phase, BN applies the same linear transformation to each activation given a feature map. We can therefore describe BN in the inference phase as f(x) = wx + b, where w, b are statistics learned during the training phase.
BN is thus monotonic, as f′(x) = w.

• Data transformation: Some computations simply transform the data (e.g., reshape the matrix) and do not alter the data values. Dropout is considered an identity mapping, since it performs dropout only in the training phase [84]. These transforming functions are trivially monotonic: f(xi) > f(xj) if xi > xj.

• Others: SoftMax [45] is often used in the output layer to convert the logits (the raw predictions generated by the ML model), one per class, into probabilities within (0,1) that add up to 1 across all classes. The SoftMax function is defined as: f(xi) = e^xi / Σ_{j=1}^{J} e^xj (for i = 1, ..., J), where xi is the predicted value for each class (J classes in total). The derivative of SoftMax with respect to xi is:

∂f(xi)/∂xi = ∂/∂xi ( e^xi / Σ_j e^xj )
           = [ (e^xi)′ · Σ_j e^xj − e^xi · e^xi ] / ( Σ_j e^xj )²
           = e^xi / Σ_j e^xj − ( e^xi / Σ_j e^xj ) · ( e^xi / Σ_j e^xj )
           = f(xi)(1 − f(xi))    (4.1)

Since f(xi) ranges in (0,1), the derivative of SoftMax is always positive; thus SoftMax is monotonic.

The residual function used in ResNet [44] has the property that the mapping from input to output at each layer (before the activation function) is augmented with an extra mapping: H(x) = F(x) + Wx, where H(x) is the mapping from input to output and F(x) is the residual function. The insight is that it is easier to learn the residual function F(x) than the original mapping H(x). For example, if H(x) is an identity mapping to be learned by the network, it is easier for the network to learn F(x) = 0 (with W = 1) than H(x) = x. There are different types of residual blocks; we consider the original one in [44]:

H(x) = BN(Conv(ReLu(BN(Conv(x))))) + Wx,

where Wx is the extra mapping. Assuming a fault occurs at xi ∈ x, H(xi) is monotonic if the derivatives of the original mapping F(xi) and of Wi xi are always positive or always negative. However, the derivative of F(xi) might vary due to fault propagation.
For example, assume H(x) = 100∗ReLu(x−1) − ReLu(x), where one fault propagates into two state spaces (hence both x−1 and x). The monotonicity of H(x) changes with the value of x: H(x) = 99x − 100 for x > 1, and H(x) = −x for x ∈ (0,1). Therefore, we call the residual block function approximately monotonic.

Thus we find that almost all the operations in Table 4.1 exhibit monotonicity. This is not an exhaustive list, however; there are other activation functions, some of which are non-monotonic (e.g., Swish [69], Sinusoid [33]). Nevertheless, the above models are representative of the models used in domains such as object detection (as they yield state-of-the-art performance and are frequently referred to by other studies), and the functions in Table 4.1 are common ML functions. We therefore assume that the computations in most ML models in the application domains mentioned above have the monotonicity property.

4.4 Error Propagation and (Approximate) Monotonicity

As mentioned earlier, the EP function is a composite function consisting of all the ML functions involved in error propagation. The EP function therefore satisfies the monotonic or approximately monotonic property, depending on the ML model.

• EP with monotonicity: Recall the example in Fig. 4.1 (kNN model), where the fault occurs at the tf.add operator and eventually affects the distance of the test image to the neighbor images via the tf.abs function. In this case, EP(x) = abs(x), which is monotonic.

• EP with approximate monotonicity: Consider another function in which a single fault x propagates into two state spaces (x−1, x), and ReLu is the subsequent function. If the weights associated with the respective data are (100, −1), we can model the EP function as: EP(x) = 100∗max(x−1,0) − max(x,0), which is monotonically increasing in some intervals and decreasing in others, e.g., EP(x) = 99x − 100 for x > 1 (monotonically increasing) and EP(x) = −x for x ∈ (0,1) (monotonically decreasing).
The EP function is thus approximately monotonic: its monotonicity for x > 1 is discontinued when x ∈ (0,1), and we approximate it as monotonic for x > 0.

We leverage the (approximate) monotonicity of the EP function for injecting faults. Our methodology applies to EP functions in all ML models, whether monotonic or approximately monotonic, though our approach might incur minor inaccuracy in the latter case, since it is an approximation (see our evaluation in Section 4.6.2). In the following discussion, we use the two terms (monotonicity and approximate monotonicity) interchangeably, and do not distinguish them unless explicitly specified. EP functions with monotonicity satisfy the following property:

|EP(x)| ≥ |EP(y)|, (x > y ≫ 0) ∪ (x < y ≪ 0)    (4.2)

where x, y are two faults at bits of the same value (both 0 or both 1) in the same data, with x at a higher-order bit and y at a lower-order bit. We consider faults that deviate considerably from 0, since faults around 0 only yield small deviations and would not lead to SDCs. A monotonic EP function can be monotonically increasing or decreasing. When EP is monotonically increasing, EP(x) ≤ EP(y) ≤ 0 for x < y ≪ 0, so Eq. 4.2 is satisfied. A monotonically decreasing EP arises only in multiply-related functions (e.g., Conv, linear transformation) where the weights are negative. For those functions too, |f(x)| > |f(y)|, (x > y ≫ 0) ∪ (x < y ≪ 0), so Eq. 4.2 is satisfied. Hence the EP functions in ML models satisfy Eq. 4.2. Note that for an EP function with approximate monotonicity, Eq. 4.2 might not always hold, since it is an approximation.

4.5 Binary Fault Injection - BinFI

In this section, we discuss how we can leverage the (approximate) monotonicity of the EP functions in ML systems to efficiently pinpoint the critical bits that lead to SDCs.
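The premise behind Eq. 4.2 - a flip at a higher-order bit produces a larger input deviation, and hence (through a monotonic EP function) a larger output deviation - can be checked numerically. The sketch below assumes IEEE-754 float32 values and uses a toy monotonic EP function (a single negative-weight multiply); the weight is an arbitrary example:

```python
import numpy as np

def flip_deviation(value, bit):
    """Absolute deviation caused by flipping one bit of a float32 value."""
    raw = np.float32(value).view(np.uint32)
    flipped = (raw ^ np.uint32(1 << bit)).view(np.float32)
    return abs(float(flipped) - float(value))

def ep_magnitude(deviation, weight=-3.0):
    """|EP(x)| for a toy EP built from one (negative-weight) multiply:
    a larger input deviation yields a larger output deviation."""
    return abs(weight * deviation)
```

For the value 1.0, bits 10 and 20 are both 0-bits (its mantissa is all zeros), so they belong to the same group in Eq. 4.2; flipping the higher-order bit 20 causes a much larger deviation than flipping bit 10.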
As mentioned earlier, our methodology applies to all ML programs whose EP functions exhibit either monotonicity or approximate monotonicity.

Given the (approximate) monotonicity of the EP function, the outcomes of different faults are ordered according to the original deviations caused by the faults. Faults at higher-order bits (larger input) have a larger impact on the final outcome (larger output), and are thus more likely to result in SDCs. Therefore, we search for an SDC-boundary bit, such that faults at higher-order bits lead to SDCs and faults at lower-order bits are masked. Fig. 4.3 illustrates this process. Finding such a boundary is similar to searching for a specific target within a sorted array (of bits), and thus we use a binary-search-like algorithm as the FI strategy. We outline the procedure of BinFI in Algorithm 1; the getSDCBound function is the core function that searches for the SDC boundary.

Figure 4.3: Illustration of binary fault injection (the fault occurs at the tf.add operator in Fig. 4.1). Critical bits are clustered around the high-order bits.

BinFI is run separately for each operator in the model, and it considers each data element in the output of the targeted operator. BinFI first converts the data into a binary expression (line 2), and then obtains the indices of the bits that are 0 and 1, respectively (lines 3, 4), because faults causing positive and negative deviations have different impacts (this is why we separate the cases x > y ≫ 0 and x < y ≪ 0). We then find the SDC-boundary bit among the 0-bits and among the 1-bits (lines 6, 7).

We illustrate how BinFI works on the tf.add operator in the example of Fig. 4.1 (kNN model). Assume the fault occurs at (i,y), y ∈ [1,784], which corresponds to the nearest neighbor; thus the 0-bits in Fig. 4.3 are those in the data at (i,y) of the output of the tf.add operator.
In this example, whether an SDC occurs depends on whether disi is still the nearest neighbor after the fault (Section 4.1).

BinFI first bisects the injection space and injects a fault at the middle bit (lines 16, 17). The next injection is based on the result of the current injection. For example, if the fault does not result in an SDC (i.e., disi is still the nearest neighbor), the next injection moves to a higher-order bit (from step 1 to step 2 in Fig. 4.3), because monotonicity indicates that no fault at a lower-order bit would lead to an SDC. More specifically, assume that the result of the fault at step 1 is dis′i = disi + abs(N), where abs(N) is the deviation caused by the injected fault. We can express the results of faults at lower-order bits as dis′′i = disi + abs(M). According to Eq. 4.2, abs(N) > abs(M) for N > M > 0; thus dis′i > dis′′i, so if dis′i does not cause an SDC, neither does dis′′i.

Moving the next FI to a higher- or lower-order bit is done by adjusting the front or rear index, since we are doing a binary-search-like injection (e.g., line 19 moves the next injection to lower-order bits by adjusting the front index). Step 2 in Fig. 4.3 shows an injection that causes an SDC, and hence faults at higher-order bits would also lead to SDCs; the next injection then moves to lower-order bits (from step 2 to step 3). We also record the index of the latest bit where a fault leads to an SDC, and it eventually becomes sdcBoundary (line 20). sdcBound in lines 6, 7 indicates the index of the SDC boundary, as well as how many critical bits there are (e.g., sdcBound_1 = 5 means there are 5 critical bits among the 1-bits).
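A simplified version of the getSDCBound search can be written as follows. The actual fault injection run is replaced here by an is_sdc oracle, and under the monotonicity assumption the critical bits form a prefix of the MSB-to-LSB bit list; this is a sketch of the search strategy, not BinFI's implementation:

```python
def get_sdc_bound(bit_indices, is_sdc):
    """Binary search over bit positions ordered from most- to
    least-significant. Assumes monotonicity: if a fault at one bit
    causes an SDC, faults at all higher-order bits do too.
    Returns (number of critical bits, number of injections used)."""
    front, rear = 0, len(bit_indices) - 1
    boundary = -1   # position (in bit_indices) of the last critical bit
    injections = 0
    while front <= rear:
        mid = (front + rear) // 2
        injections += 1
        if is_sdc(bit_indices[mid]):
            boundary = mid      # this bit is critical
            front = mid + 1     # search for the boundary at lower-order bits
        else:
            rear = mid - 1      # all lower-order bits are masked; go higher
    return boundary + 1, injections
```

For a group of 16 bits whose top 5 are critical, the search finds the boundary with at most ceil(log2(16)) = 4 injections instead of 16.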
Thus we can use them to calculate the SDC rate, computed as the number of critical bits over the total number of bits (line 8).

Algorithm 1: Binary-search fault injection
Data: opOutput ← output from the targeted operator
Result: SDC boundary among the 0-bits and 1-bits, and SDC rate, for each element of the targeted operator
1:  for each in opOutput do
2:    binVal = binary(each) // binary conversion
3:    0_list = getIndexOf0bit(binVal) // build the bit index (bits of 0) from MSB to LSB
4:    1_list = getIndexOf1bit(binVal)
5:    /* results for each element are stored in a list */
6:    sdcBound_0.append(bound0 = getSDCBound(binVal, 0_list))
7:    sdcBound_1.append(bound1 = getSDCBound(binVal, 1_list))
8:    sdcRate.append((bound0 + bound1) / length(binVal))
9:  end for
10: return sdcBound_1, sdcBound_0, sdcRate

Function: getSDCBound(binVal, indexList)
11: /* get the front and rear index for binary splitting */
12: front = 0; // higher-order bit
13: rear = getLength(indexList) - 1; // lower-order bit
14: sdcBoundary = 0; // by default there is no SDC
15: while front ≤ rear and front ≠ rear ≠ lastInjectedBit do
16:   currInjectedBit = indexList[(front + rear)/2]; // binary split
17:   FIres = fault_injection(binVal, currInjectedBit);
18:   if FIres results in SDC then
19:     front = currInjectedBit + 1; // move next FI to a lower-order bit
20:     sdcBoundary = currInjectedBit; // index of critical bit
21:   else
22:     rear = currInjectedBit - 1; // move next FI to a higher-order bit
23:   end if
24:   lastInjectedBit = currInjectedBit
25: end while
26: return sdcBoundary

4.6 Evaluation

We evaluate BinFI by asking three research questions:

RQ1: Among all the critical bits in a program, how many can be identified by BinFI, compared with random FI?
RQ2: How close is the overall SDC rate measured by BinFI to the ground-truth SDC rate, compared with that measured by random FI?
RQ3: What is the overhead of BinFI, compared with exhaustive and random FI approaches, and how does it vary with the data type?

4.6.1 Experimental Setup

As mentioned in Section 2.5, we use TensorFI 3, an open-source FI tool developed in our group, for performing FI experiments on TensorFlow-supported ML programs. We modify TensorFI to (1) provide support for DNNs, and (2) support BinFI's approach for injecting faults. The former is necessary because the current version of TensorFI does not support complex operations such as the convolutions used in DNNs - we added this support and made it capable of injecting faults into DNNs (these modifications have since been merged into the TensorFI mainline tool). The latter is necessary so that we have a uniform baseline to compare against.

3 https://github.com/DependableSystemsLab/TensorFI

Table 4.2: ML models and datasets for evaluation
  Dataset                    Description                      ML models
  MNIST [13]                 Hand-written digits              2-layer NN; LeNet-4 [56]
  Survive [17]               Prediction of patient survival   kNN
  Cifar-10 [7]               General images                   AlexNet [54]
  ImageNet [28]              General images                   VGG16 [83]
  German traffic sign [47]   Traffic sign images              VGG11 [83]
  Driving [10]               Driving video frames             Nvidia Dave [21]; Comma.ai [8]

ML models and test datasets

We consider 8 ML models in our evaluation, ranging from simple models (e.g., a 2-layer neural network, kNN) to DNNs that can be used in the self-driving car domain (e.g., the Nvidia DAVE system [21], the Comma.ai steering model [8]).

We use 6 different datasets, including general image classification datasets (MNIST, Cifar-10, ImageNet), which are used for the standard ML models. In addition, we use two datasets to represent two different ML tasks in AVs: motion planning and object detection. The first is a real-world driving dataset that contains images captured by a camera mounted behind the windshield of a car [10]; the dataset was recorded around Rancho Palos Verdes and San Pedro, California, and is labeled with steering angles. The second is the German traffic sign dataset, which contains real-world traffic sign images [47].
We use two different steering models for AVs from: (1) Comma.ai, a company developing AV technology that provides several open-source frameworks for AVs, such as openpilot [8]; and (2) the Nvidia DAVE self-driving system [21], which has been implemented in a real car for road tests [9] and has been used as a benchmark in other studies of self-driving cars [66, 89]. We build a VGG11 model [83] to run on the traffic sign dataset. Table 4.2 summarizes the ML models and test datasets used in this study.

FI campaigns

We perform different FI campaigns on different operations in the network. Due to the time-consuming nature of FI experiments (especially considering that we need to perform exhaustive FI to obtain the ground truth), we decide to evaluate 10 inputs for each benchmark. We also made sure that the inputs were correctly classified by the network in the absence of faults. In the case of the driving frame dataset, there was no classification, so we checked that the steering angle was correctly predicted in the absence of faults.

We evaluate all of the operations in the following networks: 2-layer NN, kNN, LeNet-4 and AlexNet (without LRN). However, FI experiments on all operators in the other networks are very time-consuming, and hence we choose for injection one operator per type in these networks, e.g., if there are multiple convolution operations, we choose one of them. We perform injections on all the data within the chosen operator, except for VGG16, which has over 3 million data elements in one operation's output. It is thus impractical to run FI on all of those data elements (it would take more than 276,525 hours to do exhaustive FI on one operator for one input), so we decide to evaluate our approach on the first 100 data items in the VGG16 model (this took around 13 hours for one input in one operator).
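The per-input filtering described above can be condensed into a campaign skeleton. This is an illustration only, not the actual tool: all callables are hypothetical stand-ins for the TensorFI hooks.

```python
def run_campaign(inputs, golden_predict, predict_with_fault, fault_sites):
    """Skeleton of one FI campaign: keep only inputs the model handles
    correctly fault-free, then inject one fault per site and record which
    sites produced an SDC (a wrong output) for each surviving input."""
    sdc_sites = {}
    for idx, (x, expected) in enumerate(inputs):
        if golden_predict(x) != expected:
            continue  # discard inputs that are misclassified even without faults
        sdc_sites[idx] = [site for site in fault_sites
                          if predict_with_fault(x, site) != expected]
    return sdc_sites
```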
We use a 32-bit fixed-point data type (1 sign bit, 21 integer bits and 10 mantissa bits), as the fixed-point datatype is more energy-efficient than the floating-point datatype [26, 38].

We organize our results by the research questions (RQs).

4.6.2 RQ1: Identifying Critical Bits

To answer this RQ, we consider four safety-critical ML systems used in AVs. This is because SDCs in these systems would be potential safety violations, and hence finding the critical bits that lead to unsafe scenarios is important. We use the steering systems from Nvidia [21] and Comma.ai [8] (on a real-world driving dataset), VGG11 [83] (on a real-world traffic sign dataset) and VGG16 [83] (on vehicle images in ImageNet). For the latter two datasets, we consider an SDC to be any misclassification produced by the system. However, for the two steering models, the output is the steering angle, which is a continuous value, and there is no clear definition of what constitutes an SDC (to our knowledge). Consequently, we came up with three different values of the acceptable threshold for deviations of the steering angle from the correct angle to classify SDCs, namely 5, 30 and 60 degrees. Note that these values include both positive and negative deviations.

Figure 4.4: Single-bit flip outcome on the self-controlled steering systems. Blue arrows point to the expected steering angles and red arrows are faulty outputs by the systems.

Figure 4.5: Single-bit flip outcome on the classifier models. Images in the first row are the input images; those in the second row are the faulty outputs.

Fig. 4.4 shows the test images in our evaluation and exemplifies the effect of SDCs. There are two types of deviation, since the threshold covers both positive and negative deviations.

Apart from steering the vehicle, it is also crucial for an AV to correctly identify traffic signs and surrounding vehicles, as incorrect classification in such scenarios could lead to fatal consequences (e.g., misclassifying a "stop" sign as a "go ahead" sign).
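A single-bit flip in this representation can be simulated as below. This is a hedged sketch: we assume a sign-and-magnitude encoding for illustration (the actual encoding used by the injector may differ), with bit 31 as the sign, bits 30-10 as the integer part and bits 9-0 as the fraction.

```python
FRAC_BITS = 10  # 1 sign bit + 21 integer bits + 10 fraction bits

def flip_bit(value, bit):
    """Flip one bit (0 = LSB of the fraction, 31 = sign bit) in a 32-bit
    sign-and-magnitude fixed-point encoding of `value`; return the decoded
    faulty value. Illustrative only: the encoding is an assumption."""
    sign = 1 if value < 0 else 0
    magnitude = int(round(abs(value) * (1 << FRAC_BITS)))
    if bit == 31:
        sign ^= 1                 # flipping the sign bit negates the value
    else:
        magnitude ^= 1 << bit     # flipping a data bit perturbs the magnitude
    return (-magnitude if sign else magnitude) / (1 << FRAC_BITS)
```

Flips in the highest integer bits add deviations of up to 2^20 to the value, while flips in the fraction bits change it by less than 1, which is why critical bits concentrate in the high-order positions.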
Fig. 4.5 shows the potential effects of SDCs in such scenarios.

We first perform exhaustive FI and record the results from flipping every bit in the injection space. We also record all the critical bits identified by BinFI, and by random FI for different numbers of trials. We report the recall (i.e., how many of the critical bits were found) for BinFI and random FI. Exhaustive FI is the baseline, and it has a 100% recall as it covers the entire injection space.

Figure 4.6: Recall of critical bits by different FI techniques on four safety-critical ML systems. ranFI-1 is random FI with the same number of FI trials as allFI; the number of FI trials for ranFI-0.5 is half of that of allFI. ranFI~0.2 uses the same number of FI trials as BinFI.

Fig. 4.6 presents the recall for different FI approaches on the four safety-critical ML systems. In the interest of space, we report only the recalls for the four simple models: LeNet - 99.84%, AlexNet - 99.61%, kNN and NN - both 100%. We also found that the same operation (e.g., the Conv operation) in different layers of a DNN exhibited different resilience (which is in line with the finding in prior work [57]); BinFI is able to achieve a high recall nonetheless.
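Recall (and the precision used later in this section) can be computed directly from the sets of flagged bit positions; a minimal sketch, where each position is an (element, bit) pair:

```python
def recall_precision(found, ground_truth):
    """found: positions flagged critical by an FI method;
    ground_truth: positions flagged critical by exhaustive FI."""
    found, ground_truth = set(found), set(ground_truth)
    true_positives = len(found & ground_truth)
    recall = true_positives / len(ground_truth)
    precision = true_positives / len(found) if found else 0.0
    return recall, precision
```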
The reason why BinFI has a recall of 100% in kNN and NN is that the EP functions in these models can be modeled as monotonic functions, unlike the other models, where the EP function is only approximately monotonic.

Recall is computed as the number of critical bits (found by each FI approach) in the whole state space, divided by the total number of critical bits found by exhaustive FI (ground truth). We also provide the number of critical bits (obtained from exhaustive FI) found in each benchmark, and the total injection space, in Table 4.3, across all 10 inputs. The number of critical bits found in the two steering models decreases as the SDC threshold grows, since a larger threshold implies that fewer SDCs are flagged.

Table 4.3: Number of critical bits in each benchmark.
Model & Dataset       | Critical Bits | Total Bits | Percentage in FI Space (%)
Dave-5 - Driving      | 2,267,185     | 4,289,160  | 52.86
Dave-30 - Driving     | 2,092,200     | 4,289,160  | 48.78
Dave-60 - Driving     | 1,754,487     | 4,289,160  | 40.91
Comma.ai-5 - Driving  | 3,352,848     | 8,144,320  | 41.71
Comma.ai-30 - Driving | 2,679,756     | 8,144,320  | 32.90
Comma.ai-60 - Driving | 2,217,353     | 8,144,320  | 27.23
VGG11 - Traffic sign  | 808,999       | 4,483,840  | 18.04
VGG16 - Vehicles      | 16,919        | 186,000    | 9.10

For the two steering models (the left six histograms in Fig. 4.6), BinFI is able to find over 98.71% of the critical bits (an average of 99.61%) that would lead to safety violations, across all three thresholds. The consistently high recall of BinFI under different SDC thresholds suggests that BinFI is agnostic to the specific threshold used for classifying SDCs in these models. Similarly, for the other two models, BinFI also achieves very high recalls. We thus observe a consistent trend across all the benchmarks, each of which exhibits different sensitivity to transient faults, as shown in Table 4.3.

The coverage of random FI depends on the number of trials performed. We consider different numbers of trials for random FI, ranging from 5% to 100% of the trials for the exhaustive FI case.
These are labeled with the fraction of trials performed. For example, ranFI-0.5 means that the number of trials is 50% of that of exhaustive injection. BinFI performs about 20% of the trials of exhaustive injection (see Fig. 4.7 in RQ3), so the corresponding random FI experiment (ranFI~0.2) with the same number of trials identifies only about 19% of the critical bits. Even in the best case, where the number of trials of random FI is the same as that of exhaustive FI (i.e., ranFI-1.0), the recall is less than 65%, which is much lower than BinFI's recall of nearly 99%. This is because the bits chosen by random FI are not necessarily unique, especially as the number of trials increases, and hence not all bits are found by random FI.

In addition to recall, we also measure the precision (i.e., how many of the bits identified as critical are indeed critical). This matters because low precision might raise false alarms and waste resources on over-protecting non-critical bits. We report the precision for the four safety-critical models in Table 4.4 (precision values for the remaining four models range from 99.31% to 100%).

Table 4.4: Precision for BinFI on identifying critical bits.
Model         | DAVE* (Driving) | Comma.ai* (Driving) | VGG11 (Traffic sign) | VGG16 (Vehicles)
Precision (%) | 99.60           | 99.70               | 99.99                | 99.14
* Results are averaged over the three thresholds.

We find that BinFI has a precision of over 99% across all of the benchmarks, which means that it flags very few non-critical bits as critical.

In summary, we find that BinFI achieves an average recall of 99.56% with 99.63% precision, thus demonstrating its efficacy at finding critical bits compared with random FI, which finds only 19% of the critical bits with the same number of trials.

4.6.3 RQ2: Overall SDC Evaluation

We calculate the overall SDC probabilities measured by BinFI and random FI for the ML models. Note that the overall SDC probability weights each operator's per-bit SDC probability by the number of bits in the operator.
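This weighting is just a bit-count-weighted average over the operators; a minimal sketch:

```python
def overall_sdc(operators):
    """operators: list of (num_bits, per_bit_sdc_probability) pairs,
    one per operator in the model. Returns the overall SDC probability,
    weighting each operator's per-bit probability by its bit count."""
    total_bits = sum(bits for bits, _ in operators)
    return sum(bits * prob for bits, prob in operators) / total_bits
```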
So we weight the per-bit SDC probability by the number of bits in the operator to obtain the overall SDC probability. For example, the overall SDC probability for an operator with 100 bits (with SDC probability 20%) and another operator with 10,000 bits (with SDC probability 5%) will be (100 * 0.20 + 10000 * 0.05) / (100 + 10000) = 5.15%.

To quantify the accuracy of BinFI in measuring the overall SDC rate, we measure the deviation of the SDC probabilities from the ground truth obtained through exhaustive FI in Table 4.5 (error bars at the 95% confidence intervals are shown for random FI). We limit the number of trials performed by random FI to those performed by BinFI to obtain a fair comparison. Overall, we find that the SDC probabilities measured by BinFI are very close to the ground truth obtained through exhaustive FI. Table 4.5 also shows that BinFI achieves 0% deviation in two ML models (NN and kNN), which is because the EP function in these two models is monotonic.

Table 4.5: Overall SDC deviation compared with ground truth. Deviations shown in percentage (%); error bars shown at the 95% confidence intervals for random FI.
Model (Dataset)       | BinFI | ranFI~0.2           | ranFI~0.1           | ranFI~0.05
Dave-5 (Driving)      | 0.070 | 0.100 (±0.01~±0.31) | 0.101 (±0.02~±0.44) | 0.119 (±0.02~±0.62)
Dave-30 (Driving)     | 0.036 | 0.096 (±0.04~±0.30) | 0.134 (±0.06~±0.42) | 0.172 (±0.09~±0.59)
Dave-60 (Driving)     | 0.125 | 0.116 (±0.05~±0.29) | 0.118 (±0.08~±0.42) | 0.234 (±0.11~±0.59)
Comma.ai-5 (Driving)  | 0.049 | 0.064 (±0.23~±0.24) | 0.040 (±0.33~±0.34) | 0.095 (±0.47~±0.48)
Comma.ai-30 (Driving) | 0.212 | 0.092 (±0.22~±0.24) | 0.160 (±0.31~±0.34) | 0.206 (±0.45~±0.48)
Comma.ai-60 (Driving) | 0.008 | 0.060 (±0.21~±0.22) | 0.094 (±0.30~±0.31) | 0.212 (±0.43~±0.44)
VGG11 (Traffic sign)  | 0.002 | 0.101 (±0.23~±0.26) | 0.156 (±0.33~±0.38) | 0.199 (±0.47~±0.53)
VGG16 (ImageNet)      | 0.042 | 1.039 (±0.62~±1.07) | 0.778 (±0.86~±1.58) | 0.800 (±1.28~±2.14)
AlexNet (Cifar-10)    | 0.068 | 0.319 (±0.15~±0.18) | 0.420 (±0.22~±0.25) | 0.585 (±0.31~±0.36)
LeNet (Mnist)         | 0.030 | 0.228 (±0.45~±0.46) | 0.366 (±0.64~±0.65) | 3.38 (±0.887~±0.95)
NN (Mnist)            | 0.000 | 0.849 (±0.33~±1.1)  | 0.930 (±0.47~±1.55) | 0.989 (±0.67~±2.20)
kNN (Survival)        | 0.000 | 0.196 (±0.05~±0.2)  | 0.190 (±0.19~±0.32) | 0.197 (±0.195~±0.47)

While the above results demonstrate that BinFI can be used to obtain accurate estimates of the overall resilience, random FI achieves nearly the same results with a much lower number of trials. Therefore, if the goal is only to obtain the overall SDC probabilities, then random FI is more efficient than BinFI.

4.6.4 RQ3: Performance Overhead

We evaluate the performance overhead of each FI technique in terms of the number of FI trials, as the absolute times are machine-dependent. We show the results for AlexNet and VGG11 in Fig. 4.7. The overall FI trials for VGG11 are lower than
Thus, it is closer to 6/31(≈ 20%). Weobserve a similar trend in all the benchmarks (not shown due to space constraints).111197044838421756287993111197044838455598522419227799311209621756287993108781439965439021998-10000010000030000050000070000090000011000001300000Alexnet VGG11FI trilasallFI bin FI ranFI - 1 ranFI - 0.5 ranFI - 0.25 ranFI ~ 0.2 ranFI ~ 0.1 ranFI ~ 0.05Figure 4.7: FI trials (overhead) for different FI techniques to identify criticalbits in AlexNet and VGG11.The performance overhead of BinFI also depends on the number of bits usedin the data representation. In general, the overhead gains increase as the numberof bits increases (as BinFI’s time grows logarithmically with the number of bits,while exhaustive FI’s time grows linearly with the number of bits). To validatethis intuition, we evaluate BinFI on VGG11 using datatypes with different width(16-bit, 32-bit and 64-bit) and compare its overhead with that of exhaustive FIin Fig. 4.8. We find that the growth rate of BinFI is indeed logarithmic with thenumber of bits. We also measure the recall and precision of BinFI for the threedata-types. The recall values are 100%, 99.97%, 99.99% for 16-bit, 32-bit and 64bits respectively, while the respective precision values are 100%, 99.96%, 99.94%.This shows that the precision and recall values of BinFI are independent of thedatatype.4.7 DiscussionIn this section, we discuss the inaccuracy of BinFI, followed by the effects of non-monotonicity on its efficacy.344296 5531 666015360317446451202000040000600008000016-bit 32-bit 64-bitFI trialbinFIexhaustive FIFigure 4.8: Numbers of FI trials by BinFI and exhaustive FI to identify criti-cal bits in VGG11 with different datatypes.4.7.1 Inaccuracy of BinFIAs mentioned in Section 4.4, the EP function is often only approximately mono-tonic - this is the main source of inaccuracy for BinFI, as it may overlook thecritical bits in the non-monotonic portions of the function. 
Assume EP(x) = 2 * max(x - 1, 0) - max(x, 0), and that a fault raises a deviation of x = 2. Then EP(2) = 0, which does not result in an SDC. According to Eq. 4.2, BinFI assumes that no fault with a deviation 0 < y < 2 will cause an SDC, as |EP(y)| ≤ |EP(2)| = 0. However, |EP(0.5)| = 0.5, which violates Eq. 4.2: a fault incurring a deviation of 0.5 could be a critical bit left unidentified by BinFI (thus resulting in inaccuracy). This is the reason why BinFI incurs minor inaccuracy, except for the two models in Table 4.5 in which the EP function is monotonic (not just approximately so). Nevertheless, our evaluation shows that BinFI incurs only minor inaccuracy, as it has an average recall of 99.56% with 99.63% precision, and the overall SDC rate it reports is very close to the ground truth as well. This is because the non-monotonicity occurs in most cases when the fault is small in magnitude, and is hence unlikely to lead to an SDC.

4.7.2 Effect of Non-monotonicity

While our discussion in Section 4.2 shows that many computations within state-of-the-art ML models are monotonic, there are also some models that use non-monotonic functions. Though BinFI requires the functions to be monotonic so that the EP function is (approximately) monotonic, we also want to measure the effects when BinFI is run on models using functions that are not monotonic. Therefore, we conduct an experiment to evaluate BinFI on two networks using different non-monotonic functions. Specifically, we use a neural network with a Swish activation function [69], and AlexNet with LRN [54]. As before, we measure the recall and precision of BinFI on these models.

For the NN model, Fig. 4.9 shows that BinFI has a recall of 98.9% and a precision of 97.3%, which is quite high. The main reason is that the Swish function is monotonic across a non-trivial interval, thus exhibiting approximate monotonicity, so BinFI works well on it.
On the other hand, when it comes to AlexNet with LRN, BinFI has a high recall (98.8%) but a low precision (76.4%), which means BinFI is not suitable for this model, as LRN is non-monotonic in nature (Section 4.3).

Figure 4.9: Recall and precision for BinFI on ML models with non-monotonic functions (NN with Swish: recall 98.9%, precision 97.3%; AlexNet with LRN: recall 98.8%, precision 76.4%).

4.7.3 Discussion for the Multiple Bit-flip Scenario

In this section, we discuss the potential application of BinFI in the presence of multiple bit-flips. The main idea is that we can consider multiple bit-flips as the composition of several single bit-flips. Assume there are two bit flips occurring at the same time. In order to identify the bit flips that result in SDCs (i.e., critical bits), we can fix one bit at one location and then perform binary fault injection on the other bit using BinFI, and vice versa. We leave the detailed exploration of BinFI on multiple bit-flip faults to future work.

4.8 Summary

In this chapter, we describe how transient faults can result in SDCs in common ML programs. We then provide an analysis of the monotone property found in common ML functions, based on which we approximate the error propagation behavior of the model as a monotone function. This leads to the observation that critical faults tend to cluster around high-order bits, and that the effects of different faults can be viewed as a sorted array. Based on this observation, we present BinFI, a binary-search like fault injection technique to efficiently identify the critical bits.

We evaluate BinFI on 8 ML benchmarks, including ML systems that can be deployed in AVs. Our evaluation demonstrates that BinFI can correctly identify 99.56% of the critical bits with 99.63% precision, which significantly outperforms conventional random FI-based approaches. It also incurs significantly lower overhead (by 5X) than exhaustive FI techniques.
BinFI can also accurately measure the overall resilience of the application.

Chapter 5

Ranger: Boosting Error Resilience of Deep Neural Networks through Range Restriction

In this chapter, we describe Ranger, an automated technique to prevent transient faults from corrupting the output of ML applications.

To protect DNNs from transient faults (particularly critical faults), we selectively restrict the ranges of values in different DNN layers. This technique, which we call Ranger, is based on two properties of DNNs: 1) the monotonicity of operators in DNN applications (described in Chapter 4); and 2) the inherent resilience of DNNs to insignificant errors [90].

We start by discussing how the characteristics of critical faults in DNNs, together with the inherent resilience of DNNs, can be leveraged to improve the error resilience of DNNs. We then elaborate on the design of selective range restriction in DNNs, which enables effective error correction without degrading the accuracy of the model.

Figure 5.1: Example of a fault resulting in misclassification, and how Ranger enables fault correction by dampening the error so that it can be tolerated by the DNN. Darker colors represent larger activation values. Assume label A is the correct label and B is the incorrect label.

5.1 Intuition behind Range Restriction

Monotonicity. Chapter 4 discusses the monotone property found in the common functions used in DNNs (e.g., ReLu, SoftMax).
This monotone property implies that for faults to cause a large deviation at the output layer, they must also cause a large deviation at the place where they occur. This leads to the observation that critical faults tend to cluster around high-order bits (i.e., those causing large value deviations), while faults at low-order bits tend to be masked and not affect the output. Range restriction is thus analogous to "transferring" faults from high-order bits to low-order bits (since the effect caused by the fault is reduced), so that they can be tolerated by the model itself.

Mitigating critical faults. Typically, critical faults in DNNs corrupt the program's outputs through fault amplification, which propagates a single fault into multiple values, causing large value deviations. Restricting the values in the DNN reduces the deviations caused by the faults, which in turn lowers the chance of a fault resulting in an SDC. This is because DNNs can inherently tolerate small value deviations and generate correct outputs regardless. Ranger leverages this property to prevent critical faults from corrupting the outputs, while tolerating other faults.

Figure 5.2: Workflow of Ranger: derive restriction bounds from the unprotected DNN, then insert Ranger to obtain the protected DNN.

We provide an example in Fig. 5.1, where the goal of Ranger is to reduce the deviations caused by critical faults (i.e., significant errors) into smaller ones (i.e., insignificant errors). Despite the insignificant errors, the system is able to generate the correct output due to the inherent resilience of DNNs.

Maintaining the accuracy of the original models. While Ranger can be used for fault mitigation, it is also important to ensure that it does not affect the accuracy of the original models. In our evaluation, we derive the restriction bounds based on the training data, and evaluate the model on a separate validation set, to represent an unbiased approximation for test data, i.e., unseen data.
While it is possible that Ranger might truncate values during fault-free execution, this rarely happens, as the restriction bounds collected from the training data are usually representative of the value ranges (e.g., in the VGG16 model, the values exceed the restriction bound in the fault-free execution in only 5 out of 50,000 cases). Moreover, even when the neuron values in the fault-free execution are restricted (e.g., in the 5 cases mentioned above), Ranger only truncates the values to the restriction bound. Such value reduction can often be tolerated inherently by the DNN, and does not lower its accuracy (Section 5.3.2). Note that we do not require the training data to represent the faulty outcomes (unlike in prior work [79]), and hence we do not need to perform expensive FI experiments to use Ranger (though we use FI to evaluate its coverage).

5.2 Selective Range Restriction

In this section, we explain how to implement Ranger in DNNs. The workflow of Ranger is shown in Fig. 5.2. The first step is to derive the restriction bounds from the network through profiling, after which we transform the network to insert Ranger on selected layers. Finally, the protected DNN with Ranger can be released for deployment.

Step 1: Deriving restriction bounds. The restriction bound can be derived from: (a) the function itself, e.g., the Tanh function has a bound of (-1, 1); for these functions, we do not have to learn the distribution of the activation (ACT) values; or (b) statistical sampling: for functions that are unbounded (such as the ReLu function, which has no upper bound), the restriction bound can be derived by sampling the distribution of the values of the function, from which we can choose an appropriate bound. The selection of the bound can be adjusted based on whether we are willing to trade accuracy loss for resilience boosting. A conservative approach is to set the bound to the maximal observed value, such that it is less likely to affect the accuracy of the model.
Alternatively, we can choose a smaller bound to gain higher resilience boosting at the cost of accuracy. We adopt the conservative approach, but also study the latter approach in our evaluation.

Step 2: Inserting Ranger into selected DNN layers. After deriving the restriction bounds, the next step is to apply Ranger to selected layers in the DNN. As mentioned, a DNN typically consists of different layers (e.g., Conv, ACT, Pooling, Normalization, FC, etc.). While all of these layers can be considered for range restriction, we find that range restriction on the ACT layer is particularly desirable, for two reasons: 1) The ACT function determines the response from the neuron's output (e.g., filtering out negative outputs in the ReLu ACT function); this particular feature also makes it ideal for "filtering" out potential critical faults. For example, an ACT function can restrict a large output from the previous layer (where a transient fault occurs). This restriction reduces the deviation caused by the fault, lowering the probability of the fault leading to an SDC. 2) The ACT function is used frequently across the layers of DNNs (e.g., VGGNet, ResNet, etc.), and thus applying range restriction on the ACT function effectively dampens fault amplification between layers (as DNNs usually have many layers).

While range restriction on the ACT function limits the fault amplification effect, it is not sufficient, as there are still computations that occur between ACT functions; a single fault in these can quickly propagate and be amplified. We provide an example below to illustrate this problem:

y = ReLu2(Conv(MaxPool(ReLu1(x)), I)), (5.1)

where ReLu1 and ReLu2 are guarded by Ranger. Let their bounds be bound(ReLu1) = 10 and bound(ReLu2) = 1000. For simplicity, assume x = [1, 2], so MaxPool(ReLu1(x)) = 2. Let the Conv layer have n kernels (we assume that each kernel is a simple 1x1 identity kernel I), each of which performs a dot-product computation.
Thus the dimension of y is (n, 1).

Assume that a fault occurs in the MaxPool function and deviates MaxPool's output from 2 to 1024 + 2. The faulty value of 1026 subsequently propagates through the Conv layer, and thus all the elements of y are affected. In this case, the fault-free output y is an n-element vector of 2s. Ranger can restrict the values from 1026 to 1000, so y becomes a vector of 1000s. Such a large value deviation implies a high probability of resulting in an SDC, and thus applying Ranger on the ACT layer alone is not enough to mitigate critical faults.

By analyzing the value dependencies between layers in the model, we find that the restriction bound applied to the ACT function can also be extended to other functions. In the same example, the bound of ReLu1 is also applicable to the MaxPool function, i.e., bound(MaxPool(ReLu1(x))) = bound(ReLu1(x)). Therefore, the values from MaxPool should not be greater than bound(ReLu1) = 10, and thus y will be a vector of 10s even under the fault in the MaxPool function - a significantly lower deviation (compared with a faulty vector with values of 1000). The fault is then less likely to cause an SDC, as the limited deviation can be tolerated by the inherent resilience of the DNN. Therefore, we need to apply range restriction to selected layers beyond just the ACT function to effectively mitigate SDCs.

We describe the procedure of applying Ranger to an unprotected DNN in Algorithm 2. The input to the algorithm is the set of restriction bounds collected from the profiling process (j pairs of upper and lower bounds in total, for j ACT layers). The resulting output is the DNN protected with Ranger. Line 2 traverses each operation in the network. For each ACT operation, its output is bounded (lines 3-4).
For all the operations that follow (connect to) the ACT operation and belong to {Max-Pool, Avg-Pool, Reshape, Concatenate}, the same restriction bounds are also applied to their outputs (lines 5-8). These are the operations where Ranger can be deployed beyond the ACT operation (i.e., the operators protected by Ranger remain the same within the same model). Note that for the Concatenate operation, which concatenates the outputs from the preceding 2 ACT operations (this is used in the SqueezeNet model), the restriction bound is derived from the bounds of those 2 ACT operations: lower bound = min(low_j-1, low_j), and upper bound = max(up_j-1, up_j). The time complexity of the algorithm is O(n), where n is the network size.

Algorithm 2: Ranger restriction on an unprotected DNN
Input: i ← number of operations in the model
       j ← number of ACT operations
       (low_j, up_j) ← bounds for the jth ACT op
Output: Protected DNN with Ranger
 1: /* Traverse each operation in the network from the first layer to the last one */
 2: for op_i in operations in the network do
 3:   if op_i is the jth ACT operation then
 4:     Bound op_i with (low_j, up_j)
 5:     if op_i+1 in {Max-Pool, Avg-Pool, Reshape} then
 6:       Bound op_i+1 with (low_j, up_j)
 7:     else if op_i+1 in {Concatenate} then
 8:       Bound op_i+1 with (min(low_j-1, low_j), max(up_j-1, up_j))
 9:     end if
10:   end if
11: end for
12: return Protected DNN

5.3 Evaluation for Ranger

Implementation of Ranger: We have implemented Ranger using the TensorFlow framework [19]. This allows Ranger to be applied to a diverse set of ML programs. Ranger directly modifies the TensorFlow graph by adding the extra operators for range restriction (tf.math.minimum and tf.math.maximum), as per Algorithm 2, after learning the restriction bounds.

Note that Ranger can also be applied to DNNs written in other ML frameworks, such as PyTorch.
This is because Ranger leverages inherent properties of the DNN model (e.g., the monotone property of the DNN components and the inherent resilience of DNNs), which are platform-independent.

DNN benchmarks and datasets. In our evaluation, we consider DNN applications in both classification and regression tasks. In particular, we evaluate 8 different DNNs, comprising common DNN benchmarks in ML studies such as VGGNet, SqueezeNet and ResNet. SqueezeNet is a compressed DNN model, which is more energy-efficient than traditional DNNs, as many non-critical connections in the model are pruned (e.g., the model has a size of less than 5MB in our experiment) [48]. For the DNNs in regression tasks (where the output is a continuous value instead of a class label), we choose two DNN applications that can be used in the AV domain - applications that predict the steering angle of the AV. We choose the Nvidia Dave driving model [21] and the steering model from Comma.ai [8]. These DNN applications have been adopted in real-world vehicles [9] and are thus used as benchmarks for AV studies [25, 66].

We use standard ML datasets (such as MNist, Cifar-10 and ImageNet) as well as a real-world driving dataset collected from images captured by a real vehicle [10]. MNist is a popular ML dataset consisting of 60000 images of digits in 10 classes. The Cifar-10 dataset contains 60000 32*32 color images in 10 different classes. GTSRB is a dataset of real-world traffic signs for classification, with 43 different classes of traffic signs. ImageNet is a large database with more than 14 million images categorized into 1000 different classes.

Table 5.1 summarizes the models and datasets in our evaluation. For models using the ImageNet dataset, we report the top-1 accuracy (i.e., the target label is the predicted class with the highest probability) and the top-5 accuracy (i.e., the target label is one of the top 5 predictions) [57].
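The top-k check can be sketched as follows (a generic sketch, not tied to our evaluation harness; `scores` maps class labels to predicted probabilities):

```python
def topk_correct(scores, label, k):
    """True if `label` is among the k classes with the highest scores."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return label in ranked[:k]
```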
For the two steering models, we use the RMSE (root mean square error) and the average deviation per frame to evaluate the model's accuracy - these metrics are used in AV DNN studies [30].

Note that FI experiments on DNNs are highly time-consuming - for each input, we need to perform thousands of FI experiments to obtain a statistically significant estimate of the SDC rate. Therefore, we choose 10 inputs per model and ensure the DNNs generate correct predictions on these inputs (i.e., correct classifications; for the steering models, we check that the steering angle is correct). For the DNNs using ImageNet, we perform 3000 FI trials as they are more time-consuming, and 5000 FI trials for the others. This provides statistically significant results, and we also calculate error bars at the 95% confidence intervals. In our evaluations, we use a 32-bit fixed-point data type for the first 3 RQs (this is more energy-efficient than a floating-point data type [51]). For RQ4, we evaluate Ranger on DNNs using a 16-bit fixed-point data type.

Table 5.1: DNN models and datasets used for the evaluation of Ranger
DNN model       | Dataset  | Dataset Description
LeNet           | MNist    | Hand-written digits
AlexNet         | Cifar-10 | General images
VGG11           | GTSRB    | Real-world traffic signs
VGG16           | ImageNet | General images
ResNet-18       | ImageNet | General images
SqueezeNet      | ImageNet | General images
Nvidia Dave [3] | Driving  | Real-world driving frames
Comma.ai [8]    | Driving  | Real-world driving frames

Learning Restriction Bounds. In our work, we learn the restriction bounds from a randomly-sampled subset of the training set. We chose 20% of the training set from each dataset, and find that this is sufficient to learn the ranges for all the DNNs used in our study. For example, Fig. 5.3 shows the ranges of activation values obtained when sampling different amounts of training data on the VGG16 model. The values are normalized with respect to the global maximal values (i.e., the maximal values over all the sampled data).
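The bound-learning pass described above amounts to tracking per-layer activation extremes over a sample of the training set. A minimal sketch, assuming a `activations_of` function that stands in for a forward pass returning per-ACT-layer activation lists (the toy two-layer "network" here is purely illustrative):

```python
# Record, per ACT layer, the minimum and maximum activation values
# observed over a sample of training inputs.

def learn_bounds(samples, activations_of, num_act_layers):
    lows = [float("inf")] * num_act_layers
    ups = [float("-inf")] * num_act_layers
    for x in samples:
        for j, act in enumerate(activations_of(x)):
            lows[j] = min(lows[j], min(act))
            ups[j] = max(ups[j], max(act))
    return list(zip(lows, ups))

# Toy "network" with two ACT layers (hypothetical stand-in).
def activations_of(x):
    return [[x, 2 * x], [x * x]]

bounds = learn_bounds([1.0, -2.0, 3.0], activations_of, num_act_layers=2)
```

The resulting per-layer (low, up) pairs are exactly the inputs required by Algorithm 2.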
As Fig. 5.3 shows, the value range quickly converges to the global maximal values for all layers. We observe similar trends in the other networks. This suggests that the range of values learned from the sampling is sufficient to characterize the range of values in the model.

Note that learning the restriction bounds is a one-time cost, incurred before the deployment of the DNN. In our experiments, it took less than 3 hours to complete this process on the largest network (VGG16) in our evaluation, using around 20% of the training data in the ImageNet dataset. Nevertheless, feeding more data to obtain a larger restriction bound (e.g., using 100% of the training data) is also a viable option and can be completed in a reasonable amount of time (e.g., it would take less than 16 hours on VGG16).

Figure 5.3: Range of values observed in each ACT layer using different amounts of data on the VGG16 network (13 ACT layers in total). A total of 186056 images (around 20% of the whole training set) were used.

To reduce Ranger's effect on the model's accuracy, we conservatively choose the maximal value observed during the sampling process as the restriction bound (we study the effect of choosing other restriction bounds in Section 5.4.1).

5.3.1 RQ1: Effectiveness of Range Restriction.

We measure the SDC rate - the percentage of transient faults that cause SDCs - with and without Ranger. For classifier models, an SDC manifests as an image misclassification. For the two steering models, which produce continuous values as outputs, we use different threshold values for the deviation of the steering angle to identify SDCs, namely 15, 30, 60 and 120 degrees.

Note that we only consider faults that do not lead to obvious system failures such as a crash (e.g., modifying the dimension of a tensor might cause an error and terminate the program). We also exclude the last FC layer, because the values in the last FC layer are directly associated with the final outputs.
Thus, restricting the values in the last FC layer is not effective in mitigating SDCs (we validated this in our experiments). However, the last FC layer constitutes a very small fraction of the state space (e.g., in the VGG16 model, the last FC layer accounts for only 0.0047% of the state space), and techniques such as duplication can be used to protect this particular layer with minimal overheads.

Figure 5.4: SDC rates of the original classifier models and the models enhanced with Ranger. For the models using the ImageNet dataset, we provide the results for both top-1 and top-5 accuracy. Error bars range from ±0.04% to ±1.46% at the 95% confidence interval. Lower is better.
SDC rate (%) | LeNet | AlexNet | VGG11 | VGG16 (top-1) | VGG16 (top-5) | ResNet-18 (top-1) | ResNet-18 (top-5) | SqueezeNet (top-1) | SqueezeNet (top-5)
Original     | 19.65 | 19.82   | 19.89 | 6.87          | 6.13          | 19.67             | 18.40             | 12.83              | 11.02
Ranger       | 0.00  | 0.20    | 1.46  | 0.00          | 0.00          | 1.04              | 0.98              | 0.27               | 0.01

Figure 5.5: SDC rates of the original steering models and the models enhanced with Ranger. An SDC is defined by thresholding the deviation from the correct steering angle at different degrees (i.e., 15, 30, 60, and 120 degrees). Error bars range from ±0.03% to ±1.24% at the 95% confidence interval. Lower is better.
SDC rate (%) | Dave-15 | Dave-30 | Dave-60 | Dave-120 | Comma-15 | Comma-30 | Comma-60 | Comma-120
Original     | 23.68   | 21.93   | 20.07   | 16.02    | 27.70    | 25.88    | 24.13    | 22.20
Ranger       | 9.78    | 8.55    | 7.07    | 4.01     | 1.68     | 0.26     | 0.01     | 0.00

Fig. 5.4 illustrates the SDC rates of the 6 classifier models with and without Ranger. While different DNNs exhibit different SDC rates, Fig. 5.4 shows that Ranger achieves significant SDC reduction across all the models. For example, in the LeNet model, the SDC rate decreases from around 20% to 0% (in our experiments). On average, the SDC rates are reduced from 14.92% to 0.44%.

The results for the two steering models are shown in Fig. 5.5.
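The SDC-rate measurement described above can be sketched as follows. This is a minimal sketch under assumed trial counts; the normal-approximation interval shown here stands in for whatever confidence-interval computation the thesis actually used:

```python
import math

# Estimate an SDC rate from N fault-injection trials and report it with
# a 95% confidence interval. For steering models, a trial counts as an
# SDC when the angle deviation exceeds the chosen threshold.

def is_steering_sdc(golden_deg, faulty_deg, threshold_deg):
    """SDC criterion for the steering models: deviation beyond threshold."""
    return abs(faulty_deg - golden_deg) > threshold_deg

def sdc_rate_with_ci(num_sdc, num_trials, z=1.96):
    """SDC rate and half-width of a 95% CI (normal approximation)."""
    p = num_sdc / num_trials
    half_width = z * math.sqrt(p * (1 - p) / num_trials)
    return p, half_width

# e.g., 150 SDCs out of 5000 trials -> 3% SDC rate, roughly +/-0.47%
rate, ci = sdc_rate_with_ci(num_sdc=150, num_trials=5000)
```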
In the Comma steering model, Ranger largely reduces the deviation of the steering angle caused by transient faults and eliminates large deviations (e.g., 0% of SDCs in the category of threshold=120). However, Ranger achieves a less pronounced SDC reduction in the Dave model. This is because the Dave model outputs the steering angle in radians, while the Comma model outputs the steering angle in degrees, and the conversion between radians and degrees makes the radian-output model more sensitive to deviations: the conversion function (the Atan function in TensorFlow) has horizontal asymptotes (y ∈ (−π/2, π/2)), and thus even a small deviation at the input of the Atan function can cause a large output deviation, i.e., a higher SDC probability. Based on this observation, we train a new model that outputs the steering angle in degrees instead of radians, which achieves both better accuracy and resilience with Ranger (Section 5.4).

Quantitative comparison with related work. Hong et al. [46] suggest a defense mechanism against memory bit-flips that modifies the ACT functions of the models, such as changing ReLu into Tanh (a similar approach is also proposed in [? ]). We compare Ranger with the method of Hong et al. for 5 of the 8 DNNs: LeNet, AlexNet, VGG11, and the Dave and Comma steering models. We only consider 5 models because we need to train a new model for each DNN, and it is time-consuming to do so for the other 3 DNNs.

For both approaches, we report the SDC rate reduction relative to the original SDC rates in Fig. 5.6 (for brevity, we report the average results for the steering models across all the thresholds). Fig. 5.6 shows that the approach from Hong et al. achieves 0% relative SDC reduction in models using the Tanh function. This is because transient faults can occur after the Tanh function in the network, and are then unaffected by the replacement. In contrast, Ranger achieves significant SDC reduction because it performs selective range restriction in the entire DNN.
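The comparison above reports *relative* SDC reduction; a minimal sketch of that metric (the numbers here are illustrative, not from the thesis experiments):

```python
# Relative SDC reduction: the percentage of the original SDC rate that
# a protection technique eliminates.

def relative_sdc_reduction(original_rate, protected_rate):
    """Both rates in percent; returns the relative reduction in percent."""
    return (original_rate - protected_rate) / original_rate * 100.0

# e.g., an SDC rate dropping from 20% to 1% is a 95% relative reduction.
reduction = relative_sdc_reduction(20.0, 1.0)
```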
For models using the ReLu function, while both approaches reduce SDC rates, Ranger provides a significantly larger resilience boost. For example, for the category of threshold=120 in the two steering models, the SDC rates under Hong et al.'s approach vary from 4.76% to 9.48%, while with Ranger the SDC rates vary from 0% to 0.27% - an order of magnitude lower. On average, Ranger achieves more than 90% relative SDC rate reduction across all the models, far exceeding the reduction achieved by Hong et al.

Figure 5.6: Relative SDC rate reduction in DNNs following the approach in Hong et al. [46] and Ranger. Error bars range from ±0.12% to ±1.38% at the 95% confidence interval. Higher is better.
Relative SDC reduction (%) | LeNet  | AlexNet | VGG11  | Dave (Average) | Comma (Average) | Average
Tanh - Hong et al.         | 0.00   | 0.00    | 0.00   | 0.00           | 0.00            | 0.00
Tanh - Ranger              | 100.00 | 96.81   | 100.00 | 87.20          | 97.61           | 94.19
Relu - Hong et al.         | 18.20  | 34.99   | 38.16  | 57.06          | 50.24           | 47.32
Relu - Ranger              | 100.00 | 98.97   | 92.64  | 86.96          | 94.64           | 93.85

Ranger reduces the SDC rate from 14.92% to 0.44% (a 34X reduction) for the 6 classifier DNNs. For the steering models, the SDC rate decreases from 24.98% to 0.49% on the Comma.ai model (a 50X reduction), and from 20.42% to 7.35% (a 2.77X reduction) on the Nvidia Dave model.

5.3.2 RQ2: Accuracy.

While Fig. 5.4 and Fig. 5.5 demonstrate the effectiveness of Ranger in reducing the SDC rates of DNNs, it is also important to understand whether it affects the model's accuracy. Thus, we compare the accuracy of the original models and the models with Ranger. As mentioned, the restriction bound is learned by sampling the ACT values from a subset of the training set (from 10% to 50% of the training data in our experiments).
The models are then evaluated on a separate validation set (e.g., the ImageNet validation set of 50000 images), which serves as an unbiased approximation of the test data - a common practice in ML studies. We report the accuracy of all the DNNs in Table 5.2.

Our results show that applying Ranger does not degrade the accuracy of the baseline models for any of the 8 DNNs. This is because the restriction bounds learned from existing data are sufficient to characterize the ranges of values in the DNN. For the cases where normal values exceed the restriction bounds, the value reduction due to Ranger does not affect the accuracy. In fact, in the SqueezeNet model, applying range restriction marginally increases the accuracy of the model: the large value corresponding to an incorrect label is reduced by Ranger, and thus the classification probability of the incorrect label is also decreased. We discuss the limitations of our empirical evaluation in Section 5.4.

Table 5.2: Accuracy of the original DNN and the DNN protected with Ranger. + indicates an accuracy improvement. Higher is better.
DNN model          | Org. Model | w/ Ranger | Diff.
LeNet              | 99.20%     | 99.20%    | 0.00%
AlexNet            | 82.14%     | 82.14%    | 0.00%
VGG11              | 99.74%     | 99.74%    | 0.00%
VGG16 (top-1)      | 64.72%     | 64.72%    | 0.00%
VGG16 (top-5)      | 85.736%    | 85.736%   | 0.00%
ResNet-18 (top-1)  | 62.66%     | 62.66%    | 0.00%
ResNet-18 (top-5)  | 84.61%     | 84.61%    | 0.00%
SqueezeNet (top-1) | 52.936%    | 52.940%   | +0.004%
SqueezeNet (top-5) | 74.150%    | 74.154%   | +0.004%
Dave (RMSE)        | 9.808      | 9.808     | 0.000
Dave (Avg. Dev.)   | 3.153      | 3.153     | 0.000
Comma (RMSE)       | 24.122     | 24.122    | 0.000
Comma (Avg. Dev.)  | 12.640     | 12.640    | 0.000

Our empirical evaluation demonstrates that Ranger does not degrade the accuracy of any of the evaluated DNN models.

5.3.3 RQ3: Overhead.

We evaluate the runtime overhead of Ranger in terms of its memory and performance overhead during the inference phase.
The memory overhead comes from the storage for the restriction bounds, which is proportional to the number of ACT functions in the model (the restriction bounds are applied not only to ACT functions, but also to others such as the Max-Pool function). Given the typical size of the DNNs, this memory overhead is negligible (e.g., the VGG16 model has a size of over 500MB).

We measure the performance overhead of Ranger in terms of the floating-point operations (FLOPs) it incurs. FLOPs are a measure of the latency and energy consumption of ML models [39, 81, 88], and are independent of the hardware platform. We use the TensorFlow built-in profiler to measure the FLOPs of each model.

Table 5.3: Computation overhead (FLOPs) of Ranger. M stands for Million; B stands for Billion.
DNN model  | w/o Ranger | w/ Ranger | Overhead
LeNet      | 24.622M    | 24.722M   | 0.408%
AlexNet    | 11.361M    | 11.468M   | 0.937%
VGG11      | 87.057M    | 87.326M   | 0.309%
VGG16      | 309.604B   | 309.905B  | 0.097%
ResNet-18  | 36.354B    | 36.404B   | 0.138%
SqueezeNet | 530.813M   | 539.215M  | 1.583%
Dave       | 56.545M    | 56.764M   | 0.387%
Comma      | 17.673M    | 17.723M   | 0.282%

The results are reported in Table 5.3. We did not observe variation across different inputs, as they all have the same dimensions. We find the overhead of Ranger to be very low (0.518% on average). This is because Ranger only involves range checking and truncation, whereas DNNs usually perform far more time-consuming computations. Note that we have only evaluated the overhead of Ranger on programs written using TensorFlow; however, the overhead should be small on other frameworks as well, given the simplicity of the range restriction.

5.3.4 RQ4: Effectiveness of Ranger under a Reduced-Precision Data Type.

In the previous evaluations, we used a 32-bit fixed-point data type, which is more energy-efficient than a floating-point data type. The data type precision can be reduced to gain further energy efficiency [51].
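The fixed-point encodings used in this evaluation can be illustrated with a small sketch. This is a simplified, hypothetical illustration (unsigned words, no sign handling) of a 16-bit format with 14 integer bits and 2 fractional bits - the RQ4 configuration - showing why a bit flip in a high-order position produces a far larger value deviation than one in a low-order position:

```python
# Encode a value in a 16-bit fixed-point word (2 fractional bits) and
# flip single bits to mimic transient faults at different positions.

FRAC_BITS = 2
WIDTH = 16

def to_fixed(x):
    """Quantize x to a WIDTH-bit unsigned fixed-point word."""
    return int(round(x * (1 << FRAC_BITS))) & ((1 << WIDTH) - 1)

def from_fixed(word):
    """Decode a fixed-point word back to a real value."""
    return word / (1 << FRAC_BITS)

def flip_bit(word, pos):
    """Flip bit `pos` (0 = least significant) of the fixed-point word."""
    return word ^ (1 << pos)

w = to_fixed(3.25)                  # 3.25 encodes as raw word 13
low = from_fixed(flip_bit(w, 0))    # LSB flip: small deviation (3.0)
high = from_fixed(flip_bit(w, 13))  # high-order flip: huge deviation
```

The high-order flip yields a value in the thousands, which is exactly the kind of out-of-bound deviation that Ranger's restriction bounds catch.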
We therefore study the effectiveness of Ranger on DNNs using a 16-bit fixed-point data type, which can be used for inference in large DNNs (smaller data types such as an 8-bit data type cannot provide enough dynamic value range for large DNNs [51]). While it is possible to use quantization such that the same 8-bit data width represents a larger value range, this is essentially similar to a vanilla fixed-point data type without quantization (e.g., a quantized 8-bit data type can represent a comparable range to the unquantized 16-bit data type). We use 14 bits for the integer part and 2 bits for the fractional part, which is sufficient to retain the accuracy of common DNNs according to [51].

Similar to our previous evaluation, we report the SDC rates of the original models and the models protected with Ranger. The results are presented in Fig. 5.7. As shown, Ranger is still effective in reducing the SDC rate of all 8 DNNs, from 15.11% to 0.93% on average (a 16X reduction).

Figure 5.7: SDC rate of DNNs using a 16-bit fixed-point data type. Error bars range from ±0.04% to ±1.33% at the 95% confidence interval. Lower is better.

5.4 Discussion

In this section, we first perform a case study on the trade-off between accuracy and resilience for Ranger. We then discuss design alternatives for Ranger, and finally the limitations of Ranger.

5.4.1 Trade-off between Accuracy and Resilience.

We study how to adjust the restriction bound to gain additional resilience at the cost of accuracy (e.g., for systems that are more prone to transient faults). We choose the Dave steering model as an example because, as shown in Fig. 5.5, its average SDC rate is still around 7% even with Ranger.

As mentioned, Ranger does not yield significant SDC reduction on the original Dave model, which outputs the radian value.
Therefore, we train a new Dave model whose output is the steering angle in degrees. We then evaluate both the accuracy and the SDC rate of the model using different restriction bounds. Recall that we learn the distribution of ACT values through statistical sampling, and we can choose the restriction bound accordingly. For example, setting the restriction bound to the 100th percentile means using the value that covers all of the sampled values (i.e., the maximum value). Similarly, setting the bound to the 99th percentile chooses a value that covers 99% of the sampled values.

Fig. 5.8 shows the SDC rates of the model with different restriction bounds, and Table 5.4 shows the corresponding accuracy values. While the new Dave model exhibits a higher SDC rate, the SDC rate reduction due to Ranger is larger than that in Fig. 5.5. For example, when applying Ranger to both models, the SDC rate for the category of threshold=120 is 4.01% in the original Dave model, and 2.23% in the new one. Ranger does not degrade the accuracy of either model. In fact, the accuracy of the new Dave model is higher than that in Fig. 5.5. This is likely because the new model outputs the steering angle in degrees, which has a larger dynamic range than radian values, thereby allowing the model to make more accurate predictions.

As expected, setting the restriction bound to a lower-percentile value boosts the resilience but degrades the accuracy of the model. When comparing the models using the 100% bound and the 99.9% bound, the average deviation per frame increases from 2.651 to 2.883. However, this provides a higher resilience boost; e.g., the SDC rate for threshold=120 reduces from 2.23% to 0.27%. On average, choosing the 99.9th-percentile restriction bound reduces the SDC rate of the Dave model from 49.49% to 2.9%, without much accuracy loss.

5.4.2 Design Alternatives for Ranger

While Ranger restores out-of-bound values to the restriction bound, it is also possible to explore other design alternatives. Reagen et al.
propose to reset a faulty value to 0 upon the detection of a fault [71]. Similarly, we can restrict all out-of-bound values to 0 instead of to the restriction bound. We conduct a targeted experiment on the VGG16 model using this strategy. We observe that resetting all values beyond the bound to 0 significantly degrades the accuracy of the original model (e.g., 3/5 = 60% of the inputs are incorrectly predicted due to the reset to 0). This is because the value reduction is so drastic that the model is unable to generate correct outputs. Further, resetting values to 0 is likely to propagate 0 values into subsequent operations, such as multiplication operations (which can lead to incorrect results).

Figure 5.8: SDC rates for the Dave model with different restriction bounds. Error bars range from ±0.14% to ±1.39% at the 95% confidence interval. Lower is better.
SDC rate (%) | threshold=15 | threshold=30 | threshold=60 | threshold=120
Original     | 54.37        | 51.07        | 47.83        | 44.70
Bound-100%   | 6.80         | 5.26         | 3.67         | 2.23
Bound-99.9%  | 5.65         | 4.04         | 1.65         | 0.27
Bound-99%    | 14.21        | 11.48        | 0.57         | 0.00
Bound-98%    | 14.08        | 11.48        | 0.48         | 0.00

Table 5.4: Accuracy of the Dave model with different restriction bounds. Lower is better.
Accuracy       | Original | 100% Bound | 99.9% Bound | 99% Bound | 98% Bound
RMSE           | 6.069    | 6.069      | 8.5719      | 12.370    | 13.940
Avg. Deviation | 2.651    | 2.651      | 2.883       | 4.077     | 4.884

Another possible strategy is to replace an out-of-bound value with a random value between 0 and the restriction bound. We perform another targeted experiment where we reset values outside the restriction range to random values (1000 times). Even though some of the classification results change (i.e., the incorrect predicted labels change), the top-1 and top-5 accuracy remains the same. Therefore, the random replacement strategy is also viable, though Ranger is deterministic and may therefore be preferred for safety-critical systems.
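The three replacement strategies discussed above can be contrasted in a toy sketch (the values and function names here are illustrative assumptions, not from the thesis experiments):

```python
import random

# Three ways to handle an out-of-bound value: clamp it to the restriction
# bound (Ranger), reset it to zero, or draw a random in-range value.

def clamp(x, low, up):
    return max(low, min(x, up))

def zero_reset(x, low, up):
    return x if low <= x <= up else 0.0

def random_reset(x, low, up, rng):
    return x if low <= x <= up else rng.uniform(low, up)

rng = random.Random(0)
faulty = 1.0e6          # a large deviation caused by a high-order bit flip
low, up = 0.0, 6.0
clamped = clamp(faulty, low, up)                 # 6.0: stays near the normal range
zeroed = zero_reset(faulty, low, up)             # 0.0: may zero out later multiplies
randomized = random_reset(faulty, low, up, rng)  # somewhere in [0, 6]
```

Clamping keeps the perturbed value close to the normal activation range, which is why the drastic reduction to 0 hurts accuracy while Ranger's truncation does not.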
We plan to explore the random replacement strategy in future work.

5.4.3 Limitations of Ranger

Although our results show otherwise, Ranger might decrease the model's accuracy in some cases, because we cannot test all unseen data in practice. However, there is no easy remedy for this problem, and using a validation set (as in our evaluation) is the common practice for evaluating the generalizability of trained models (in our case, the models with Ranger applied). Further, the restriction bounds learned from existing data should cover the ranges of the majority of possible values, and for the rare cases where normal values exceed the restriction bound, Ranger still does not degrade the accuracy of the model (per our evaluation). Therefore, even if Ranger causes some accuracy degradation in real-world scenarios, such degradation is unlikely to be high, given the inherent resilience of DNNs to insignificant errors.

While Ranger provides a significant resilience boost for different DNNs - even reducing the SDC rate to 0% for some models (e.g., LeNet, VGG16) - we still observe a small margin of SDCs in some of the models. This indicates that the DNN systems remain vulnerable to a small number of critical faults. We attribute this to the heterogeneous patterns of the critical faults in different DNNs.
In particular, while most critical faults must result in large deviations to cause SDCs, this is not always the case. As shown above, we can reduce the restriction bound to gain further SDC reduction at the cost of accuracy.

With the above said, transient faults are relatively rare events, and Ranger leaves only 0.44% of SDCs unprotected in the 6 classifier DNNs and 2.49% in the two steering models (using degrees in the Dave model). Further, it does not degrade the accuracy of the original models, and it incurs only a modest computation overhead.

5.4.4 Discussion of the Multiple Bit-flip Scenario

In this section, we discuss the potential application of Ranger in the presence of multiple bit-flips. Multiple bit-flips occurring in the system will result in more values being affected as the number of faults increases. However, Ranger is agnostic to the number of out-of-bound values, as it restricts all values that exceed the restriction bound. Thus, Ranger will still be able to mitigate the critical faults by restricting the range of the values in the network. We plan to evaluate Ranger under a multiple bit-flip fault model in future work.

5.5 Summary

In this chapter, we describe how we can exploit the monotone property of ML computations, as well as the inherent resilience of ML models, to protect ML systems from transient faults. Specifically, we present Ranger, a technique that selectively restricts the range of values in ML models. The main insight is to transform the critical faults (typically in the high-order bits) into the equivalent of faults in the lower-order bits, which thus become benign faults that can be tolerated by the inherent resilience of ML.

We evaluate Ranger on 8 DNN models over a total of 5 datasets (including two DNN applications in the AV domain).
The evaluation demonstrates that Ranger: 1) significantly enhances the resilience of the DNN models - it reduces the SDC rates from 14.92% to 0.44% (for the classifier DNNs) and from 37.24% to 2.49% (for the AV DNNs); 2) does not degrade the accuracy of any of the evaluated models; and 3) incurs negligible memory and performance overheads (0.518% on average).

Chapter 6
Conclusion and Future Work

6.1 Summary

Hardware transient faults are growing in frequency and increasingly lead to unreliability in modern ML systems (such as causing an AV to miss an obstacle in its path). This thesis attempts to understand and improve the error resilience of ML systems in the presence of hardware transient faults.

To understand the resilience of ML systems, we propose BinFI, an efficient fault injector for identifying the critical bits in ML systems. Critical bits are those bits where the occurrence of a fault results in output corruption. Our insight is based on the observation that many ML computations are monotonic with respect to different faults, which constrains their fault propagation behavior. We thus identify the existence of an SDC boundary, where faults in higher-order bits result in SDCs while faults in lower-order bits are masked. Finally, we design a binary-search-like fault injector to identify the SDC boundary, and implement it as a tool called BinFI for ML programs written using the TensorFlow framework. Our evaluation demonstrates that BinFI correctly identifies 99.56% of the critical bits with 99.63% precision, which significantly outperforms conventional random FI-based approaches. It also incurs significantly lower overhead (by 5X) than exhaustive FI techniques.
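The binary-search idea behind BinFI can be sketched as follows. This is a simplified illustration, not the actual BinFI tool: `causes_sdc` stands in for one fault-injection trial at a given bit position, and perfect monotonicity of the fault impact is assumed.

```python
# If fault impact is monotonic in bit position -- flips at or above some
# boundary bit cause SDCs, flips below are masked -- the SDC boundary can
# be found with a binary search instead of injecting into every bit.

def find_sdc_boundary(causes_sdc, num_bits):
    """Return the lowest bit position whose flip causes an SDC,
    or num_bits if no flip does. `causes_sdc(pos)` performs one FI trial."""
    lo, hi = 0, num_bits  # invariant: the boundary lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if causes_sdc(mid):
            hi = mid      # mid is critical; the boundary is at or below mid
        else:
            lo = mid + 1  # mid is masked; the boundary is above mid
    return lo

# Toy monotone fault model: flips at bit 10 and above corrupt the output.
boundary = find_sdc_boundary(lambda pos: pos >= 10, num_bits=32)
```

For a 32-bit value this needs at most six trials rather than thirty-two, which is the source of BinFI's efficiency over exhaustive injection.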
BinFI can also accurately measure the overall resilience of the application.

To improve the resilience of ML systems, we present Ranger, an automated technique that selectively restricts the ranges of values in DNNs, in order to dampen the large deviations typically caused by critical faults leading to SDCs. The reduced deviations can then be tolerated by the inherent resilience of DNNs without causing SDCs. Ranger can be integrated into existing DNNs without major additional effort. We evaluate Ranger on 8 popular DNN models, including two end-to-end DNN applications in the self-driving domain. Our evaluation demonstrates that Ranger can reduce the SDC rates of the classifier DNNs from 14.92% to 0.44%, and those of the steering models from 37.24% to 2.49%, without degrading the accuracy of the original models and while incurring only modest memory and computation overheads.

In conclusion, this thesis provides an analytical understanding of the error resilience of ML systems in the presence of soft errors and characterizes the critical faults in ML applications. Based on that understanding, we design an efficient fault mitigation technique that can be integrated into real-world ML systems to protect them from soft errors.

6.2 Future Work

There are three potential directions in which this thesis work can be extended.

6.2.1 Evaluation on End-to-end Self-driving Platforms

In this thesis, we present BinFI and Ranger and evaluate them on ML benchmarks. While ML is an integral component of modern self-driving platforms, AVs are systems that combine a variety of components (e.g., LiDAR) to collectively navigate the vehicle safely. While this thesis primarily leverages properties that are unique to ML to design error-resilience solutions, it is also important to evaluate the proposed techniques in a more realistic setting, i.e., an end-to-end AV platform.
There are several industry-grade AV platforms, e.g., DriveAV, a proprietary self-driving platform from Nvidia [14], and Apollo, an open-source AV platform from Baidu [6]. These platforms can be used to evaluate the effectiveness of BinFI and Ranger.

6.2.2 Extend to Other Safety-critical ML Applications

We consider autonomous vehicles (AVs) as an emerging example of safety-critical ML applications. However, there are many other domains where BinFI and Ranger can be adopted. For example, ML has seen increasing application in the healthcare domain (e.g., precision medicine), and several studies have proposed ML solutions in this area. Xiong et al. build a DNN for the detection of atrial fibrillation, which is powered by thousands of GPU cores. Yoon et al. [94] and Khryashchev et al. [53] use supercomputer facilities to empower DNNs for analyzing medical data. In addition, DNNs are also applicable in the network security domain, where DNN-based intrusion detection systems have been built [67]. In many of these DNN systems, transient faults can have severe safety implications; thus, BinFI can be used to assess the resilience of the systems, and Ranger to protect them from transient faults.

6.2.3 Address the Remaining Critical Faults

As shown in Section 5.3, even with the protection of Ranger, a small margin of critical faults remains in the ML system. This means that the system is still vulnerable (though the degree of vulnerability has been reduced significantly). Therefore, one future direction is to investigate how to protect ML systems from the remaining critical faults. One potential solution is to apply BinFI to the models protected by Ranger, with the aim of identifying the remaining critical faults through efficient fault injection. Based on the fault injection results, one can attempt to characterize the remaining critical faults, such as via a supervised learning algorithm that distinguishes them [79].
This can complement Ranger in providing comprehensive protection of ML systems against transient faults.

Bibliography

[1] Training AI for self-driving vehicles: the challenge of scale. URL https://devblogs.nvidia.com/training-self-driving-vehicles-challenge-scale/.
[2] Autonomous and ADAS test cars produce over 11 TB of data per day. URL https://www.tuxera.com/blog/autonomous-and-adas-test-cars-produce-over-11-tb-of-data-per-day/.
[3] TensorFlow implementation of the Nvidia Dave system. URL https://github.com/SullyChen/Autopilot-TensorFlow.
[4] Autumn model in the Udacity challenge. URL https://github.com/udacity/self-driving-car/tree/master/steering-models/community-models/autumn.
[5] Autonomous car - a new driver for resilient computing and design-for-test. URL https://nepp.nasa.gov/workshops/etw2016/talks/15WED/20160615-0930-AutonomousSaxena-Nirmal-Saxena-Rec2016Jun16-nasaNEPP.pdf.
[6] Baidu Apollo. URL http://apollo.auto/.
[7] Cifar dataset. URL https://www.cs.toronto.edu/~kriz/cifar.html.
[8] comma.ai's steering model. URL https://github.com/commaai/research.
[9] On-road tests for the Nvidia Dave system. URL https://devblogs.nvidia.com/deep-learning-self-driving-cars/.
[10] Driving dataset. URL https://github.com/SullyChen/driving-datasets.
[11] Epoch model in the Udacity challenge. URL https://github.com/udacity/self-driving-car/tree/master/steering-models/community-models/cg23.
[12] Functional safety methodologies for automotive applications. URL https://www.cadence.com/content/dam/cadence-www/global/en_US/documents/solutions/automotive-functional-safety-wp.pdf.
[13] MNist dataset. URL http://yann.lecun.com/exdb/mnist/.
[14] Nvidia Drive. URL https://developer.nvidia.com/drive/drive-software.
[15] Nvidia Drive AGX. URL https://www.nvidia.com/en-us/self-driving-cars/drive-platform/hardware/.
[16] Rambo. URL https://github.com/udacity/self-driving-car/tree/master/steering-models/community-models/rambo.
[17] Survival dataset. URL https://archive.ics.uci.edu/ml/datasets/Haberman's+Survival.
[18] TensorFlow popularity. URL https://towardsdatascience.com/deep-learning-framework-power-scores-2018-23607ddf297a.
[19] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265-283, 2016.
[20] R. A. Ashraf, R. Gioiosa, G. Kestor, R. F. DeMara, C.-Y. Cher, and P. Bose. Understanding the propagation of transient errors in HPC applications. In SC'15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-12. IEEE, 2015.
[21] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
[22] C.-K. Chang, S. Lym, N. Kelly, M. B. Sullivan, and M. Erez. Evaluating and accelerating high-fidelity error injection for HPC. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC '18, pages 45:1-45:13, 2018.
[23] C.-K. Chang, S. Lym, N. Kelly, M. B. Sullivan, and M. Erez. Evaluating and accelerating high-fidelity error injection for HPC. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, page 45. IEEE Press, 2018.
[24] Z. Chen, N. Narayanan, B. Fang, G. Li, K. Pattabiraman, and N. DeBardeleben. TensorFI: A flexible fault injection framework for TensorFlow applications. arXiv preprint arXiv:2004.01743, 2020.
[25] Z. Chen et al. BinFI: An efficient fault injector for safety-critical machine learning systems. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 2019.
[26] M. Courbariaux, Y. Bengio, and J.-P. David. Training deep neural networks with low precision multiplications. arXiv preprint arXiv:1412.7024, 2014.
[27] N. DeBardeleben, J. Laros, J. T. Daly, S. L. Scott, C. Engelmann, and B. Harrod. High-end computing resilience: Analysis of issues facing the HEC community and path-forward for research and development. Whitepaper, Dec. 2009.
[28] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. 2009.
[29] F. F. dos Santos, C. Lunardi, D. Oliveira, F. Libano, and P. Rech. Reliability evaluation of mixed-precision architectures. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 238-249. IEEE, 2019.
[30] S. Du et al. Self-driving car steering angle prediction based on image recognition. Department of Computer Science, Stanford University, Tech. Rep. CS231-626, 2017.
[31] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639):115, 2017.
[32] B. Fang, K. Pattabiraman, M. Ripeanu, and S. Gurumurthi. GPU-Qin: A methodology for evaluating the error resilience of GPGPU applications. In 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 221-230. IEEE, 2014.
[33] M. S. Gashler and S. C. Ashmore. Training deep Fourier neural networks to fit time-series data. In International Conference on Intelligent Computing, pages 48-55. Springer, 2014.
[34] G. Georgakoudis, I. Laguna, D. S. Nikolopoulos, and M. Schulz. REFINE: Realistic fault injection via compiler-based instrumentation for accuracy, portability and speed. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, page 29. ACM, 2017.
[35] J. George, B. Marr, B. E. Akgul, and K. V. Palem. Probabilistic arithmetic and energy efficient embedded signal processing. In Proceedings of the 2006 International Conference on Compilers, Architecture and Synthesis for Embedded Systems, pages 158-168. ACM, 2006.
[36] I. Goodfellow et al. Deep learning. MIT Press, 2016.
[37] H. Guan et al. In-place zero-space memory protection for CNN. In Advances in Neural Information Processing Systems 32. 2019.
[38] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical precision. In International Conference on Machine Learning, pages 1737-1746, 2015.
[39] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135-1143, 2015.
[40] M. A. Hanif et al. Error resilience analysis for systematically employing approximate computing in convolutional neural networks. In 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2018.
[41] S. K. S. Hari, S. V. Adve, H. Naeimi, and P. Ramachandran. Relyzer: Exploiting application-level fault equivalence to analyze application resiliency to transient faults. In ACM SIGPLAN Notices, volume 47, pages 123-134. ACM, 2012.
[42] S. Haykin. Neural networks, volume 2. Prentice Hall, New York, 1994.
[43] K. Hazelwood, S. Bird, D. Brooks, S. Chintala, U. Diril, D. Dzhulgakov, M. Fawzy, B. Jia, Y. Jia, A. Kalro, et al. Applied machine learning at Facebook: a datacenter infrastructure perspective. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 620-629. IEEE, 2018.
[44] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[45] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[46] S. Hong, P. Frigo, Y. Kaya, C. Giuffrida, and T. Dumitraş. Terminal brain damage: Exposing the graceless degradation in deep neural networks under hardware fault attacks. arXiv preprint arXiv:1906.01017, 2019.
[47] S. Houben, J. Stallkamp, J. Salmen, M. Schlipsing, and C. Igel. Detection of traffic signs in real-world images: The German Traffic Sign Detection Benchmark. In The 2013 International Joint Conference on Neural Networks (IJCNN), pages 1-8. IEEE, 2013.
[48] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.
[49] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[50] Z. Ji, Z. C. Lipton, and C. Elkan. Differential privacy and machine learning: a survey and review. arXiv preprint arXiv:1412.7584, 2014.
[51] P. Judd et al. Proteus: Exploiting precision variability in deep neural networks. Parallel Computing, 2018.
[52] K. D. Julian, J. Lopez, J. S. Brush, M. P. Owen, and M. J. Kochenderfer. Policy compression for aircraft collision avoidance systems. In 2016 IEEE/AIAA 35th Digital Avionics Systems Conference (DASC), pages 1-10. IEEE, 2016.
[53] V. Khryashchev, A. Lebedev, O. Stepanova, and A. Srednyakova. Using convolutional neural networks in the problem of cell nuclei segmentation on histological images.
In International Conference on InformationTechnologies, pages 149–161. Springer, 2019. → pages 6, 59[54] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification withdeep convolutional neural networks. In Advances in neural informationprocessing systems, pages 1097–1105, 2012. → pages 18, 20, 27, 36[55] J. Langford, L. Li, and T. Zhang. Sparse online learning via truncatedgradient. Journal of Machine Learning Research, 10(Mar):777–801, 2009.→ page 12[56] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E.Hubbard, and L. D. Jackel. Handwritten digit recognition with aback-propagation network. In Advances in neural information processingsystems, pages 396–404, 1990. → pages 18, 27[57] G. Li, S. K. S. Hari, M. Sullivan, T. Tsai, K. Pattabiraman, J. Emer, and S. W.Keckler. Understanding error propagation in deep learning neural network(dnn) accelerators and applications. In Proceedings of the InternationalConference for High Performance Computing, Networking, Storage andAnalysis, page 8. ACM, 2017. → pages 2, 4, 7, 9, 10, 11, 13, 30, 44[58] G. Li, K. Pattabiraman, S. K. S. Hari, M. Sullivan, and T. Tsai. Modelingsoft-error propagation in programs. In 2018 48th Annual IEEE/IFIPInternational Conference on Dependable Systems and Networks (DSN),pages 27–38. IEEE, 2018. → pages 8, 10, 11, 14[59] L. Li and J. Chen. Iterative roots of piecewise monotonic functions withfinite nonmonotonicity height. Journal of Mathematical Analysis andApplications, 411(1):395–404, 2014. → page 16[60] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improveneural network acoustic models. In Proc. icml, volume 30, page 3, 2013. →page 20[61] A. Mahmoud, S. K. S. Hari, C. W. Fletcher, S. V. Adve, C. Sakr,N. Shanbhag, P. Molchanov, M. B. Sullivan, T. Tsai, and S. W. Keckler.Hardnn: Feature map vulnerability evaluation in cnns. arXiv preprintarXiv:2002.09786, 2020. → pages 4, 1265[62] S. Michalak, K. Harris, N. Hengartner, B. Takala, and S. Wender. 
Predictingthe number of fatal soft errors in los alamos national laboratory’s asc qsupercomputer. Device and Materials Reliability, IEEE Transactions on, 5:329 – 335, 10 2005. doi:10.1109/TDMR.2005.855685. → page 1[63] M. Monterrubio-Velasco, J. C. Carrasco-Jimenez, O. Castillo-Reyes,F. Cucchietti, and J. De la Puente. A machine learning approach forparameter screening in earthquake simulation. In 2018 30th InternationalSymposium on Computer Architecture and High Performance Computing(SBAC-PAD), pages 348–355. IEEE, 2018. → pages 1, 6[64] V. Nair and G. E. Hinton. Rectified linear units improve restrictedboltzmann machines. In Proceedings of the 27th international conference onmachine learning (ICML-10), pages 807–814, 2010. → pages 17, 20[65] N. Oh, P. P. Shirvani, and E. J. McCluskey. Control-flow checking bysoftware signatures. IEEE transactions on Reliability, 51(1):111–122, 2002.→ page 14[66] K. Pei, Y. Cao, J. Yang, and S. Jana. Deepxplore: Automated whiteboxtesting of deep learning systems. In proceedings of the 26th Symposium onOperating Systems Principles, pages 1–18. ACM, 2017. → pages7, 19, 27, 44[67] S. Potluri and C. Diedrich. Accelerated deep neural networks for enhancedintrusion detection system. In 2016 IEEE 21st International Conference onEmerging Technologies and Factory Automation (ETFA), pages 1–8. IEEE,2016. → page 59[68] P. Rajpurkar, A. Y. Hannun, M. Haghpanahi, C. Bourn, and A. Y. Ng.Cardiologist-level arrhythmia detection with convolutional neural networks.arXiv preprint arXiv:1707.01836, 2017. → pages 2, 19[69] P. Ramachandran, B. Zoph, and Q. V. Le. Searching for activation functions.arXiv preprint arXiv:1710.05941, 2017. → pages 22, 36[70] B. Reagen, U. Gupta, L. Pentecost, P. Whatmough, S. K. Lee,N. Mulholland, D. Brooks, and G.-Y. Wei. Ares: A framework forquantifying the resilience of deep neural networks. In 2018 55thACM/ESDA/IEEE Design Automation Conference (DAC), pages 1–6. IEEE,2018. → pages 2, 10, 1366[71] B. Reagen et al. 
Minerva: Enabling low-power, highly-accurate deep neuralnetwork accelerators. In ACM/IEEE 43rd Annual International Symposiumon Computer Architecture (ISCA). IEEE, 2016. → pages 11, 53[72] J. Redmon and A. Farhadi. Yolo9000: better, faster, stronger. InProceedings of the IEEE conference on computer vision and patternrecognition, pages 7263–7271, 2017. → pages 18, 20[73] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once:Unified, real-time object detection. In Proceedings of the IEEE conferenceon computer vision and pattern recognition, pages 779–788, 2016. → page18[74] D. A. Reed and J. Dongarra. Exascale computing and big data.Communications of the ACM, 58(7):56–68, 2015. → pages 1, 6[75] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-timeobject detection with region proposal networks. In Advances in neuralinformation processing systems, pages 91–99, 2015. → page 18[76] A. H. M. Rubaiyat, Y. Qin, and H. Alemzadeh. Experimental resilienceassessment of an open-source driving agent. arXiv preprintarXiv:1807.06172, 2018. → page 7[77] B. Sangchoolie, K. Pattabiraman, and J. Karlsson. One bit is (not) enough:An empirical study of the impact of single and multiple bit-flip errors. In2017 47th Annual IEEE/IFIP International Conference on DependableSystems and Networks (DSN), pages 97–108. IEEE, 2017. → page 9[78] S. K. Sastry Hari, R. Venkatagiri, S. V. Adve, and H. Naeimi. Ganges: Gangerror simulation for hardware resiliency evaluation. ACM SIGARCHComputer Architecture News, 42(3):61–72, 2014. → pages 10, 11[79] C. Schorn et al. Efficient on-line error detection and mitigation for deepneural network accelerators. In International Conference on ComputerSafety, Reliability, and Security. Springer, 2018. → pages 4, 9, 11, 40, 59[80] B. Schroeder and G. A. Gibson. Understanding failures in petascalecomputers. In Journal of Physics: Conference Series, volume 78, page012022. IOP Publishing, 2007. → page 1[81] A. Sehgal and N. 
Kehtarnavaz. Guidelines and benchmarks for deploymentof deep learning models on smartphones as real-time apps. MachineLearning and Knowledge Extraction, 1(1):450–465, 2019. → page 5167[82] R. Shokri and V. Shmatikov. Privacy-preserving deep learning. InProceedings of the 22nd ACM SIGSAC conference on computer andcommunications security, pages 1310–1321, 2015. → pages 12, 13[83] K. Simonyan and A. Zisserman. Very deep convolutional networks forlarge-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. →pages 18, 20, 27, 28[84] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov.Dropout: a simple way to prevent neural networks from overfitting. TheJournal of Machine Learning Research, 15(1):1929–1958, 2014. → page 21[85] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. InProceedings of the IEEE conference on computer vision and patternrecognition, pages 1–9, 2015. → pages 18, 20[86] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking theinception architecture for computer vision. In Proceedings of the IEEEconference on computer vision and pattern recognition, pages 2818–2826,2016.[87] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4,inception-resnet and the impact of residual connections on learning. InThirty-First AAAI Conference on Artificial Intelligence, 2017. → pages18, 20[88] R. Tang, W. Wang, Z. Tu, and J. Lin. An experimental analysis of the powerconsumption of convolutional neural networks for keyword spotting. In2018 IEEE International Conference on Acoustics, Speech and SignalProcessing (ICASSP), pages 5479–5483. IEEE, 2018. → page 51[89] Y. Tian, K. Pei, S. Jana, and B. Ray. Deeptest: Automated testing ofdeep-neural-network-driven autonomous cars. In Proceedings of the 40thinternational conference on software engineering, pages 303–314. ACM,2018. → pages 7, 19, 27[90] S. Venkataramani et al. 
Axnn: energy-efficient neuromorphic systems usingapproximate computing. In Proceedings of the 2014 internationalsymposium on Low power electronics and design. ACM, 2014. → pages4, 12, 3868[91] J. Wei, A. Thomas, G. Li, and K. Pattabiraman. Quantifying the accuracy ofhigh-level fault injection techniques for hardware faults. In 2014 44thAnnual IEEE/IFIP International Conference on Dependable Systems andNetworks, pages 375–382. IEEE, 2014. → page 8[92] Y. Wu and Y. Liu. Robust truncated hinge loss support vector machines.Journal of the American Statistical Association, 102(479):974–983, 2007.→ page 12[93] Z. Xiong, M. K. Stiles, and J. Zhao. Robust ecg signal classification fordetection of atrial fibrillation using a novel neural network. In 2017Computing in Cardiology (CinC), pages 1–4. IEEE, 2017. → pages 2, 6, 19[94] H.-J. Yoon, A. Ramanathan, and G. Tourassi. Multi-task deep neuralnetworks for automated extraction of primary site and laterality informationfrom cancer pathology reports. In INNS Conference on Big Data, pages195–204. Springer, 2016. → pages 1, 2, 6, 19, 59[95] K. Zhao, S. Di, S. Li, X. Liang, Y. Zhai, J. Chen, K. Ouyang, F. Cappello,and Z. Chen. Algorithm-based fault tolerance for convolutional neuralnetworks. arXiv preprint arXiv:2003.12203, 2020. → page 1169
Title | Understanding and improving the error resilience of machine learning systems
Creator | Chen, Zitao
Publisher | University of British Columbia
Date Issued | 2020
Genre | Thesis/Dissertation
Type | Text
Language | eng
Date Available | 2020-04-17
Provider | Vancouver : University of British Columbia Library
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International
DOI | 10.14288/1.0389875
URI | http://hdl.handle.net/2429/74063
Degree | Master of Applied Science - MASc
Program | Electrical and Computer Engineering
Affiliation | Applied Science, Faculty of; Electrical and Computer Engineering, Department of
Degree Grantor | University of British Columbia
Graduation Date | 2020-05
Campus | UBCV
Scholarly Level | Graduate
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository | DSpace