UBC Theses and Dissertations

Resilience assessment of machine learning applications under hardware faults

Agarwal, Udit Kumar

Abstract

Machine learning (ML) applications are widely deployed in safety-critical domains such as autonomous vehicles (AVs) and medical diagnosis. Vision-based ML models like ResNet are used for object classification and lane detection, while Large Language Models (LLMs) like ChatGPT are used in AVs to enable robust and flexible voice commands. Deploying ML models in safety-critical scenarios requires that the models be reliable. In the first part of this thesis, we focus on understanding the resilience of ML models against transient hardware faults in CPUs. To this end, we present LLTFI, an LLVM IR-level fault injection (FI) tool, which we use to evaluate the effect of transient faults on Deep Neural Networks (DNNs) and LLMs. We found that LLTFI is more precise than TensorFI, an application-level FI tool proposed by prior work. Unlike LLTFI, TensorFI underestimates the resilience of DNNs by implicitly assuming that every injected fault corrupts the outputs of the intermediate layers of the DNN. Using LLTFI, we also evaluated the efficacy of Selective Instruction Duplication in making DNNs more resilient against transient faults. In DNNs, transient faults cause the model to misclassify or mispredict the object; in LLMs, we found that transient faults cause the model to produce semantically and syntactically incorrect outputs. In the second part of this thesis, we evaluate the effect of permanent stuck-at faults in systolic arrays on DNNs. We present SystoliFI, a Register Transfer Level (RTL) FI tool that injects permanent stuck-at faults into the systolic array, and use it to understand how these faults manifest in the intermediate layers of DNNs. We found that the manifestation of stuck-at faults varies significantly with the type of operation (convolution vs. matrix multiplication), the operation size, and the systolic array size.
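The two fault models named in the abstract can be illustrated with a minimal Python sketch. This is not LLTFI or SystoliFI, and it does not reproduce their interfaces. The function names (`flip_bit`, `stuck_at`, `matmul_with_stuck_at`) are hypothetical, the transient fault is assumed to be a single bit flip in a 32-bit float, and the systolic array is assumed to map each output element to one processing element whose 32-bit accumulator has one bit stuck.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Transient fault: flip one bit (0 = LSB, 31 = sign) of a float32 value."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    (out,) = struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))
    return out


def stuck_at(value: int, bit: int, level: int, width: int = 32) -> int:
    """Permanent fault: force one bit of a two's-complement integer to 0 or 1."""
    mask = (1 << width) - 1
    raw = value & mask
    raw = raw | (1 << bit) if level else raw & ~(1 << bit)
    raw &= mask
    # Reinterpret the raw bit pattern as a signed integer.
    return raw - (1 << width) if raw >> (width - 1) else raw


def matmul_with_stuck_at(a, b, faulty_pe, bit, level):
    """Integer matmul in which one processing element's accumulator has a stuck bit.

    Assumption: output element (i, j) is accumulated by processing element (i, j).
    """
    n, k, m = len(a), len(b), len(b[0])
    out = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0
            for t in range(k):
                acc += a[i][t] * b[t][j]
                if (i, j) == faulty_pe:
                    # The fault is applied on every accumulation step.
                    acc = stuck_at(acc, bit, level)
            out[i][j] = acc
    return out


if __name__ == "__main__":
    print(flip_bit(1.0, 31))  # flipping the sign bit
    print(flip_bit(1.0, 0))   # flipping the lowest mantissa bit
    print(matmul_with_stuck_at([[1, 2], [3, 4]], [[1, 0], [0, 1]], (0, 0), 3, 1))
```

In this sketch, a flip in a low mantissa bit barely changes the value, while a flip in the sign or exponent bits changes it drastically, which is one reason not every injected fault corrupts a layer's output. The stuck-at fault, by contrast, is re-applied on every accumulation, so its effect depends on which element is faulty and how much work is mapped onto it.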

Rights

Attribution 4.0 International