Resilience assessment of machine learning applications under hardware faults

UBC Theses and Dissertations

Featured Collection

UBC Theses and Dissertations

Resilience assessment of machine learning applications under hardware faults Agarwal, Udit Kumar

Abstract

Machine learning (ML) applications have been ubiquitously deployed across critical domains such as autonomous vehicles (AVs), and medical diagnosis. Vision-based ML models like ResNet are used for object classification and lane detection, while Large Language Models (LLMs) like ChatGPT are used in cars to enable robust and flexible voice commands in AVs. The use of ML models in safety-critical scenarios requires reliable ML models. In the first part of this thesis, we primarily focus on understanding the resilience of ML models against transient hardware faults in CPUs. Towards this end, we present an LLVM IR-level FI tool, LLTFI, which we use to evaluate the effect of transient faults on Deep Neural Networks (DNNs) and LLMs. We found that LLTFI is more precise than TensorFI, an application-level FI tool proposed by prior work. Unlike LLTFI, TensorFI underestimates the resilience of DNNs by implicitly assuming that every injected fault corrupts the outputs of the intermediate layers of the DNN. Using LLTFI, we also evaluated the efficacy of Selective Instruction Duplication to make DNNs more resilient against transient faults. While in the case of DNNs, transient faults cause the model to misclassify or mispredict the object, for LLMs, we found transient faults to cause the model to produce semantically and syntactically incorrect outputs. In the second part of this thesis, we evaluate the effect of permanent stuck-at faults in systolic arrays on DNNs. We present a Register Transfer (RTL)-Level FI tool, called SystoliFI, to inject permanent stuck-at faults in the systolic array, which we use to understand the manifestation of stuck-at faults in systolic arrays in the intermediate layers of the DNNs. We found that the manifestation of the stuck-at faults varies significantly with the type of operation (Convolution vs. Matrix multiplication), the operation size, and the systolic array size.

Item Metadata

Title	Resilience assessment of machine learning applications under hardware faults
Creator	Agarwal, Udit Kumar
Supervisor	Pattabiraman, Karthik
Publisher	University of British Columbia
Date Issued	2023
Description	Machine learning (ML) applications have been ubiquitously deployed across critical domains such as autonomous vehicles (AVs), and medical diagnosis. Vision-based ML models like ResNet are used for object classification and lane detection, while Large Language Models (LLMs) like ChatGPT are used in cars to enable robust and flexible voice commands in AVs. The use of ML models in safety-critical scenarios requires reliable ML models. In the first part of this thesis, we primarily focus on understanding the resilience of ML models against transient hardware faults in CPUs. Towards this end, we present an LLVM IR-level FI tool, LLTFI, which we use to evaluate the effect of transient faults on Deep Neural Networks (DNNs) and LLMs. We found that LLTFI is more precise than TensorFI, an application-level FI tool proposed by prior work. Unlike LLTFI, TensorFI underestimates the resilience of DNNs by implicitly assuming that every injected fault corrupts the outputs of the intermediate layers of the DNN. Using LLTFI, we also evaluated the efficacy of Selective Instruction Duplication to make DNNs more resilient against transient faults. While in the case of DNNs, transient faults cause the model to misclassify or mispredict the object, for LLMs, we found transient faults to cause the model to produce semantically and syntactically incorrect outputs. In the second part of this thesis, we evaluate the effect of permanent stuck-at faults in systolic arrays on DNNs. We present a Register Transfer (RTL)-Level FI tool, called SystoliFI, to inject permanent stuck-at faults in the systolic array, which we use to understand the manifestation of stuck-at faults in systolic arrays in the intermediate layers of the DNNs. We found that the manifestation of the stuck-at faults varies significantly with the type of operation (Convolution vs. Matrix multiplication), the operation size, and the systolic array size.
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2023-08-31
Provider	Vancouver : University of British Columbia Library
Rights	Attribution 4.0 International
DOI	10.14288/1.0435699
URI	http://hdl.handle.net/2429/85737
Degree	Master of Applied Science - MASc
Program	Electrical and Computer Engineering
Affiliation	Applied Science, Faculty of; Electrical and Computer Engineering, Department of
Degree Grantor	University of British Columbia
Graduation Date	2023-11
Campus	UBCV
Scholarly Level	Graduate
Rights URI	http://creativecommons.org/licenses/by/4.0/
Aggregated Source Repository	DSpace

Open Collections

UBC Theses and Dissertations

UBC Theses and Dissertations

Resilience assessment of machine learning applications under hardware faults Agarwal, Udit Kumar

Abstract

Item Metadata

Item Media

Item Citations and Data

Rights