Reliability analysis of deep learning accelerators via software-level fault injection

UBC Theses and Dissertations

Sadati Seyedmahaleh, Seyedmani

Abstract

Deep learning accelerators (DLAs) are increasingly deployed in safety-critical applications such as autonomous vehicles and medical diagnostics. However, these specialized chips are vulnerable to hardware faults, including transient faults caused by radiation and permanent faults due to aging or manufacturing defects. Existing fault injection (FI) approaches face a trade-off between realism and scalability: hardware-level methods, such as particle beam testing and register-transfer-level (RTL) simulations, offer high accuracy but are costly and slow, whereas software-level FI is efficient but often inaccurate due to its lack of hardware awareness. This thesis proposes a hardware-informed approach to software-level FI that achieves both accuracy and scalability. We extract essential hardware-level insights, such as realistic fault models and microarchitectural characteristics, through a small, targeted set of hardware-level studies. These insights enable the development of two complementary software-level frameworks that realistically simulate the effects of hardware faults in DLAs: (1) TPU-FI, which derives realistic transient fault models from beam experiments on Google TPUs and implements them in TensorFlow kernels, and (2) DLAFI, which uses RTL simulations of systolic arrays to capture microarchitectural characteristics and perform accurate permanent fault injection at the LLVM IR level.
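
To make the two fault classes named in the abstract concrete, the following is a minimal sketch of generic software-level fault injection on a single float32 value: a one-off bit flip for a transient fault and a stuck-at bit for a permanent fault. It is not the TPU-FI or DLAFI implementation; the helper names `flip_bit` and `stuck_at` are hypothetical, and the single-bit fault model is an assumption chosen for illustration rather than the hardware-derived fault models the thesis describes.

```python
import struct


def _to_bits(value: float) -> int:
    """Return the 32-bit IEEE 754 single-precision encoding of value."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    return bits


def _from_bits(bits: int) -> float:
    """Decode a 32-bit pattern back into a Python float."""
    (value,) = struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))
    return value


def flip_bit(value: float, bit: int) -> float:
    """Transient fault (assumed model): invert one bit, 0 = LSB, 31 = sign."""
    if not 0 <= bit < 32:
        raise ValueError("bit must be in [0, 31]")
    return _from_bits(_to_bits(value) ^ (1 << bit))


def stuck_at(value: float, bit: int, level: int) -> float:
    """Permanent fault (assumed model): force one bit to 0 or 1 on every use."""
    if not 0 <= bit < 32:
        raise ValueError("bit must be in [0, 31]")
    if level not in (0, 1):
        raise ValueError("level must be 0 or 1")
    bits = _to_bits(value)
    mask = 1 << bit
    return _from_bits(bits | mask if level else bits & ~mask)
```

The sketch shows why bit position matters: flipping the sign bit of 1.0 only negates it, while flipping the top exponent bit turns it into infinity. A transient fault would apply `flip_bit` once during a run, whereas a permanent fault would apply `stuck_at` to every value passing through the affected unit.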

Item Metadata

Title	Reliability analysis of deep learning accelerators via software-level fault injection
Creator	Sadati Seyedmahaleh, Seyedmani
Supervisor	Pattabiraman, Karthik
Publisher	University of British Columbia
Date Issued	2025
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2025-12-16
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NonCommercial-NoDerivatives 4.0 International
DOI	10.14288/1.0451035
URI	http://hdl.handle.net/2429/93197
Degree (Theses)	Master of Applied Science - MASc
Program (Theses)	Electrical and Computer Engineering
Affiliation	Applied Science, Faculty of; Electrical and Computer Engineering, Department of
Degree Grantor	University of British Columbia
Graduation Date	2026-05
Campus	UBCV
Scholarly Level	Graduate
Rights URI	http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository	DSpace
