UBC Theses and Dissertations

Reliability analysis of deep learning accelerators via software-level fault injection

Sadati Seyedmahaleh, Seyedmani

Abstract

Deep learning accelerators (DLAs) are increasingly deployed in safety-critical applications such as autonomous vehicles and medical diagnostics. However, these specialized chips are vulnerable to hardware faults, including transient faults caused by radiation and permanent faults due to aging or manufacturing defects. Existing fault injection (FI) approaches face a trade-off between realism and scalability: hardware-level methods, such as particle beam testing and register-transfer-level (RTL) simulations, offer high accuracy but are costly and slow, whereas software-level FI is efficient but often inaccurate due to its lack of hardware awareness. This thesis proposes a hardware-informed approach to software-level FI that achieves both accuracy and scalability. We extract essential hardware-level insights, such as realistic fault models and microarchitectural characteristics, through a small, targeted set of hardware-level studies. These insights enable the development of two complementary software-level frameworks that realistically simulate the effects of hardware faults in DLAs: (1) TPU-FI, which derives realistic transient fault models from beam experiments on Google TPUs and implements them in TensorFlow kernels, and (2) DLAFI, which uses RTL simulations of systolic arrays to capture microarchitectural characteristics and perform accurate permanent fault injection at the LLVM IR level.
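To make the transient/permanent distinction concrete, here is a minimal sketch of software-level fault injection on a float32 accumulator. It assumes the generic textbook fault models (a single bit flip for a transient fault, a stuck-at bit for a permanent fault); these are not the beam-derived or RTL-informed models used by TPU-FI or DLAFI, and all function names (`flip_bit`, `stuck_at`, `dot`) are hypothetical and chosen only for illustration.

```python
import struct

def float_to_bits(x: float) -> int:
    """Return the IEEE-754 float32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(b: int) -> float:
    """Interpret a 32-bit pattern as an IEEE-754 float32 value."""
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]

def flip_bit(x: float, bit: int) -> float:
    """Transient-fault model (assumed): invert one bit of the value."""
    return bits_to_float(float_to_bits(x) ^ (1 << bit))

def stuck_at(x: float, bit: int, value: int) -> float:
    """Permanent-fault model (assumed): force one bit to 0 or 1."""
    bits = float_to_bits(x)
    mask = 1 << bit
    bits = (bits | mask) if value else (bits & ~mask)
    return bits_to_float(bits)

def dot(a, b, fault=None, fault_step=None):
    """Dot product with a float32 accumulator.

    If `fault` is given, it is applied to the accumulator after the
    multiply-accumulate at `fault_step` only (transient), or after every
    step when `fault_step` is None (permanent).
    """
    acc = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        # Round-trip through float32 to mimic a 32-bit accumulator register.
        acc = bits_to_float(float_to_bits(acc + x * y))
        if fault is not None and (fault_step is None or i == fault_step):
            acc = fault(acc)
    return acc

a, b = [1.0, 2.0, 3.0], [1.0, 1.0, 1.0]
golden = dot(a, b)
# Transient: flip the sign bit once, after the second accumulation.
transient = dot(a, b, fault=lambda v: flip_bit(v, 31), fault_step=1)
# Permanent: sign bit of the accumulator stuck at 1 on every step.
permanent = dot(a, b, fault=lambda v: stuck_at(v, 31, 1))
```

Comparing `transient` and `permanent` against `golden` shows how the same fault site corrupts the result differently depending on whether the fault occurs once or persists across every operation.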

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International