Open Collections
UBC Theses and Dissertations
Reliability analysis of deep learning accelerators via software-level fault injection
Sadati Seyedmahaleh, Seyedmani
Abstract
Deep learning accelerators (DLAs) are increasingly deployed in
safety-critical applications such as autonomous vehicles and medical
diagnostics. However, these specialized chips are vulnerable to hardware
faults, including transient faults caused by radiation and permanent
faults due to aging or manufacturing defects. Existing fault injection
(FI) approaches face a trade-off between realism and scalability:
hardware-level methods, such as particle beam testing and
register-transfer-level (RTL) simulations, offer high accuracy but are
costly and slow, whereas software-level FI is efficient but often
inaccurate due to its lack of hardware awareness.
This thesis proposes a hardware-informed approach to software-level FI
that achieves both accuracy and scalability. We extract essential
hardware-level insights, such as realistic fault models and
microarchitectural characteristics, through a small, targeted set of
hardware-level studies. These insights enable the development of two
complementary software-level frameworks that realistically simulate the
effects of hardware faults in DLAs: (1) TPU-FI, which derives realistic
transient fault models from beam experiments on Google TPUs and
implements them in TensorFlow kernels, and (2) DLAFI, which uses RTL
simulations of systolic arrays to capture microarchitectural
characteristics and perform accurate permanent fault injection at the
LLVM IR level.
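The difference between the transient and permanent fault models mentioned above can be illustrated with a minimal sketch. This is not the TPU-FI or DLAFI implementation, which the abstract says operate in TensorFlow kernels and at the LLVM IR level. It is a plain-Python illustration under stated assumptions: values are treated as IEEE-754 single-precision, a transient fault is modeled as a one-time single-bit flip in one output element, and a permanent fault is modeled as a stuck-at bit applied to every output in one column. The helper names (`flip_bit`, `stuck_at`, `matmul`) and the mapping of a faulty processing element to a whole output column are hypothetical simplifications.

```python
import struct


def float_to_bits(x: float) -> int:
    """Reinterpret a value as its IEEE-754 single-precision bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(b: int) -> float:
    """Reinterpret a 32-bit pattern as a single-precision value."""
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]


def flip_bit(x: float, bit: int) -> float:
    """Transient fault model (assumed): invert one bit of the value once."""
    return bits_to_float(float_to_bits(x) ^ (1 << bit))


def stuck_at(x: float, bit: int, value: int) -> float:
    """Permanent fault model (assumed): force one bit to 0 or 1."""
    bits = float_to_bits(x)
    bits = bits | (1 << bit) if value else bits & ~(1 << bit)
    return bits_to_float(bits)


def matmul(a, b, fault=None):
    """Plain matrix multiply; `fault(row, col, value)` may corrupt each output."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = sum(a[i][p] * b[p][j] for p in range(k))
            out[i][j] = fault(i, j, acc) if fault else acc
    return out


a = [[1.0, 2.0], [3.0, 4.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

# Fault-free reference run.
golden = matmul(a, identity)

# Transient: flip the sign bit (bit 31) of a single output element.
transient = matmul(
    a, identity, lambda i, j, v: flip_bit(v, 31) if (i, j) == (0, 1) else v
)

# Permanent: sign bit stuck at 1 for every output in column 0.
permanent = matmul(
    a, identity, lambda i, j, v: stuck_at(v, 31, 1) if j == 0 else v
)
```

Comparing `transient` and `permanent` against `golden` shows the qualitative difference: the transient fault corrupts one element, while the stuck-at fault corrupts every result that passes through the affected column. How a physical processing element maps to output elements depends on the accelerator's dataflow, which is the kind of microarchitectural detail the abstract says is obtained from RTL simulation.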
Item Metadata
| Title | Reliability analysis of deep learning accelerators via software-level fault injection |
| Creator | |
| Supervisor | |
| Publisher | University of British Columbia |
| Date Issued | 2025 |
| Genre | |
| Type | |
| Language | eng |
| Date Available | 2025-12-16 |
| Provider | Vancouver : University of British Columbia Library |
| Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
| DOI | 10.14288/1.0451035 |
| URI | |
| Degree (Theses) | |
| Program (Theses) | |
| Affiliation | |
| Degree Grantor | University of British Columbia |
| Graduation Date | 2026-05 |
| Campus | |
| Scholarly Level | Graduate |
| Rights URI | |
| Aggregated Source Repository | DSpace |