Open Collections
UBC Theses and Dissertations
Efficient modeling of error propagation in GPU programs
Anwer, Abdul Rehman
Abstract
Graphics Processing Units (GPUs) are popular for reliability-conscious uses in High Performance Computing (HPC), machine learning algorithms, and safety-critical applications. Fault injection (FI) techniques are generally used to determine the reliability profiles of programs in the presence of soft errors. However, these techniques are highly resource- and time-intensive. GPU applications are highly multi-threaded and typically execute hundreds of thousands of threads, which makes it challenging to apply FI techniques. Prior research developed a model called TRIDENT to analytically predict the probability of Silent Data Corruption (SDC), i.e., incorrect output without any indication, in single-threaded CPU applications, without requiring any FIs. Unfortunately, TRIDENT is incompatible with GPU programs, due to their high degree of parallelism and different memory architectures compared to CPU programs. The main challenge is that modeling error propagation across thousands of threads in a GPU kernel requires enormous amounts of data to be profiled and analyzed, posing a major scalability bottleneck for HPC applications. Further, there are GPU-specific behaviors that must be modeled for accuracy. In this thesis, we propose GPU-TRIDENT, an accurate and scalable technique for modeling error propagation in GPU programs. Our key insight is that error propagation across threads can be modeled based on program execution patterns. These can be characterized by the control-flow, loop iteration, data, and thread block patterns of the GPU program. We also identify two major sources of inaccuracy in building analytical models of error propagation and mitigate them to improve accuracy. We find that GPU-TRIDENT can accurately predict the SDC probabilities of both overall GPU programs and individual instructions, and is two orders of magnitude faster than FI-based approaches. We further demonstrate that GPU-TRIDENT can guide selective instruction duplication to protect GPU programs, similar to FI. Finally, we deploy GPU-TRIDENT to assess the input-dependence of the reliability of GPU kernels and find that the SDC probability of kernels is generally insensitive to variation in inputs.
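For readers unfamiliar with the FI baseline that the abstract compares against, the following is a minimal sketch of a fault injection campaign on a toy computation. It is not GPU-TRIDENT and not taken from the thesis: the function names `kernel` and `estimate_sdc_probability`, the 8-bit intermediate value, and the single-bit-flip fault model are all assumptions chosen for illustration. Each trial flips one random bit in one intermediate value and counts the run as an SDC if the final output differs from the fault-free ("golden") output.

```python
import random


def kernel(values, fault=None):
    """Toy computation: sum the high nibbles of 8-bit intermediate values.

    `fault` is an optional (index, bit) pair (a hypothetical fault model):
    flip that bit of the intermediate value computed for element `index`.
    """
    total = 0
    for i, v in enumerate(values):
        scaled = (v * 3) & 0xFF          # 8-bit intermediate value
        if fault is not None and fault[0] == i:
            scaled ^= 1 << fault[1]      # inject a single bit flip
        total += scaled >> 4             # low 4 bits are discarded, so flips there are masked
    return total


def estimate_sdc_probability(values, trials, seed=0):
    """Estimate the SDC probability by repeated random fault injection."""
    rng = random.Random(seed)
    golden = kernel(values)              # fault-free reference output
    sdc = 0
    for _ in range(trials):
        fault = (rng.randrange(len(values)), rng.randrange(8))
        if kernel(values, fault) != golden:
            sdc += 1                     # output silently differs from golden
    return sdc / trials


if __name__ == "__main__":
    print(estimate_sdc_probability(list(range(16)), trials=2000))
```

In this toy, a flip in any of the four low bits is masked by the shift, so the estimate should hover around one half. The cost of the campaign grows with the number of trials and with the cost of each run, which is the overhead that an analytical model aims to avoid.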
Item Metadata
Title |
Efficient modeling of error propagation in GPU programs
|
Creator |
Anwer, Abdul Rehman
|
Publisher |
University of British Columbia
|
Date Issued |
2020
|
Genre | |
Type | |
Language |
eng
|
Date Available |
2020-06-16
|
Provider |
Vancouver : University of British Columbia Library
|
Rights |
Attribution-NonCommercial-NoDerivatives 4.0 International
|
DOI |
10.14288/1.0391897
|
URI | |
Degree | |
Program | |
Affiliation | |
Degree Grantor |
University of British Columbia
|
Graduation Date |
2020-11
|
Campus | |
Scholarly Level |
Graduate
|
Rights URI | |
Aggregated Source Repository |
DSpace
|