
UBC Theses and Dissertations

Efficient modeling of error propagation in GPU programs Anwer, Abdul Rehman 2020



Full Text

Efficient Modeling of Error Propagation in GPU Programs

by

Abdul Rehman Anwer

BEE, National University of Sciences and Technology (Pakistan), 2015

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Electrical and Computer Engineering)

The University of British Columbia (Vancouver)

June 2020

© Abdul Rehman Anwer, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled: "Efficient Modeling of Error Propagation in GPU Programs" submitted by Abdul Rehman Anwer in partial fulfillment of the requirements for the degree of Master of Applied Science in Electrical and Computer Engineering.

Examining Committee:

Karthik Pattabiraman, Electrical and Computer Engineering
Supervisor

Tor Aamodt, Electrical and Computer Engineering
Supervisory Committee Member

Abstract

Graphics Processing Units (GPUs) are popular for reliability-conscious uses in High Performance Computing (HPC), machine learning algorithms, and safety-critical applications. Fault Injection (FI) techniques are generally used to determine the reliability profiles of programs in the presence of soft errors. However, these techniques are highly resource- and time-intensive. GPU applications are highly multi-threaded and typically execute hundreds of thousands of threads, which makes it challenging to apply FI techniques. Prior research developed a model called TRIDENT to analytically predict Silent Data Corruption (SDC) (i.e., incorrect output without any indication) probabilities of single-threaded CPU applications, without requiring any FIs. Unfortunately, TRIDENT is incompatible with GPU programs, due to their high degree of parallelism and different memory architectures compared to CPU programs. The main challenge is that modeling error propagation across thousands of threads in a Graphics Processing Unit (GPU) kernel requires enormous amounts of data to be profiled and analyzed, posing a major scalability bottleneck for HPC applications. Further, there are GPU-specific behaviors that must be modeled for accuracy.

In this thesis, we propose GPU-TRIDENT, an accurate and scalable technique for modeling error propagation in GPU programs. Our key insight is that error propagation across threads can be modeled based on program execution patterns. These can be characterized by control-flow, loop iteration, data, and thread block patterns of the GPU program. We also identify two major sources of inaccuracy in building analytical models of error propagation and mitigate them to improve accuracy. We find that GPU-TRIDENT can predict the SDC probabilities of both the overall GPU programs and individual instructions accurately, and is two orders of magnitude faster than FI-based approaches. We also demonstrate that GPU-TRIDENT can guide selective instruction duplication to protect GPU programs similar to FI. We also deploy GPU-TRIDENT to assess the input-dependence of reliability of GPU kernels and find that the SDC probability of kernels is generally insensitive to variation in inputs.

Lay Summary

Transient hardware faults are increasing due to growing system scales and progressive technology scaling. Graphics Processing Units, which efficiently process a large amount of data using their parallel structure, are susceptible to these faults, which affect their reliability.
In this thesis, we analyze GPU applications and identify challenges in modeling error propagation in a scalable and accurate manner. These challenges relate to error propagation across threads and the large amounts of data to be profiled due to the highly parallel nature of GPUs. We then propose heuristics to address these challenges, based on data patterns and control-flow similarity across threads, thread blocks, and loop iterations in GPU applications. Next, we use the proposed techniques to guide selective instruction duplication to increase the reliability of GPU applications with low overhead. Finally, we leverage the low resource usage of our techniques to study the reliability characteristics of GPU applications with multiple inputs.

Preface

This thesis is the result of work carried out by myself, in collaboration with my advisor, Dr. Karthik Pattabiraman, Dr. Guanpeng Li (University of Illinois at Urbana-Champaign), Dr. Timothy Tsai, Dr. Siva Kumar Sastry Hari, and Dr. Michael Sullivan (Nvidia Research). All chapters, with the exception of Section 7.2, are based on work published in the 2019 IEEE Workshop on Silicon Errors in Logic - System Effects (SELSE) and another work in submission. I was responsible for conceiving the ideas, designing and conducting the experiments, compiling the results, and writing the paper. Dr. Pattabiraman was responsible for overseeing the project, providing guidance and feedback, and editing and writing parts of the paper. Guanpeng contributed his knowledge and expertise from his closely related prior work; he provided insights into the problems, helped in designing experiments, and assisted with the tools used for these experiments. Timothy, Siva, and Michael provided analysis and insights throughout the course of the project.

Abdul Rehman Anwer, Guanpeng Li, Karthik Pattabiraman, Siva Kumar Sastry Hari, Michael Sullivan, and Timothy Tsai. "Towards analytically evaluating the error resilience of GPU Programs". In IEEE Workshop on Silicon Errors in Logic - System Effects (SELSE), 2019.
Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Acknowledgments
1 Introduction
    1.1 Soft Errors in GPUs
    1.2 Motivation
    1.3 Approach and Contributions
2 Background
    2.1 GPU Architecture and Programming Model
    2.2 Fault Model
    2.3 Terms and Definitions
    2.4 LLVM Compiler
    2.5 TRIDENT Framework
        2.5.1 Static-Instruction Sub-Model (fs)
        2.5.2 Control-Flow Sub-Model (fc)
        2.5.3 Memory Sub-Model (fm)
3 Related Work
    3.1 GPU Error Resilience
    3.2 Modeling Error Propagation
    3.3 Pruning of FI Space
    3.4 Resilience Analysis Across Multiple Inputs
    3.5 Summary
4 Challenges
    4.1 Overview
    4.2 Large Amount of Profiling Data
    4.3 Large Number of Threads
    4.4 Accumulated Inaccuracy From Individual Threads
    4.5 Summary
5 Approach
    5.1 Profiling for Intra-Thread Memory Dependency (H-Intra)
        5.1.1 Selection Based on Thread Execution Path (H-Intra-Path)
        5.1.2 Selection Based on Loop Patterns (H-Intra-Loop)
    5.2 Profiling for Inter-Thread Memory Dependencies (H-Inter)
    5.3 Value-Based Masking (H-Value)
        5.3.1 Multiplication by Zero
        5.3.2 Lucky Stores
    5.4 Analysis Workflow
    5.5 Summary
6 Evaluation
    6.1 Experimental Setup
        6.1.1 Workflow and Implementation of GPU-TRIDENT
        6.1.2 FI Method
        6.1.3 Benchmarks
    6.2 Accuracy
        6.2.1 Overall SDC Probability
        6.2.2 Instruction SDC Probability
    6.3 Scalability
        6.3.1 Kernel SDC Probability
        6.3.2 Instruction SDC Probability
    6.4 Sources of Inaccuracies in GPU-TRIDENT
        6.4.1 Validity of Faulty Store Addresses
        6.4.2 Conservativeness in Determining Memory Corruption
        6.4.3 Heuristics About Error Propagation
    6.5 Limitations
        6.5.1 FI Methodology
        6.5.2 Evaluation Kernels
    6.6 Summary
7 Case Studies
    7.1 Selective Instruction Duplication
    7.2 Characterizing SDCs Across Multiple Inputs
8 Conclusion and Future Work
    8.1 Summary
    8.2 Future Work
        8.2.1 Extending GPU-TRIDENT
        8.2.2 End to End Reliability Analysis of GPU Applications
        8.2.3 Error Propagation Modeling in Accelerator Platforms
Bibliography
List of Tables

Table 5.1  Thread group details for the studied kernels
Table 5.2  Average control-flow (CF) similarity percentage of threads in a thread block
Table 6.1  Benchmarks used: 8 are from the Rodinia suite, and the other 4 are open source HPC applications (OS HPC).
Table 7.1  Number of threads invoked by different kernels, for multiple inputs.

List of Figures

Figure 2.1  Workflow of TRIDENT [35]
Figure 2.2  Working of TRIDENT [35].
Figure 4.1  A slightly modified code segment from the Circuit benchmark [1].
Figure 4.2  Prediction of SDC probability by a modified GPU-TRIDENT that does not model error propagation across threads, for benchmarks that use shared memory; NW K1 means the first kernel of the NW benchmark.
Figure 5.1  Conditional branches inside a loop in Pathfinder.
Figure 5.2  Profiling inter-thread memory dependencies. (Values in squares represent memory addresses.)
Figure 5.3  Example of error masking due to multiply by 0.
Figure 5.4  Example of lucky stores
Figure 6.1  The SDC probability of GPU kernels predicted by FI and GPU-TRIDENT. Error bars are shown for the FI estimates at the 99% confidence level - they range from ±0.53% to ±1.82%.
Figure 6.2  Computation spent to predict SDC probability
Figure 6.3  Performance comparison of FI and GPU-TRIDENT
Figure 7.1  SDC probabilities of selective instruction duplication.
Figure 7.2  Average protection curve obtained by FI and GPU-TRIDENT.
Figure 7.3  Protection curves obtained by FI and GPU-TRIDENT for individual kernels, as protection overhead is varied from 0% to 100%. Blue lines indicate FI, and orange lines represent GPU-TRIDENT.
Figure 7.4  SDC probabilities predicted by GPU-TRIDENT with multiple inputs.

List of Abbreviations

FI   Fault Injection
GPU  Graphics Processing Unit
HPC  High Performance Computing
SDC  Silent Data Corruption

Acknowledgments

I would like to express my sincere gratitude to my advisor, Dr. Karthik Pattabiraman, for his consistent support during my Masters. His patience, invaluable guidance, and in-depth knowledge made my stay at UBC extremely rewarding and helped me grow personally as well as professionally. His enthusiasm and optimistic outlook are quite inspiring for young researchers like me.

In addition to my advisor, I would like to thank my thesis examining committee, Dr. Tor Aamodt and Dr. Matei Ripeanu, for their insightful feedback.

I would like to thank Dr. Guanpeng Li and my other colleagues in the Dependable Systems Lab at UBC for their support and stimulating discussions throughout my research endeavors.

Finally, I would like to thank my family, especially my sister, for providing me with unfailing encouragement and backing throughout my years of study.
My academic achievements would not have been accomplished without this support.

Chapter 1

Introduction

1.1 Soft Errors in GPUs

GPUs have been widely adopted as accelerators for scientific applications and machine learning algorithms due to their high performance and power efficiency. Their highly parallel architecture makes them ideal for efficiently parallelizing the processing of large chunks of data [2]. Recently, they are being used in safety-critical systems (e.g., self-driving cars [4]), which makes their reliability a major concern.

The performance of processors has seen immense improvements due to faster transistors and increasing circuit density. This is an ongoing trend that will result in smaller and denser components in future processors. Transient hardware faults (i.e., soft errors) are thus predicted to increase in future processors due to these growing system scales, progressive technology scaling, and lowering of operating voltages [50]. Transient errors manifest in the form of bit flips in the system, and can be caused by high-energy particle strikes (e.g., alpha particles), transistor variability, and thermal cycling. As indicated by the name, transient errors, in contrast to manufacturing faults, do not result in permanent defects in the chips; thus, they are not reproducible. In the past, such faults were handled in high-reliability systems through hardware-only solutions such as dual modular redundancy (DMR) and circuit hardening [40]. However, these techniques are challenging to deploy in commodity systems as they incur significant performance and energy overheads. Protecting the system against hardware faults at the application level is a low-overhead alternative to hardware-based approaches.

One consequence of transient errors is incorrect program output, without any indication to the programmer or the rest of the system, which is known as silent data corruption (SDC). SDCs are challenging to detect and can have severe consequences [50] for the reliability and safety of Graphics Processing Unit (GPU) applications. Studies have shown that only a small fraction of the program state is responsible for almost all of the error propagation resulting in SDCs. One can therefore selectively protect the vulnerable program state to meet a target SDC probability while incurring lower energy and performance overheads than full duplication techniques [18, 35]. Hence, it is vital to estimate the SDC probability of a program to decide whether protection is needed and, if so, to reduce the cost of protection by selectively protecting the most SDC-prone program state (e.g., through selective instruction duplication).

1.2 Motivation

To deploy any application-level technique that protects the system from SDCs arising from soft errors, it is necessary to have a comprehensive reliability profile of the application. FI is commonly employed for this purpose. FI systematically perturbs the program state during execution and checks the program output to detect failures (if any). As the injection space for FI is too large to cover exhaustively, thousands of random FI trials are required to obtain statistically significant results. Hence, FI is often very slow and challenging to deploy in practice [22, 23, 35]. Further, selective protection techniques need the SDC probability on a per-instruction basis; this requires hundreds of FIs per instruction, which is very resource-intensive. Researchers have attempted to model error propagation to evaluate SDC probabilities quickly without any FIs [18, 27, 51].
Unfortunately, these techniques (1) have significant inaccuracies [18, 51], (2) require large and difficult-to-obtain training corpuses of FI data that are representative of typical faults [27], (3) do not provide reliability profiles for individual instructions, which is critical for the selective protection of programs [27, 36], or (4) are only applicable to CPU applications [32, 35, 51]. Therefore, there is a need to develop a scalable and accurate error resilience analysis framework for GPU applications.

1.3 Approach and Contributions

This thesis focuses on developing an accurate and scalable model for error propagation in GPU applications, which does not require resource-intensive FI.

In recent work, Li et al. [35] proposed an automated technique called TRIDENT that analytically models error propagation in programs to predict their SDC probabilities. While TRIDENT is accurate and fast in evaluating the vulnerability of single-threaded CPU programs to transient errors, we find it is neither accurate nor scalable for GPU programs, which are highly parallel and have a very different programming model.

In particular, inter-thread data dependencies among the threads in a typical GPU program complicate error propagation [33], and not modeling them leads to large inaccuracies in predicting SDC probabilities; e.g., the mean absolute error in SDC prediction without modeling inter-thread dependencies is 17.2% with respect to FI (Section 4.3). Furthermore, when modeling GPU programs, TRIDENT takes a significant amount of time, as a large amount of run-time information has to be profiled to bootstrap the model, due to the large number of threads in GPU programs. This profiling process is extremely time-consuming and resource-intensive for real-world GPU programs. Finally, to estimate per-instruction reliability accurately, we need to consider data characteristics specific to GPU applications, which TRIDENT unfortunately does not support. This introduces inherent inaccuracies in the model, which, when aggregated, can result in a model that significantly deviates from the actual behavior of the program.

In this work, we propose an accurate and scalable technique to analytically model error propagation in GPU programs. Our key insight is that error propagation in GPU programs can be abstracted using program execution patterns based on memory accesses, loop iterations, and control-flow. We find that analyzing the different levels of memory dependencies in a GPU kernel is often the biggest scalability bottleneck in the model. Carefully selecting only a small subset of threads for the analysis of memory dependencies in a GPU kernel, based on the characteristics mentioned earlier, can therefore significantly reduce the time required for creating the model.

Our work builds on top of the TRIDENT framework [35] and significantly enhances it to solve the above challenges of GPU programs - therefore, we call our technique GPU-TRIDENT.

There are three factors that make GPU-TRIDENT practical for assessing the reliability of real-world GPU programs and protecting them. First, GPU-TRIDENT is nearly as accurate as FI but works without requiring any FIs, and is hence significantly faster than FI. Second, it does not require any training corpuses of FI data for creating the model.
Finally, GPU-TRIDENT can efficiently guide the selective protection of a GPU program given a reliability target and a performance overhead budget.

To the best of our knowledge, we are the first to efficiently model error propagation for GPU programs, using neither FI nor training data based on FI, to estimate both the overall SDC probability of a program and the SDC probability of its individual instructions.

Contributions: The main contributions of this thesis are:

• Identify challenges in efficiently and accurately modeling error propagation for GPU applications (Chapter 4).

• Propose heuristics for pruning the state space for modeling error propagation in GPU programs. The heuristics are based on similarity among the control flow and memory access patterns of the program's threads (Chapter 5).

• Build GPU-TRIDENT, an efficient and accurate model of error propagation for GPU programs that implements the above heuristics using the LLVM compiler [30] and is completely automated.

• Compare the accuracy and scalability of GPU-TRIDENT to FI when predicting the SDC probabilities of individual instructions and of entire GPU kernels (Chapter 6), for 17 GPU kernels belonging to 12 applications.

• Demonstrate the use of GPU-TRIDENT to guide selective instruction duplication under a given overhead budget (Chapter 7).

• Study the input-dependence of the resilience characteristics of GPU kernels, leveraging the low-overhead methodology of GPU-TRIDENT (Chapter 7).

Our main results are as follows:

• SDC probability predictions from GPU-TRIDENT have a high agreement (Pearson correlation coefficient of 0.88) with the FI results. Individual instructions of most kernels also have a high agreement with the FI results (average Pearson correlation coefficient of 0.83).

• GPU-TRIDENT incurs a fixed initial overhead and a small incremental overhead for each sampled instruction, while FI incurs an overhead proportional to the number of sampled instructions (i.e., injected faults). This makes GPU-TRIDENT much more efficient for large programs than FI. For example, for 5000 faults, GPU-TRIDENT is 55.6 times faster than FI. FI takes on average around 4 CPU hours for 5000 sampled instructions, while GPU-TRIDENT takes approximately 6 minutes across benchmarks. This difference is even larger for more complex applications. For example, FI takes 22.7 hours for the Circuit benchmark, while GPU-TRIDENT takes just 8 minutes. On average, GPU-TRIDENT is about two orders of magnitude faster than FI across the benchmarks.

• Using GPU-TRIDENT to guide selective instruction duplication reduces the SDC probability of kernels by approximately 58% and 85% (at 1/3rd and 2/3rd the performance overhead of full duplication, respectively), which is comparable to the results obtained using FI.

• Using GPU-TRIDENT to study the input-dependence of the resilience characteristics of a GPU kernel reveals that the SDC probability of kernels is mostly insensitive to changes in the input provided to the applications.

Our work thus provides an accurate, efficient, and scalable tool for characterizing the resilience profiles of GPU kernels. In addition, we demonstrate the practical use of the developed tool to guide selective instruction duplication with a very small performance overhead. Finally, we exploit the low resource usage of our proposed tool to study the variation in resilience properties of GPU applications, which would have been highly resource-intensive if traditional FI methodologies were used.

The rest of the thesis is organized as follows.
We first give a brief background on the problem and the tools used in Chapter 2, and then provide a brief overview of related work in Chapter 3. We identify the main challenges in Chapter 4 and present our approach for handling them in Chapter 5. We evaluate our proposed techniques in Chapter 6 and perform two case studies on the resilience of GPU kernels using GPU-TRIDENT in Chapter 7. Finally, we conclude the thesis and discuss avenues of future research in Chapter 8.

Chapter 2

Background

In this chapter, we first present a brief primer on GPU architecture, then define the terms we use, followed by our fault model and the infrastructure we use for FI experiments and analysis. Finally, we provide a brief overview of the TRIDENT technique [35], as it forms the basis of GPU-TRIDENT.

2.1 GPU Architecture and Programming Model

We focus on GPU applications that are implemented on the NVIDIA Compute Unified Device Architecture (CUDA), a widely adopted programming model and toolset for GPUs. In the CUDA programming model, a GPU application consists of a control program running on the CPU and a computation program called the kernel that runs in parallel on the GPU(s), in the form of multiple threads. CUDA kernels adopt the single instruction multiple thread (SIMT) execution model to exploit the massive parallelism of GPUs. From a software perspective, the CUDA programming model abstracts the SIMT model into a hierarchy of kernels, blocks, and threads: CUDA kernels consist of blocks, which consist of threads. This hierarchy allows various levels of parallelism, such as fine-grained data parallelism, coarse-grained data parallelism, and task parallelism. From a hardware perspective, blocks of threads run on streaming multiprocessors (SMs) with on-chip shared memory for threads in the same block. Within a block, threads are launched in fixed groups of 32 threads called warps. Threads in a warp execute the same sequence of instructions but with different data values.

Each GPU has its own memory space that is distinct from the host CPU's memory. In the CUDA programming model, there are various kinds of memory allocations and accesses: (1) global, (2) constant, (3) texture, (4) shared, and (5) thread-local. Global, constant, and texture memory accesses that miss in the on-chip caches are loaded from the large and comparatively slow device memory. The shared memory space is software-managed. It is much smaller and built on chip, and is hence much faster to access. Thread-local memory is typically stored in the fast register file, though compiler-inserted spill and fill operations occasionally place thread-local state in slower areas of the storage hierarchy.

The CUDA programming model allows (1) sharing data among a subset of threads in a thread block (e.g., warp-shuffle), (2) sharing data across all the threads in a thread block using the shared-memory scratchpad, (3) sharing data across the threads in a compute kernel using device global memory, and (4) sharing across devices using unified virtual memory (UVM) [26]. This gives rise to two types of memory dependencies between instructions:

1. Intra-thread memory dependency: static loads and stores of the same thread are dependent on each other because they access the same memory location.

2. Inter-thread memory dependency: static loads and stores in different threads are dependent on each other because the same memory location is accessed in different threads (illustrated by the sketch below).
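To make the two dependency types concrete, the following sketch matches load records against the most recent store to the same address in a hypothetical per-access trace; a match within the same thread is an intra-thread dependency, and a match across threads is an inter-thread dependency. The trace format and field names (thread_id, instr_id, address) are illustrative assumptions for this sketch, not the actual data layout used by GPU-TRIDENT.

    from typing import NamedTuple

    class Access(NamedTuple):
        thread_id: int   # global thread index
        instr_id: int    # static load/store instruction index
        is_store: bool
        address: int

    def classify_dependencies(trace):
        """Match each load to the most recent store to the same address,
        and label the (store, load) pair intra- or inter-thread."""
        last_store = {}            # address -> most recent store Access
        intra, inter = set(), set()
        for acc in trace:          # trace is ordered by execution time
            if acc.is_store:
                last_store[acc.address] = acc
            elif acc.address in last_store:
                st = last_store[acc.address]
                dep = (st.instr_id, acc.instr_id)
                (intra if st.thread_id == acc.thread_id else inter).add(dep)
        return intra, inter

    # Example: thread 0 stores to 0x10; thread 1 later loads 0x10 (inter-thread).
    trace = [Access(0, 5, True, 0x10), Access(1, 9, False, 0x10)]
    print(classify_dependencies(trace))   # (set(), {(5, 9)})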
2.2 Fault Model

In this thesis, we consider transient hardware faults that occur in the computational elements of the GPU, including architectural registers and functional units, and affect the program's execution. We assume these faults manifest as a single bit flip. Many studies [9, 10, 25, 49] have shown that there is little difference between the SDC probability of single and multiple bit flips. Moreover, previous work in this area [27, 33, 41, 55] also uses the single-bit-flip model. We do not consider faults in the GPU's control logic, nor do we consider faults in the instructions' encoding. We also do not consider faults in the memory or caches, as we assume that these are protected with error correction codes (ECC) - this is the case for most modern GPUs used in HPC applications [6]. However, an error can propagate to memory if an erroneous value is written by a store instruction, causing subsequent loads of that location to be faulty (such faults are considered).

Finally, we assume that the program does not jump to arbitrary, illegal addresses due to faults during the execution, as this can be detected by control-flow checking techniques [43]. However, the program may take a faulty legal branch (the execution path is legal, but the branch direction is wrong due to faults propagating to the branch condition). Our fault model is in line with other work in the area [13, 18, 20-22, 29].

2.3 Terms and Definitions

Fault Occurrence: The event corresponding to the occurrence of a hardware fault in the processor. The fault may or may not result in an error.

Fault Activation: The event corresponding to the software manifestation of the fault, i.e., the fault becomes an error and corrupts some software state (e.g., a register or memory location). The error may or may not result in a failure.

Crash: The raising of a hardware trap or exception due to an error (e.g., a read outside the program's memory segments). The program is terminated by the operating system as a result.

Silent Data Corruption (SDC): A mismatch between the output of a faulty program run and that of an error-free execution of the program, without any exception being thrown.

Benign Faults: The program output matches that of the error-free execution even though a fault occurred, because the fault was masked or was overwritten by the program.

Error Propagation: The fault was activated and has affected some other portion of the program state, say X; we say the fault has propagated to state X. We focus on faults that affect the program state, and therefore consider error propagation at the application level.

SDC Probability: The probability that the program had an SDC given that the fault was activated; other work uses a similar definition [18, 23, 45].

2.4 LLVM Compiler

We use the LLVM compiler [30] to perform FI experiments and to carry out the program analysis needed to implement our model. LLVM uses a typed intermediate representation (IR) that represents source-level constructs. Moreover, it maintains variable and function names in the IR, which makes source mapping feasible, enabling us to map IR to source code and perform a fine-grained analysis of error propagation in the program. However, our methodology is not tied to LLVM: any platform that allows us to statically analyze programs and correlate the analysis with FI results will suffice.
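As a minimal illustration of the single-bit-flip fault model used in these FI experiments, the sketch below flips one uniformly chosen bit in a 32-bit value, the way an IR-level injector conceptually corrupts the result of one dynamic instruction. This is a simplified stand-in for illustration only, not the actual LLFI-GPU implementation.

    import random

    def flip_one_bit(value: int, width: int = 32) -> int:
        """Return `value` with one uniformly chosen bit inverted,
        emulating a transient single-bit flip in a register."""
        bit = random.randrange(width)
        return (value ^ (1 << bit)) & ((1 << width) - 1)

    # Example: corrupt the result of one dynamic instruction.
    golden = 0x000000FF
    faulty = flip_one_bit(golden)
    print(hex(golden), "->", hex(faulty))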
2.5 TRIDENT Framework

TRIDENT is an open-source framework that analytically models error propagation at the instruction, control-flow, and memory levels to derive an estimate of the overall SDC probability, and the SDC probability of each instruction, for a given CPU program [35]. However, TRIDENT does not model error propagation in multi-threaded programs, nor other GPU-specific structures. As we show in Chapter 4, modeling these is critical for GPU programs. Our technique is built on top of the TRIDENT framework but extends it significantly.

Workflow: TRIDENT takes three inputs: (1) the LLVM Intermediate Representation (IR) of the program, (2) the program output instructions (i.e., instructions whose results determine whether an SDC occurs), and (3) an input for the program [35]. It decomposes the error propagation in the dynamic execution of an application into a combination of probabilistic events. These probabilities can be calculated by plugging data profiled from a dynamic execution (i.e., instruction operands, branch directions, etc.) into certain mathematical equations. Using this information, TRIDENT calculates the SDC probabilities of static instructions and aggregates them to derive the overall SDC probability of the program.

TRIDENT consists of two phases: (1) Profiling, in which TRIDENT performs static and dynamic analysis of the program, collecting information such as instruction counts and data dependencies; and (2) Inferencing, in which TRIDENT uses the profiled information to track error propagation and calculate SDC probabilities.

Figure 2.1: Workflow of TRIDENT [35]. (Inputs: the program source code in LLVM IR, the program input, and the instructions considered as program output. Outputs: the overall SDC probability of the program and the SDC probabilities of individual instructions.)

Figure 2.2: Working of TRIDENT [35]. (a) Code example for TRIDENT, with propagation probabilities. (b) A memory access trace and the corresponding memory dependency graph.

Example: Figure 2.2a shows how TRIDENT works, using an example from the CPU version of the Pathfinder benchmark. The instructions in the example are indexed in each basic block. As TRIDENT works at the LLVM Intermediate Representation (IR) level, these instructions correspond to LLVM instructions [30]. TRIDENT consists of three sub-models that work together to calculate the error propagation probability from a given location of error occurrence to the program output instruction.

2.5.1 Static-Instruction Sub-Model (fs)

This sub-model computes the propagation probability of an error in a static data-dependent instruction sequence (which often appears in the same basic block). For example, if an error occurs at Index 1 in Figure 2.2a, fs calculates the overall propagation probability for the error from Index 1 to Index 4. Each instruction is assigned an individual error propagation probability from that instruction to the next data-dependent instruction in the same basic block. This probability is derived by analyzing the instruction and profiling the values of its operands. fs computes the overall propagation probability by aggregating the individual probabilities of the data-dependent instructions in the basic block.
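As a rough sketch of this aggregation, under the simplifying assumption that per-instruction propagation events along the chain are independent, the probability along a data-dependent sequence can be computed as the product of the per-instruction probabilities. The numbers below are made up for illustration and are not profiled values; the exact aggregation used by TRIDENT may be more elaborate.

    from functools import reduce

    def chain_propagation_probability(per_instr_probs):
        """Aggregate the propagation probability along a static
        data-dependent instruction sequence by multiplying the
        per-instruction probabilities (simplified reading of f_s)."""
        return reduce(lambda acc, p: acc * p, per_instr_probs, 1.0)

    # Hypothetical per-instruction probabilities for a three-step chain.
    print(chain_propagation_probability([0.8, 1.0, 0.7]))  # ~0.56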
2.5.2 Control-Flow Sub-Model (fc)

This sub-model computes the probabilities of store instructions getting corrupted due to the corruption of branch instructions, which results in control-flow divergence from a fault-free run. For example, if the error propagates to Index 4, which is a branch instruction, it may modify the direction of the branch. Say that branch direction F of Index 4 should be taken in a fault-free execution, so the store instruction (Index 5) is supposed to be executed. Due to the error in Index 4, the branch direction can be modified to T. Consequently, the store instruction (Index 5) is not executed correctly, and hence we say that Index 5 is corrupted. In this case, the error propagates into memory via the store instruction. TRIDENT identifies all stores that are dominated by the corrupted branch (Index 4 in the running example) and computes the probabilities of them getting corrupted.

2.5.3 Memory Sub-Model (fm)

The memory sub-model analyzes error propagation in the kernel via memory dependencies. Continuing with our running example, if Index 5 is corrupted, the erroneous value stored by the instruction may be loaded later by the load instruction (Index 6) in bb18. Hence, the error propagates via a memory dependency. To figure out the memory dependencies between all loads and stores, TRIDENT profiles a memory execution trace of the program. The trace (Figure 2.2b) contains all the executed store and load instructions at runtime, ordered by execution time. Each record in the trace contains the type of operation (load/store), the index of the static instruction, and the address accessed by the instruction. TRIDENT leverages the memory execution trace to track error propagation in fm by constructing a memory dependency graph. The graph contains the memory dependencies between the static load and store instructions, and it is weighted based on the dynamic instruction counts of these instructions. An example of the graph is shown in Figure 2.2b. A node represents an instruction (load or store), and the edges indicate possible dependencies. The weights (derived from the dynamic instruction counts) are marked beside each edge. This graph is used during execution to track error propagation due to memory dependencies.

Chapter 3

Related Work

We classify related work into four broad categories. The first category includes work related to characterizing the resilience of GPU applications; it presents an overview of FI tools as well as analytical frameworks developed for GPU applications. The second category summarizes the body of work related to modeling error propagation at the application level. Next, we review techniques suggested to overcome bottlenecks in studying the resilience characteristics of GPU applications, as this is one of the core contributions of this thesis. Finally, we give an overview of work related to the input-dependence of error resilience.

3.1 GPU Error Resilience

A significant amount of research has gone towards finding the resilience of GPU programs through FI. Yim et al. [56] developed one of the first FI tools for GPU applications to explore efficient error detectors in GPU programs. Fang et al. [16] developed GPU-Qin, a FI tool that operates at the CUDA assembly code level using CUDA-gdb. Subsequently, Hari et al. [24] developed SASSIFI, which injects faults at the SASS assembly code level using SASSI, NVIDIA's compiler-based instrumentation tool.
Li et al. [33] design LLFI-GPU, which operates at the LLVM IR level, and use it to investigate error propagation in GPU kernels. The main advantage of LLFI-GPU is that it is independent of the specific GPU platform or architecture, as long as the compiler provides support for the platform. LLFI-GPU is used as the FI baseline in this thesis. Unfortunately, FI requires significant time and resources to evaluate GPU applications. Hence, it is difficult to integrate FI into the development cycles of GPU programs.

Tan et al. [53] propose an analytical framework to estimate the vulnerability of GPU micro-architectures. However, their technique does not consider the characteristics of the GPU program, as it is micro-architecture dependent. Further, they do not distinguish between crashes and SDCs in their analysis; hence their technique cannot be used to mitigate SDCs. Wei et al. [55] study the approximation properties of soft errors and use them to guide approximate instruction duplication. However, this requires thousands of FIs and intimate knowledge of the application's domain and functionality.

3.2 Modeling Error Propagation

Error propagation models are widely employed to estimate the resilience of CPU programs. Compared to FI, the faster execution of modeling techniques makes them ideal for incorporation into the software development cycle. SymPLFIED [46] uses model checking to systematically enumerate all error propagation paths in programs. Unfortunately, this technique does not scale due to state-space explosion. Shoestring [18] uses compile-time analysis and symptom-based error detection techniques to model error propagation and identify vulnerable instructions. Sridharan et al. [51] introduced an analytical model, the program vulnerability factor (PVF), to capture the masking properties of the program with respect to architecture-level faults. However, they do not distinguish between SDCs and crashes, leading to a loss in accuracy and limited visibility into program resilience. Fang et al. [17] introduce ePVF, which adds the ability to differentiate between crashes and SDCs in PVF analysis. However, none of these techniques model control-flow divergence or memory dependencies, and their accuracy suffers as a result. Further, none of them are targeted towards GPUs. TRIDENT [32, 35] analytically models error propagation using static and dynamic analysis. While this technique is accurate and relatively fast, its methodology only applies to single-threaded CPU programs and does not extend to GPU programs. Guo et al. [21] identify naturally resilient code patterns in HPC applications but do not quantify the program's resilience.

3.3 Pruning of FI Space

Pruning the FI space based on similar execution patterns has been proposed to speed up FI. Hari et al. [22, 23] propose techniques to prune the FI space of CPU applications based on similar error propagation patterns in programs, reducing FI time.

There are two techniques for pruning the FI space of GPU programs that share the same high-level goal as this thesis, namely to estimate the resilience of GPU programs with either limited or no FI. Nie et al. [41] propose heuristics to prune the FI space by identifying representative states in GPU programs. While useful, their method still requires thousands of FI trials to obtain accurate SDC estimates. Running FI experiments on GPUs is very resource-intensive, as a GPU program has a huge FI space to explore among millions of threads.
Further, unlike in CPUs, the FI process itself is challenging to parallelize on GPUs, as only one GPU process is allowed per GPU card. Finally, the heuristics they propose are specific to the pruning of FI sites, and so they cannot be used in modeling error propagation, which is our goal. Kalra et al. [27] use statistical models to characterize program resilience using micro-architecture-agnostic features of GPU programs. The main advantage is that the features are independent of the GPU micro-architecture, and hence can be used on different architectures without requiring any fault injections in the prediction phase. However, their technique relies on a large, representative FI corpus in the training phase, which is used to learn the characteristics of SDCs; it is challenging to obtain sufficient, representative training data, which is critical for achieving high accuracy. Moreover, both techniques only predict the vulnerability of the overall GPU kernel, and not that of individual instructions, which is required for selective protection approaches [18, 29, 48].

3.4 Resilience Analysis Across Multiple Inputs

Czek et al. [14] performed one of the first studies to model the variance in failure rates across multiple inputs for CPU applications. Leo et al. [15] study the variance of failure modes for different inputs. Mahmoud et al. [37] deploy software testing techniques like test case minimization to reduce the inputs required to obtain a representative reliability profile of CPU applications. Li et al. [32] study the variation of SDCs across multiple inputs and present vTrident, which bounds the SDC probability of applications across multiple inputs while requiring FI for only one input. However, all of these techniques are designed for CPU applications.

There is limited work on studying the reliability of GPU applications across multiple inputs. Previlon et al. [47] study the impact of execution parameters (specifically input data and thread-block size) on the reliability of GPU applications. They perform extensive FI for this analysis and find that the vulnerability of the studied applications remains unchanged across standard inputs. However, this study only looks at 6 benchmarks, and the number of threads launched is minimal (a maximum of 1024 threads), which is not representative, as real-world applications can launch millions of threads.

3.5 Summary

The increasing prevalence of transient hardware faults in GPUs has resulted in a large body of research on understanding the error resilience of GPUs. Random fault injection [16, 24, 33, 56] has traditionally been used for this purpose, but it is quite resource- and time-intensive, and is hence not suitable for exhaustive studies. Prior efforts have proposed reducing the overhead of FI techniques by either pruning the FI space [27, 41] or modeling error propagation [17, 18, 51], but these techniques are either limited in accuracy or only give a coarse-grained view of the reliability of GPU applications. This thesis analyzes the behavior of GPU applications and presents an efficient and scalable solution for analyzing the instruction-level reliability of GPU applications. Work analyzing the reliability of GPU applications across multiple inputs is quite sparse, and the proposed techniques deploy FI [47], which has a high performance overhead.
The techniques proposed in this thesis can be used to characterize the reliability of GPU applications across multiple inputs efficiently.

Chapter 4

Challenges

The goal of this thesis is to create a fast and accurate model for tracking error propagation in GPU applications, and hence predict the SDC probabilities of the overall kernel as well as of individual instructions. In this chapter, we discuss the challenges in modeling error propagation in GPU programs. These challenges broadly relate to the large amount of data to be profiled due to the massive parallelism of GPU applications, error propagation across threads, and inaccuracies introduced by data patterns specific to GPUs.

4.1 Overview

As described in Section 2.1, GPU kernels are massively parallel, and many kernels share data across threads. Therefore, errors may propagate not only within each thread but also between threads. Because GPU programs have a large number of threads (e.g., the Circuit application [1] has 4.25 million threads), tracking error propagation via memory dependencies incurs high overheads due to the large amount of memory access information that must be profiled. Moreover, even small inaccuracies in modeling the error propagation of individual threads are exacerbated when aggregating the kernel's overall SDC probability, due to the large number of threads.

Figure 4.1 shows a code example of the cudaSolve kernel from the Circuit benchmark. This program solves a 2D circuit grid in parallel using the Jacobi method. For ease of explanation, we have slightly modified the code and removed the unnecessary parts.

Figure 4.1: A slightly modified code segment from the Circuit benchmark [1].

In the example, a subset of shared memory (line 4) is initialized in each thread (line 6). After all the threads have completed the initialization (line 8), the data in the shared memory is read by adjacent threads (line 10) for computation. This way, data can be efficiently transferred between threads via fast on-chip shared memory. However, this can also result in error propagation from one thread to another, and eventually to the entire thread block and the output of the kernel. Li et al. [33] have previously shown that a single fault can lead to high contamination (up to ~60%) of memory states in GPU applications, due to error propagation across threads and data flow between global and shared memory.

We identify the following three challenges in modeling error propagation for GPU programs.

4.2 Large Amount of Profiling Data

To track error propagation via memory dependencies, one needs to profile all memory accesses in the kernel and identify the data dependencies between each load and store. This can be done using dynamic analysis, matching the runtime memory addresses of each load and store. However, a typical GPU kernel consists of hundreds or thousands of threads, each of which may interleave dependencies with other threads in its thread block. Consequently, a massive amount of memory access information needs to be collected during the profiling phase. For example, Lulesh, a typical HPC application [28], generates a memory execution trace larger than 1 TB even for medium-sized inputs. Processing the trace, i.e., loading it into memory and traversing it (Section 2.5.3), incurs significant overheads, which makes it impractical to use TRIDENT for modeling GPU HPC benchmarks. It is not possible to bypass this memory bottleneck by applying TRIDENT to individual threads, as the time for modeling thousands of threads runs into days or even months - see Table 6.1. (We obtained these numbers by multiplying the time taken by TRIDENT for a single thread of the program by the number of threads in the kernel.)
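A back-of-envelope estimate shows why such traces grow to this scale. The record size and access counts below are assumptions chosen only to illustrate the scaling, not measurements from this thesis.

    def trace_size_bytes(num_threads, mem_accesses_per_thread, bytes_per_record):
        """Rough size of a full memory-access trace: one record per
        dynamic load/store in every thread (illustrative estimate only)."""
        return num_threads * mem_accesses_per_thread * bytes_per_record

    # Assumed values: 500K threads, 100K dynamic loads/stores per thread,
    # ~20 bytes per record (operation type, static index, 64-bit address, thread id).
    size = trace_size_bytes(500_000, 100_000, 20)
    print(f"{size / 1e12:.1f} TB")  # ~1.0 TB for this hypothetical kernel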
4.3 Large Number of Threads

Tracing error propagation between threads in a GPU kernel requires a fine-grained error propagation model for every thread in the kernel, which potentially incurs a large modeling overhead. For example, if an error occurs in a thread, it first propagates within that thread. Later, the error may propagate to another thread in its thread block via a shared memory dependency. Hence, the error continues to propagate in both threads. Finally, the error may propagate to some other threads or back to the original thread, depending on the program's execution. To track the error, one must model all the threads in the program and trace the propagation in a lock-step fashion.

A natural question that arises is: can we ignore error propagation between threads in GPU programs? To answer this question, we modify GPU-TRIDENT to stop tracing the error propagation when any shared memory access is encountered. We show the results in Figure 4.2 for the benchmarks that use shared memory in our evaluation (the experimental setup and benchmarks are described in Section 6.1). As can be seen, the SDC probabilities predicted by the modified GPU-TRIDENT are significantly lower than the FI ground truth (mean absolute difference of 17.2%). The only exception is LUD K3, which uses shared memory sparingly (only two stores per thread go to shared memory). This result indicates that we cannot ignore inter-thread error propagation in GPU programs.

Figure 4.2: Prediction of SDC probability by a modified GPU-TRIDENT that does not model error propagation across threads, for benchmarks that use shared memory; NW K1 means the first kernel of the NW benchmark.

4.4 Accumulated Inaccuracy From Individual Threads

The overall SDC probability of a GPU kernel is the aggregation of the SDC probabilities of the individual threads. Because GPU programs consist of hundreds or thousands of threads, if the model is even slightly inaccurate for each thread, the overall SDC probability of the kernel will be inaccurate.

4.5 Summary

In this chapter, we analyzed GPU applications in the context of applying the TRIDENT methodology to them. We find that TRIDENT cannot be applied as-is to GPU kernels, as the large amount of data associated with multi-threaded GPU applications poses a huge bottleneck. Moreover, inter-thread error propagation due to interleaving memory dependencies among threads renders the fm sub-model of TRIDENT inapplicable to GPUs, as it was designed for single-threaded CPU applications. Finally, we observe that minor inaccuracies introduced while modeling error propagation in GPU kernels can result in a highly inaccurate final model.

Chapter 5

Approach

In Chapter 4, we identified challenges that complicate the efficient modeling of error propagation in GPU applications. We observed that the challenges relate to the large amount of data to be profiled, inter-thread error propagation, and inaccuracies due to biased data patterns.

In this chapter, we propose heuristics to overcome these challenges and explain the rationale behind them.
We first propose two heuristics to select a subset of threads in a GPU kernel from which to extract its intra-thread memory dependencies (Section 5.1). We then propose a third heuristic to select a new subset of threads to construct the inter-thread memory dependencies (Section 5.2). Finally, we identify two types of value-based masking as major sources of inaccuracy when applying TRIDENT to GPU applications, and propose heuristics to mitigate them (Section 5.3).

5.1 Profiling for Intra-Thread Memory Dependency (H-Intra)

The goal of this step is to construct the memory dependency graph for a given GPU kernel without profiling all the memory accesses in all the threads. Recall that the memory dependency graph in TRIDENT is required to establish the data dependencies between static store and load instructions (Section 2.5.3). This problem has been studied in the context of performance improvements in CPUs at the architecture level [12], but our goal is to extract the memory dependencies between instructions at the application level, as GPU-TRIDENT models error propagation at that level. It is possible to profile all memory addresses used by load and store instructions, and then match the addresses to build the graph. However, this can be extremely time-consuming in massively parallel GPU programs. Instead, we profile only a small subset of threads, chosen based on control-flow similarity, and match the loads and stores in those carefully selected threads to build the memory dependency graph. We call these intra-dep threads, as they contain all the possible intra-thread dependencies between static load and store instructions (for that execution). The main algorithm for choosing intra-dep threads is shown in Algorithm 1. Note that in this section, we group threads using the entire control-flow profile of each thread, rather than the number of instructions executed per thread used in prior work [41], as we found the former to be more accurate.

Algorithm 1: Selecting intra-dep threads.
Input: CF_T: control flow of all threads
Output: TP: set of threads to profile
    TP = {}
    for all threads T do
        if conditional branches in loop then
            // Remove iterations with non-unique memory accesses
            CF_sim = Remove_redundant(CF_T)
        else
            CF_sim = CF_T
        // Profile thread if simplified control-flow is unique
        if CF_sim not in TP then
            TP.insert(T)
    end

5.1.1 Selection Based on Thread Execution Path (H-Intra-Path)

We first profile the execution path of each thread and group the threads that have identical execution paths. We then choose a single thread from each group to be part of the intra-dep threads. The intuition is that threads with identical execution paths have the same static dependencies between loads and stores, because only control-flow divergence can cause divergence in the memory dependencies. Table 5.1 shows the number of threads invoked in each kernel of the benchmark applications (Section 6.1 has more details), and the number of intra-dep threads resulting from the grouping. As can be seen, most of the kernels have only a small number of intra-dep threads. The two exceptions are Pathfinder and NW, as they have conditional branches inside loops; hence their loops execute a different number of times in different threads, which results in different control-flow paths.

Kernel           | Total threads | After H-Intra-Path | After H-Intra-Loop | After H-Inter
NW K1            | 48            | 46                 | 25                 | 48
NW K2            | 16            | 16                 | 13                 | 16
Pathfinder       | 592,640       | 563,606            | 15                 | 3,840
BFS K1           | 32,768        | 781                | 8                  | -
BFS K2           | 32,768        | 2                  | 2                  | -
Gaussian         | 3,840         | 4                  | 4                  | -
HotSpot          | 473,344       | 250                | 234                | 35,328
Particlefilter   | 9,216         | 38                 | 7                  | -
LUD K1           | 64            | 16                 | 16                 | 64
LUD K2           | 192           | 2                  | 2                  | 64
LUD K3           | 3,584         | 1                  | 1                  | 256
CudaBenchMarking | 8,192,000     | 1                  | 1                  | -
SRAD K1          | 32,768        | 144                | 144                | 16,384
SRAD K2          | 32,768        | 16                 | 16                 | 4,096
Circuit          | 4,156,416     | 18                 | 18                 | 2,592
Lulesh           | 66,816        | 2                  | 2                  | -
Blur             | 236,544       | 7                  | 7                  | -

Table 5.1: Thread group details for the studied kernels
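A minimal sketch of the path-based grouping step (H-Intra-Path) is shown below: threads are keyed by their full control-flow signature, and one representative per group is kept. The signature representation (a tuple of taken-branch records) is an assumption made for illustration; Algorithm 1 above additionally canonicalizes loop iterations, which this sketch omits.

    def select_intra_dep_threads(control_flow_by_thread):
        """Group threads by identical execution paths and keep one
        representative per group (H-Intra-Path, simplified sketch).

        control_flow_by_thread: dict mapping thread id -> sequence of
        (branch instruction index, taken direction) records.
        """
        representatives = {}                       # signature -> thread id
        for tid, path in control_flow_by_thread.items():
            representatives.setdefault(tuple(path), tid)
        return sorted(representatives.values())

    # Hypothetical example: threads 0 and 2 share a path, thread 1 diverges.
    paths = {
        0: [(4, True), (9, False)],
        1: [(4, False)],
        2: [(4, True), (9, False)],
    }
    print(select_intra_dep_threads(paths))  # [0, 1]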
5.1.2 Selection Based on Loop Patterns (H-Intra-Loop)

To further reduce the number of intra-dep threads, we pick threads that have unique loop patterns. We use a code example from the Pathfinder benchmark to explain this heuristic. Figure 5.1a shows a kernel that has three conditional branches inside a loop. Pathfinder has a high number of intra-dep threads, as different threads take different branches in each iteration, and hence execute different numbers of iterations, resulting in diverging execution paths. We profile the execution paths inside each loop iteration and list a subset of them in Figure 5.1b. Numbers in the figure represent the comparison instruction, while T and F represent the branch direction taken. We find that many iterations contain repeated branch patterns, which yield the same memory dependencies during profiling. As seen from the figure, the first three iterations in Thread 1 and Thread 3 are identical; therefore, we only need to profile one of them to obtain the corresponding memory dependencies. Similarly, the order of the repeated patterns within a thread does not matter, as it yields the same dependencies. For example, Thread 3 and Thread 4 follow the same branch patterns but in a different order of iterations, yet yield the same memory dependencies. Therefore, we only need to profile one of the two threads.

Figure 5.1: Conditional branches inside a loop in Pathfinder. (a) Code snippet of Pathfinder. (b) Control flow traces of threads.

Table 5.1 also shows the number of intra-dep threads after applying the above loop selection heuristic. As seen, Pathfinder and BFS K1 see a significant reduction in the number of threads, but the other kernels do not, as their threads do not have conditional branches with divergent control-flow within loops.

5.2 Profiling for Inter-Thread Memory Dependencies (H-Inter)

Intra-thread memory dependencies can be established based on the thread selection method in the previous section. However, inter-thread dependencies may still exist between threads with differing control-flows, and we need to model these.

As before, we propose a heuristic to reduce the number of loads and stores that must be profiled to capture the inter-thread dependencies. Our heuristic is to first select the thread blocks that each of the intra-dep threads belongs to, and then profile the shared memory accesses in each of those thread blocks. This is based on the observation that there is often a high ratio of threads to thread blocks, as well-designed GPU kernels attempt to avoid control-flow divergence within each thread block to avoid stalls. Hence, we find that selecting thread blocks based on unique control-flow patterns (i.e., thread blocks containing intra-dep threads) captures most of the unique shared memory access patterns (see Table 5.2).
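The strength of this heuristic hinges on how uniform control flow is within a thread block. One way to quantify this, consistent with the worked percentages discussed below but an illustrative formulation of our own, is to score each block by how many of its threads share the block's most common execution path (so a fully uniform block scores 100% and a fully divergent pair scores 0%), and then average across blocks.

    from collections import Counter

    def block_cf_similarity(paths_in_block):
        """Control-flow similarity of one thread block, normalized so that
        a fully divergent pair scores 0 and a fully uniform block scores 1
        (one possible reading of the metric used in the text)."""
        n = len(paths_in_block)
        if n < 2:
            return 1.0
        most_common = Counter(map(tuple, paths_in_block)).most_common(1)[0][1]
        return (most_common - 1) / (n - 1)

    def average_cf_similarity(blocks):
        """Average the per-block similarity across all thread blocks."""
        return sum(block_cf_similarity(b) for b in blocks) / len(blocks)

    # Hypothetical kernel: three 2-thread blocks; the middle block diverges.
    blocks = [
        [["A", "B"], ["A", "B"]],   # identical paths -> 1.0
        [["A", "B"], ["A", "C"]],   # divergent paths -> 0.0
        [["A", "B"], ["A", "B"]],   # identical paths -> 1.0
    ]
    print(f"{average_cf_similarity(blocks):.2%}")  # 66.67%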
This is based on the observationthere is often a high ratio of threads to thread blocks, as well-designed GPU kernelsattempt to avoid control-flow divergence within each thread block to avoid stalls.Hence, we find that selecting thread blocks based on unique control-flow patterns(i.e., thread blocks containing intra-dep threads) will capture most of the uniqueshared memory access patterns (see Table 5.2).Figure 5.2 shows the executions of two kernels as examples. Each row repre-sents a possible execution of an instruction, while each column indicates a thread.For simplicity, we only show instructions accessing shared memory. The value in-side each square is the memory address that the instruction uses. Having the sameaddress means that the load and the store have an inter-thread dependency, whichwe need to identify.In Figure 5.2a, there are 3 thread blocks (TBs) as shown. Each TB containstwo threads. T1, T2, T5, and T6 have the same execution path, whereas T3 andT4 follow a different execution path. Note that the control-flow paths of threadsare identical within each thread block in this example. T1 and T3 are selected asintra-dep threads based on their execution paths (Section 5.1). According to ourheuristic, we profile all the shared memory accesses in the thread blocks that T1and T3 belong to. In other words, we profile the shared memory in T1, T2, T3, and26T4. This way, we establish 2 inter-thread memory dependencies, namely (Index 5and Index 9), and (Index 7 and Index 9), based on the addresses. Since the ratio ofcontrol-flow similarity across thread blocks is high (100%), our heuristic identifiesall the inter-thread dependencies.The example in Figure 5.2b also has 6 threads in 3 thread blocks. Threadsin TB1 follow the same execution path, as do the threads in TB3. However,threads in TB2 (i.e., T3 and T4) follow different paths. In this case, the ratio ofthe similarity of the control-flow of threads inside thread blocks is only 66.67%((100%+0%+100%)/3). Since threads T1 and T4 are selected as intra-dep threads,all the threads in TB1 and TB2 are considered for profiling, and the inter-threadmemory dependencies (Index 5 and Index 9), and (Index 5 and Index 11) are iden-tified. However, since the control-flow of T4 is different from T3 in TB2, the se-lection of T4 does not include the profiling of the threads in TB3, thereby missinganother inter-thread dependency (Index 7 and Index 11) in T5 and T6.Therefore, our heuristic achieves high coverage only when the control-flowsimilarity ratio is high in a thread block. So we ask the question what are theratios of control-flow similarity in thread blocks in GPU programs? Table 5.2shows the results, for the kernels that have inter-thread memory dependencies. Onaverage, about 75% of the threads in a thread block exhibit control-flow similarity,with the highest being 100% in LUD K3, and the lowest being 0% in LUD K1.Therefore, our proposed heuristic (H-inter) can identify most of the inter-threadmemory dependencies in practice.Kernel CF Similarity Kernel CF SimilarityNW K1 54.16% NW K2 25%HotSpot 98.31% LUD K1 0%LUD K2 70.32% LUD K3 100%Circuit 99.35% Pathfinder 96.61%SRAD K1 98.28% SRAD K2 99.6%Table 5.2: Average Control-flow(CF) similarity percentage of threads in athread blockTable 5.1 shows the number of threads chosen for profiling inter-thread mem-ory dependency (inter-dep threads). On average, we only select ~40% of the totalthreads as inter-dep threads. 
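The control-flow similarity reported in Table 5.2 can be approximated with a short sketch. The per-block metric used below (the share of the remaining threads that follow the block's most common path) is an assumption chosen so that it reproduces the (100% + 0% + 100%)/3 example above; the exact definition used by GPU-TRIDENT may differ.

def block_cf_similarity(block_traces):
    """Control-flow similarity of one thread block.

    block_traces is the list of per-thread path signatures for the block.
    Assumed metric: share of the remaining threads that follow the block's
    most common path (1.0 when all paths match, 0.0 when the two threads
    of a block diverge).
    """
    n = len(block_traces)
    if n <= 1:
        return 1.0
    most_common = max(set(block_traces), key=block_traces.count)
    matching = block_traces.count(most_common)
    return (matching - 1) / (n - 1)

# The Figure 5.2b scenario: TB1 and TB3 are uniform, TB2 diverges.
blocks = [[("p1",), ("p1",)], [("p1",), ("p2",)], [("p2",), ("p2",)]]
ratios = [block_cf_similarity(b) for b in blocks]
print(sum(ratios) / len(ratios))   # -> 0.666... (i.e., 66.67%)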
From Table 5.1, it can be seen that kernels with a lower number of total threads do not show much reduction (0% for NW K1 and NW K2), while kernels with a large number of threads show a much higher reduction in the number of threads chosen (~99.3% for Circuit). Note that only load and store instructions to shared memory are profiled for these threads. The algorithm for choosing inter-dep threads is in Algorithm 2.

Algorithm 2: Selecting inter-dep threads.
Input : TP: Threads from H-Intra
Output: TPS: Threads to profile
1 TPS = {}
2 TBP = {}
3 for all threads T in TP do
4     tb = find_threadblock(T)
5     if tb not in TBP then
6         TBP.insert(tb)
7         // Profile all threads of this thread block
8         TPS.insert(all_threads(tb))
9 end

We construct the memory dependency graph for a given GPU kernel by applying the above heuristic. We then follow the method of TRIDENT to weight the edges of the graph based on the dynamic instruction count of each static load and store, and then trace error propagation using the graph. Note that fm no longer needs the memory addresses (i.e., the memory execution trace) for extracting the dependency. Instead, it reads the dependency directly from the memory dependency graph that we created for this purpose, and aggregates the propagation probabilities based on the edge weights (Section 2.5.3).

5.3 Value-Based Masking (H-Value)

As mentioned, it is imperative to have high accuracy in modeling error propagation in GPU threads, due to the potential of aggregating small errors into large ones. We identify two sources of inaccuracy in TRIDENT and propose the H-Value heuristic to mitigate them; this leads to an average accuracy improvement of approximately 1.9%. The sources of inaccuracy and the heuristics proposed to overcome them are as follows.

5.3.1 Multiplication by Zero

If a target operand of an instruction is multiplied by zero, the result will always be zero, regardless of errors present in the target operand. If this is frequently seen in a program and not taken into account while developing the model, the model will overestimate error propagation, since it does not consider this source of error masking. We find that there is a non-negligible amount of multiplication-by-zero in GPU kernels (i.e., as high as 56.6% in Lulesh, and on average 11.2%), hence this masking must be considered in GPU-TRIDENT. To account for this, we profile the operands of multiplication instructions and calculate the frequency of zero operands of mul instructions. We then weight the propagation and masking probabilities in the relevant instruction tuple (Section 2.5.1) accordingly.

Figure 5.3: Example of error masking due to multiply by 0.
  bb2:
    INDEX 1: $1 = load 0x4     (zero probability: 0%)
    INDEX 2: $2 = load 0x8     (zero probability: 40%)
    INDEX 3: $3 = mul $1, $2   (new propagation probability = 80%)
             br ...

Consider the code snippet shown in Figure 5.3, in which two values are loaded from memory and multiplied. Now suppose a fault occurs at INDEX 1, due to which its result ($1) is corrupted. This fault will not propagate beyond INDEX 3 ($3) if the other operand of the multiply instruction, i.e., $2 (the result of INDEX 2), is 0, resulting in error masking that TRIDENT does not take into account. In the example, we get the probability of $1 and $2 being zero, which is 0% and 40% respectively, and use it to modify the default propagation probability (100%) of INDEX 3. As each operand contributes 50% to the overall propagation probability, the new propagation probability would be (1.0 - 0.5*0.4 - 0.5*0)*100% = 80%.
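This adjustment can be written as a small helper. The sketch below mirrors the equal-per-operand weighting used in the Figure 5.3 example; the inputs are hypothetical profiled zero-probabilities, not GPU-TRIDENT's internal data structures.

def mul_propagation_probability(p_operands_zero):
    """Probability that a fault in one operand of a multiply propagates.

    p_operands_zero lists, for each operand of the multiply, the profiled
    probability that it equals zero. Each operand is assumed to contribute
    equally to the default 100% propagation probability, as in Figure 5.3.
    """
    n = len(p_operands_zero)
    masked = sum(p_zero / n for p_zero in p_operands_zero)
    return 1.0 - masked

# Figure 5.3: $1 is zero with probability 0.0, $2 with probability 0.4.
print(mul_propagation_probability([0.0, 0.4]))   # -> 0.8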
5.3.2 Lucky Stores

Recall that if an error modifies a branch direction, the store instructions dominated by the corrupted branch will not be executed correctly (Section 2.5.2). In this case, TRIDENT assumes that the error always propagates to the store instructions. However, if a store instruction was supposed to write a value that is already in memory, errors will not propagate even though the instruction is skipped due to a corrupted branch. This is similar to the phenomenon of silent stores (dynamic store instructions that do not change the state of a system) [31], which has been studied in the context of performance improvement in computer architecture. We call such store instructions lucky stores, similar to the lucky loads found in prior studies related to the reliability of CPUs [13, 35].

To understand this, consider Figure 5.4. Suppose that the store in bb2 was to be executed in a fault-free run, but due to a fault, the result of the INDEX 1 comparison instruction was flipped and bb2 was not executed. The model will conservatively assume that the store is corrupted and propagates the error. However, if zero was to be stored (and that store is missed due to the fault), and the memory location already contained a zero, the fault will have no effect on memory contents. In such a scenario, the outcome will be benign, and we will end up overestimating the SDC probability of the kernel. Identifying all the lucky stores requires recording every operand of store instructions in the memory execution trace, which can be extremely time-consuming. We observe that zero values are dominant in the operands of output stores of GPU kernels, as most initializations of kernel memory use zeros. Therefore, we record only the frequencies of output stores that have a zero operand at run-time, and weight the propagation probabilities in fc accordingly. This way, we keep the profiling overhead reasonably low. Our experiments show that an average of 7.6% and a maximum of 40% of output stores are lucky stores for the kernels used in this thesis.

Figure 5.4: Example of lucky stores.
  bb1:
    INDEX 1: cmp ...
  bb2:
    store ...
  bb3:
    ...

5.4 Analysis Workflow

Based on the above heuristics, the analysis workflow of GPU-TRIDENT consists of the following steps:

1. H-Value is applied to get the fs and fc sub-models based on the dynamic data profiled from the threads.
2. H-Intra is used to select a subset of threads to get the intra-thread memory dependencies, whose memory access instructions are then profiled.
3. H-Inter is used to select a subset of threads to get the inter-thread memory dependencies, whose memory access instructions are then profiled.
4. The fm model is created from the memory dependency graph, based on the profiled memory access instructions.
5. The sub-models (fc, fs, and fm) are used to get the SDC probability of individual instructions and of the kernel (see the sketch after this list).
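As a rough illustration of step 5, the sketch below combines per-instruction SDC estimates into a kernel-level figure by weighting each static instruction by its dynamic count, which mirrors how random FI samples dynamic instructions uniformly. The instruction names, probabilities, and the exact weighting are assumptions for illustration; the real combination is carried out by the fc, fs, and fm sub-models described earlier.

def kernel_sdc_probability(per_instruction_sdc, dynamic_counts):
    """Aggregate per-instruction SDC probabilities into a kernel-level one.

    per_instruction_sdc: static instruction id -> predicted SDC probability.
    dynamic_counts: static instruction id -> number of dynamic instances,
    used as sampling weights (random FI picks dynamic instructions
    uniformly at random).
    """
    total = sum(dynamic_counts.values())
    return sum(per_instruction_sdc[i] * dynamic_counts[i] / total
               for i in per_instruction_sdc)

# Hypothetical three-instruction kernel.
sdc = {"i1": 0.40, "i2": 0.10, "i3": 0.00}
counts = {"i1": 1000, "i2": 3000, "i3": 6000}
print(kernel_sdc_probability(sdc, counts))   # -> ~0.07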
5.5 Summary

In this chapter, we present heuristics for profiling different levels of memory dependencies in a GPU kernel. We find that a subset of threads selected based on the similarity of execution patterns of threads, loop iterations, and shared memory accesses within a thread block can be used to recover all the inter-thread as well as intra-thread memory dependencies of a kernel. This information is then used to construct the memory dependency graph of a kernel, which is used for tracking error propagation within as well as between threads. Finally, we identify two common data patterns in GPU kernel executions and use them to improve the accuracy of our modeling methodology.

Chapter 6
Evaluation

In Chapter 5, we introduced heuristics to overcome challenges in modeling error propagation in GPU applications. In this chapter, we evaluate the effectiveness of our proposed techniques. We first describe the experimental setup used (Section 6.1), giving an overview of the workflow of GPU-TRIDENT, the FI method used as the baseline, and the benchmarks we use for this evaluation. We then present our results for the accuracy and scalability of GPU-TRIDENT at the kernel and instruction level. Finally, we discuss the sources of inaccuracies in our proposed techniques.

6.1 Experimental Setup

6.1.1 Workflow and Implementation of GPU-TRIDENT

We implement GPU-TRIDENT as a set of LLVM compiler passes that are integrated into NVIDIA's NVCC compiler [42]. Although NVCC is based on LLVM, it does not expose its IR representation to external tools. Therefore, we attach a dynamic library [39] to insert the LLVM passes of GPU-TRIDENT into the toolchain. GPU-TRIDENT needs the application code (with the kernel under test annotated) and an input required to execute it. We also need to identify the static store instructions used by the kernel for transferring data to the host and designate them as output instructions. Using these inputs, GPU-TRIDENT profiles the program, and then estimates the SDC probability of individual instructions as well as that of the complete kernel without performing any FIs.

6.1.2 FI Method

Recall that GPU-TRIDENT aims to predict SDC probabilities, which are usually measured using FI. Therefore, we use FI as the baseline to establish our ground truth when evaluating GPU-TRIDENT. We use an open-source injector, LLFI-GPU [33], to perform FIs. LLFI-GPU aids comparison with GPU-TRIDENT, as both are implemented using LLVM. Therefore, we can map GPU-TRIDENT predictions to the FI results of individual instructions to examine the accuracy of GPU-TRIDENT on a per-instruction basis.
As described in the fault model, we consider faults in the computational elements of the GPU (Section 2.2), and we inject faults into the destination registers of executed instructions. Only one fault is injected per program execution. In each FI trial, the application is executed from the beginning. We uniformly choose an instruction at random from the set of executed instructions in the kernel. We then flip a single bit in its destination register and execute the program to completion. This method has been shown to be accurate for estimating SDC probabilities [45].
To determine if an SDC occurred, the entire array in device memory that is written by the kernel (to transfer data to the host code) is compared to its contents in a fault-free execution (i.e., golden run). This is because we only study error propagation within the GPU kernel, and hence cannot use the application output as is usually done for CPU programs [19].

6.1.3 Benchmarks

We choose a total of 17 kernels from 12 applications belonging to different domains, 8 of which are from the Rodinia benchmark suite [11], and 4 are open source HPC applications [1], [28], [3], [5]. These are listed in Table 6.1. They range in size from 16 to 511 static LLVM IR instructions, and from 16 to 8,192,000 in terms of total threads launched.
There is no standard set of benchmarks or applicationclassification for reliability analysis. Parallel applications are usually classifiedinto 13 categories (called dwarfs) based on their computation and communication33Benchmark Suite LLVMInsts.Kernel ID Uses sharedmemoryTotalthreadsNeedleman-Wunsch Rodinia 248 NW K1 Yes 48249 NW K2 Yes 16Pathfinder Rodinia 132 Pathfinder Yes 592,640BFS Rodinia 47 BFS K1 No 32,76820 BFS K2 No 32,768Gaussian Rodinia 59 Gaussian No 3,840HotSpot Rodinia 259 HotSpot Yes 473,344Particlefilter Rodinia 39 Particlefilter Yes 9,216LU Decomposition Rodinia 142 LUD K1 Yes 64238 LUD K2 Yes 19278 LUD K3 Yes 3,584SRAD Rodinia 511 SRAD K1 Yes 32,768193 SRAD K2 Yes 32,768Lulesh OS HPC [28] 29 Lulesh No 66,816Circuit OS HPC [1] 167 Circuit Yes 4,156,416Perf Benchmark OS HPC [5] 16 Perf BM No 8,192,000HPCCUDA OS HPC [3] 397 HPCCUDA No 236,544Table 6.1: Benchmarks used: 8 are from the Rodinia suite, and the other 4are open source HPC applications (OS HPC).methods [7]. The benchmarks we use in this thesis capture 6 of these 13 dwarfs.The choice of benchmarks was governed by whether they can be compiled withthe LLVM-based infrastructure of LLFI-GPU and GPU-TRIDENT, and whetherFI can be completed in a reasonable amount of time. We removed all sources ofrandomness in the benchmarks (e.g., changed random numbers to constant values),to get reproducible results for the experiments.We use the smaller inputs that come with each benchmark. We have also testedGPU-TRIDENT with bigger inputs for a subset of the same kernels (Section 7.2),and find that the prediction results closely match up with FI (though they take muchlonger). However, to keep the time for the detailed FI experiments manageable, wechoose to use the smaller inputs in this evaluation. Prior work [27, 33, 41, 53, 55]has also used similar input sizes.Table 6.1 also shows the times TRIDENT and GPU-TRIDENT take for eachof the kernels (as mentioned earlier, TRIDENT times are estimates). We considerthe time taken for both systems to obtain the overall SDC probability and per-instruction SDC probability. As can be seen, GPU-TRIDENT takes less than about10 minutes for all the kernels. In contrast, TRIDENT takes days or even months to34analyze the kernels.It should be noted that a subset of these benchmarks was used to identify (5benchmarks) the challenges and for devising heuristics (7 benchmarks) to avoidover-fitting. The rest of the benchmarks were just used to test the proposed tech-niques.6.2 AccuracyIn this section, we evaluate the accuracy of GPU-TRIDENT in predicting SDCprobabilities of the kernel and individual instructions.6.2.1 Overall SDC ProbabilityWe use GPU-TRIDENT to predict the SDC probability of a given kernel (and in-put), and then measure its SDC probability (under the same input) using randomFI. We perform 5000 FI experiments per kernel - the error bars for the SDC mea-surements range from ±0.53% to ±1.82% at the 99% confidence level.Figure 6.1 shows the result of our experiments. As can be seen, GPU-TRIDENTprovides reasonably accurate predictions. The average difference (mean absoluteerror) between FI and GPU-TRIDENT is 5.7% (in comparison, TRIDENT had anaverage difference of 4.75% [35] for CPU programs). 
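The accuracy metrics used in this comparison (mean absolute error and, below, the Pearson correlation coefficient) can be computed as in the following sketch; the per-kernel values shown are hypothetical placeholders, not the measured results.

import numpy as np

def accuracy_metrics(fi_sdc, predicted_sdc):
    """Mean absolute error and Pearson correlation between FI-measured
    and predicted SDC probabilities (one value per kernel)."""
    fi = np.asarray(fi_sdc, dtype=float)
    pred = np.asarray(predicted_sdc, dtype=float)
    mae = np.mean(np.abs(fi - pred))
    pearson = np.corrcoef(fi, pred)[0, 1]
    return mae, pearson

# Hypothetical SDC probabilities (%) for four kernels.
fi_vals = [30.0, 12.0, 45.0, 5.0]
pred_vals = [27.5, 14.0, 47.0, 6.0]
print(accuracy_metrics(fi_vals, pred_vals))

For the 17 kernels evaluated here, the corresponding mean absolute error is the 5.7% reported above.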
This number is inflated bythe results of Lulesh, BFS K1 and Pathfinder, which have a difference of ~24%,~15% and ~16% respectively (the difference is 2.97% if we remove these kernels).We explain the reasons in Section 6.4.Pearson Correlation CoefficientWe also calculate the Pearson correlation coefficient between the FI results andGPU-TRIDENT predictions for all the kernels. The correlation coefficient is 0.88,showing a high agreement between them. The correlation coefficient increases to0.99, if we ignore the three outliers described earlier.Paired T-testWe also use a paired t-test to examine if the predictions are statistically differ-ent from the FI measurements, in line with TRIDENT’s method [35]. We first35Figure 6.1: The SDC probability of GPU kernels predicted by FI and GPU-TRIDENT. Error bars are shown for the FI estimates at the 99% confi-dence level - they range from ±0.53% to ±1.82%.checked that the differences between the predictions and FI measurements are ap-proximately normally distributed, as required by the t-test. In the t-test, our null-hypothesis is that there is no statistically significant difference between the resultsfrom FIs and the predicted SDC probabilities by GPU-TRIDENT for the 17 ker-nels. We find that the t-test yields a p-value of 0.36, which is much higher than thecustomary threshold of 0.05 [52], and hence we fail to reject the null hypothesis.6.2.2 Instruction SDC ProbabilityWe also evaluate the prediction accuracy of per-instruction SDC probabilities foreach kernel. We use GPU-TRIDENT to predict the SDC probabilities of all staticinstructions, and then compare the predicted values with the FI results. In FI, weinject 100 random faults in each static instruction of the kernel (a random dynamicinstance of the instruction is chosen for FI in each run).36Pearson Correlation CoefficientAs before, we calculate the Pearson correlation coefficient between the SDC con-tribution by each static instruction found using FI and by GPU-TRIDENT for allkernels. The average correlation coefficient was 0.83, excluding the outliers men-tioned in Section 6.2.1 and Gaussian. Gaussian has a low coefficient which isreflected in the predicted SDC probability being more than double the measuredSDC probability, although the percentage difference is ~5%. The correlation anal-ysis shows that the per-instruction SDC probability obtained by GPU-TRIDENThas a high agreement with FI results.Paired T-testWe also perform paired T-test for SDC probability of individual instructions. Wefirst check if the distribution of the differences between the prediction and the FImeasurement is approximately normally distributed. The normality does not holdfor SRAD K1, SRAD K2, Circuit and Particlefilter, and so we exclude them fromthis experiment (however, the predicted SDC probabilities for these kernels areclose to the FI results (Figure 6.1)). We perform paired t-test experiments in theremaining 13 kernels. The number of paired data in the t-test for each kernel is thenumber of its static instructions. As before, our null hypothesis is that there is nodifference between the FI measurement and the predicted SDC probability of each(static) instruction by GPU-TRIDENT. The p-values are higher than 0.05 in 11 outof the 13 kernels (all except Lulesh and NW K1), and hence we cannot reject thenull hypothesis.6.3 ScalabilityIn this section, we assess the scalability of GPU-TRIDENT with respect to FI. 
Weevaluate the scalability both in terms of the overall SDC probabilities of kernelsas well as those of individual instructions. As mentioned earlier, our accuracyexperiment considered 5,000 trials and obtained error bars at the 99% confidenceinterval. However, if one is prepared to accept lower confidence intervals or loosererror bars, it is possible to decrease the number of FI experiments (i.e., samples). Toevaluate the scalability of GPU-TRIDENT against FI, we compare the time taken37by GPU-TRIDENT and FI when using 500 to 5,000 samples. This method is inline with the number of samples used in other studies [18, 27, 33].As mentioned earlier (Section 2.5), executing GPU-TRIDENT can be dividedinto two phases. (1) Profiling phase, which requires multiple executions of theprogram to collect data, and the (2) inference phase, which uses the informationobtained from the profiling phase to calculate the SDC probabilities for static in-structions, and requires no program executions. We implement the inference phaseof GPU-TRIDENT in (1) single-threaded mode and (2) multi-threaded mode. How-ever, we do not parallelize the profiling phase, to keep it consistent with FI, whichis not parallelized either. Note that parallelizing the profiling phase is non-trivial asit requires either multiple GPUs or the ability to run multiple applications simulta-neously on a single GPU.6.3.1 Kernel SDC ProbabilityFigure 6.2a shows the average time taken by FI and both implementations of GPU-TRIDENT to predict the SDC probability of CUDA kernels. Due to space con-straints, we show the average time across all the kernels. It can be seen that thedifferences between the times taken by GPU-TRIDENT and FI increase sharply asthe number of samples is increased. For example, for 500 samples, GPU-TRIDENTis 2.2 and 5.7 times faster than FI (for single and multi-threaded implementationsrespectively). This increases to 12.7 and 33.4 times for 3,000 samples, and 21.1and 55.6 times for 5,000 samples. This is because FI has negligible fixed costat startup, but every FI trial requires a complete program execution. So the timerequired for FI experiments increases linearly with the number of FI samples. Incontrast, GPU-TRIDENT has an initial fixed cost for building the model. Once themodel is built, calculating the SDC probability of individual instructions has a neg-ligible cost, as it only involves a table lookup. This results in predominantly flatlines for GPU-TRIDENT in Figure 6.2a. It can be seen that multi-threaded GPU-TRIDENT is on average around 2.5X faster (on an 8-core CPU) than its single-threaded counterpart.38(a) Overall SDC Probability (b) Instruction SDC ProbabilityFigure 6.2: Computation Spent to Predict SDC Probability6.3.2 Instruction SDC ProbabilityFigure 6.2b compares the average time taken by GPU-TRIDENT and FI, for gettinginstruction wise SDC probability, for different numbers of static instructions in thekernel, with different numbers of faults injected (100, 500 and 1,000) per instruc-tion. The time shown in the figure is projected based on the per instruction timesrecorded for kernels that we use in our evaluation. The number of samples perinstruction is appended as suffixes in Figure 6.2b. For example, FI-100 means that100 faults are injected into each static instruction. Because GPU-TRIDENT doesnot need to sample individual instructions, the time taken by it remains constant;therefore, it has only one curve for each implementation. 
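The contrast between the two cost curves can be captured by a back-of-the-envelope model. All numbers below are hypothetical and only illustrate that FI cost grows linearly with the number of samples, while GPU-TRIDENT's cost is dominated by a one-time profiling phase.

def fi_time(num_samples, seconds_per_run):
    """FI cost grows linearly: every injection re-executes the kernel."""
    return num_samples * seconds_per_run

def gpu_trident_time(profile_seconds, inference_seconds):
    """GPU-TRIDENT pays a fixed profiling cost; per-instruction lookups
    afterwards are effectively free."""
    return profile_seconds + inference_seconds

# Hypothetical numbers: a 10-second kernel with 5,000 FI samples, versus
# a 15-minute profiling phase plus a 2-minute inference phase.
print(fi_time(5000, 10) / 3600.0)                   # -> ~13.9 hours
print(gpu_trident_time(15 * 60, 2 * 60) / 3600.0)   # -> ~0.28 hours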
It can be seen that as thenumber of instructions in a kernel increases, the difference between the time takenby FI and GPU-TRIDENT also increases. For example, at 50 instructions, FI-100takes around 4 hours more than GPU-TRIDENT (averaged for both implementa-tions). This difference increases to around 84 hours for 1,000 instructions (morethan 20X increase).Figure 6.3 shows the wall-clock time taken by both implementations of GPU-TRIDENT and FI (100 faults injected per instructions), for obtaining instructionwise SDC probability, for each kernel. There is a wide variation in times takenby GPU-TRIDENT and FI for different kernels. For example, BFS K2 takes 0.16hours, while Circuit takes 63.75 hours for FI, so we use a logarithmic scale in thefigure on the y-axis. From the figure, we can see that on average, single-threaded39GPU-TRIDENT is faster than FI by more than one order of magnitude (~38X),while multi-threaded GPU-TRIDENT is faster than FI by about two orders of mag-nitude on average. Further, higher the complexity of the application, greater theimprovement, e.g., Circuit and Lulesh in Figure 6.3.Figure 6.3: Performance comparison of FI and GPU-TRIDENT6.4 Sources of Inaccuracies in GPU-TRIDENTIn this section, we discuss the three sources of inaccuracies in GPU-TRIDENT.6.4.1 Validity of Faulty Store AddressesGPU-TRIDENT assumes that if a wrong memory location is accessed, it will eitherresult in a crash if the memory location is invalid, or an SDC if the memory locationis valid. However, in some cases, the accessed wrong memory location is valid, butis out of the scope of the kernel. This results in a benign outcome.On the other hand, because FI records the kernel’s output in the host code,it has the knowledge about the memory that should be written by GPU-Kernelin order to cause an SDC - therefore, it can obtain the exact SDC probability.This is one of the major reasons for SDC overestimation by GPU-TRIDENT forboth Luelsh and Pathfinder. This inaccuracy can be mitigated by incorporatingthe information about the ratio of the memory used by the kernel that causes anSDC into GPU-TRIDENT. For example, in the case of Luelsh, if we integrate theinformation about the ratio of total memory used by FI to detect SDC (0.5 in thiscase, obtained through analysis of the host code) into GPU-TRIDENT, the absoluteerror in prediction is reduced from 24% to just 0.31%.406.4.2 Conservativeness in Determining Memory CorruptionIn Section 5.3.2, we discuss the phenomenon of lucky stores and introduce a heuris-tic to identify lucky output store instructions. However, lucky stores can also occurin the intermediate stores of the kernel, e.g., when a store to local or shared mem-ory dominated by a faulty branch is missed or incorrectly executed, but the valuewritten happens to be the same as the correct value. GPU-TRIDENT assumes thatany fault propagating to a store instruction results in an erroneous memory value,and hence over-estimates the SDC probability. This is another reason for the inac-curacy in Pathfinder (in addition to the first).Similarly, GPU-TRIDENT assumes that a fault in the address operand of a loadinstruction propagates to the instruction’s output. However, if the incorrect mem-ory location has the same contents as the fault-free memory (i.e., lucky load [13]),it will result in a benign outcome. This is the major reason for inaccuracy for BFSK1 (especially in the earlier kernel invocations), where most of the device memoryallocated initially contains zeroes. 
Therefore, when a load instruction reads froman incorrect memory address, it is likely to load a zero, which is the same value itwould obtain in a fault-free run. Using FI, we find that around 15% of load instruc-tions in BFS K1 are affected by this phenomenon, resulting in over-estimation ofSDC probability by GPU-TRIDENT.6.4.3 Heuristics About Error PropagationThe original TRIDENT considers only three instruction types for logical masking,namely comparisons, logical, and cast operations. GPU-TRIDENT also includesmultiply instructions based on dynamic profiling (Section 5.3.1). However, otherinstructions can also mask errors. For example, the fdiv instruction can mask errorswhile averaging the corrupted bits in the mantissa. This results in GPU-TRIDENTover-estimating SDCs.H-Inter Heuristic: A kernel which has control-flow divergence between threadsin a thread block (Figure 5.2b) can result in GPU-TRIDENT missing some inter-thread memory dependencies in the memory dependency graph, which can resultin inaccuracies (both under- and over-estimates are possible).416.5 LimitationsIn this section, we describe some of the main limitations of the evaluation describedin this thesis.6.5.1 FI MethodologyWe use LLFI-GPU, which injects single bit flips faults at the LLVM IR level. Thismethod has been shown to be accurate for estimating SDC probabilities [15, 33,45], but its effectiveness for other types of failures is an open question. That said,we focus primarily on SDCs in this thesis, which makes LLFI-GPU an appropriatemethod for our purposes.6.5.2 Evaluation KernelsWe choose 17 kernels from 12 benchmarks applications belonging to different do-mains and suites. However, some of the benchmark kernels are quite small. Forexample BFS K2 has only 20 static instructions. We have included benchmarks(Rodinia) and real applications (Circuit, Lulesh, HPCCUDA) that are much larger.Other work in this domain has also used a similar approach [27, 33, 41].6.6 SummaryIn this chapter, we evaluate the accuracy and performance of GPU-TRIDENT againsttraditional FI methodologies. We observe that the predicted SDC probability closelyfollow the values obtained from FI. The mean absolute error is 5.7%, with a Pear-son correlation coefficient of 0.88. If we do not consider the outliers (3 out of17 kernels), then this error is reduced to 2.97%, while the new Pearson correla-tion coefficient is 0.99. GPU-TRIDENT is also found to be accurate in predictingSDC probabilities for individual instructions of kernels. We also find that GPU-TRIDENT has minimal performance overheads compared to FI. It is, on average,55 times faster than FI in evaluating kernel SDC probability and two orders ofmagnitude faster than FI for finding the SDC probability of individual instructions.42Chapter 7Case StudiesIn Chapter 6 we evaluated the accuracy and scalability of GPU-TRIDENT againstFI. We found that GPU-TRIDENT is highly accurate for most applications, andhas a very low performance overhead. This opens up many avenues in reliabilityanalysis, which can not be explored in detail due to the resource-intensiveness ofFI.In this chapter, we use GPU-TRIDENT to study two such applications. Thefirst application is selective instruction duplication guided by GPU-TRIDENT. Se-lective instruction duplication requires SDC probability of every instruction in akernel, obtaining them can be very time consuming using FI, as it requires a lot oftrials for statistically significant results. 
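To see why the required trial counts add up quickly, the usual normal-approximation sample-size formula, assuming the worst-case SDC probability p = 0.5, gives roughly the 5,000 trials per kernel that produced the ~±1.82% error bars at the 99% confidence level in Chapter 6. The sketch below is an illustration of that formula, not part of GPU-TRIDENT.

import math

def fi_trials_needed(margin, confidence_z=2.576, p=0.5):
    """Number of random FI trials needed so that the SDC estimate has the
    given margin of error (normal approximation, z = 2.576 for 99%)."""
    return math.ceil(confidence_z ** 2 * p * (1 - p) / margin ** 2)

print(fi_trials_needed(0.0182))   # -> 5009, i.e., about 5,000 trials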
As GPU-TRIDENT intrinsically providesper instruction SDC probability, it is a low overhead alternative to FI. Second,we use GPU-TRIDENT to study the variation in SDC probability of GPU kernelsacross multiple inputs. GPU-TRIDENT provides an efficient alternative to FI forthis study, as for larger inputs, FI quickly becomes intractable.7.1 Selective Instruction DuplicationIn this section, we use GPU-TRIDENT to guide selective instruction duplication,which is a standard protection technique to detect SDCs [18, 29, 48]. Selective in-struction duplication can provide high error coverage with low-performance over-head. It duplicates most vulnerable instructions and compares their outputs at run-43time to detect any differences as errors [18, 29, 35]. Typically, it is assumed thatthe maximum performance overhead for the protection is fixed to a budget value.The main question while applying selective instruction duplication is: Which staticinstructions should be duplicated to reduce the overall program SDC, given a per-formance overhead budget?. The predominant way to answer this question forGPU applications today is through FI, which is very time consuming, as it requiresper instruction SDC probabilities. In contrast, we study the usefulness of GPU-TRIDENT in replacing FI to answer this question, by accurately predicting the SDCpercentages of individual instructions.GPU-TRIDENT gives us the SDC probability of instructions, while the dy-namic count for each instruction can be used as a proxy for its overhead. Usingthis information, we can frame the original question as a classical 0-1 knapsackproblem [38]. In this case, maximum allowed overhead is the sack capacity, in-structions are the objects while their SDC probability is their associated profit. Forsimplicity, we consider instruction SDC probabilities to be independent of eachother, which is a conservative assumption that has also been made by the previ-ous work [35]. We use GPU-TRIDENT to estimate the SDC probability of eachinstruction, and use the dynamic instruction count as a proxy of the instruction’sperformance overhead (measuring the exact overhead is tedious).We consider two protection overhead levels, which correspond to one-thirdand two-thirds performance overhead of duplicating all the instructions in a givenGPU kernel. After identifying the instructions to duplicate, for each protectionlevel, we select the instructions to be duplicated, and then we project the results ofduplicating these instructions, from actual FI experiments.Figure 7.1 shows the SDC probabilities of kernels after applying selective du-plication. Without protection, the average SDC probability across benchmarks is33.73%. With one-third protection overhead, the average SDC probability is re-duced to 14.2%, which is a reduction of about 58%. With a two-thirds protectionoverhead, the average SDC probability is further reduced to 5.27%, which is a re-duction of about 85%. Note that FI is only used to measure the SDC probabilityafter the protection to validate the predictions made by GPU-TRIDENT.Figure 7.2 shows the average SDC coverage, for all benchmarks, provided byinstruction duplication when FI and GPU-TRIDENT guide it at different overhead44Figure 7.1: SDC Probabilities of selective instruction duplication.Figure 7.2: Average Protection curve obtained by FI and GPU-TRIDENT.levels. We can see that the protection curve of GPU-TRIDENT closely follows thatof FI throughout the range. 
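The selection behind these protection curves can be approximated with a simple greedy pass over the knapsack formulation described above, picking the best SDC-reduction per unit of duplication overhead. This is only a sketch: GPU-TRIDENT's actual selection may solve the 0-1 knapsack differently, and the instruction names, SDC contributions, and dynamic counts below are hypothetical.

def select_instructions(sdc_contrib, dyn_count, overhead_budget):
    """Greedy approximation of the 0-1 knapsack selection: duplicate the
    instructions with the best SDC-reduction per unit of overhead until
    the budget (in dynamic instructions) is spent."""
    ranked = sorted(sdc_contrib,
                    key=lambda i: sdc_contrib[i] / dyn_count[i],
                    reverse=True)
    chosen, spent = [], 0
    for inst in ranked:
        if spent + dyn_count[inst] <= overhead_budget:
            chosen.append(inst)
            spent += dyn_count[inst]
    return chosen

# Hypothetical per-instruction SDC contributions and dynamic counts;
# the budget is one third of the total dynamic instruction count.
sdc = {"i1": 8.0, "i2": 5.0, "i3": 1.0, "i4": 0.5}
cost = {"i1": 4000, "i2": 1000, "i3": 2000, "i4": 500}
budget = sum(cost.values()) // 3
print(select_instructions(sdc, cost, budget))   # -> ['i2', 'i4']

A full dynamic-programming knapsack would give the optimal selection, but it is only practical when the overhead budget is coarsened; the greedy ratio heuristic is a common low-cost stand-in.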
The maximum difference between these two curves isabout 13.34%, which translates to an absolute difference of only 4.4% in terms ofSDC percentages. Figure 7.3 gives an overview of the protection curve of selectiveinstruction duplication obtained using FI and GPU-Trident, for individual kernels.We see that protection curves of individual kernels also show a trend similar toFigure 7.2 as the protection provided by GPU-TRIDENT closely follows protectionprovided by FI for most of the kernels. Thus, GPU-TRIDENT is an efficient andaccurate replacement for FI in guiding selective instruction duplication.7.2 Characterizing SDCs Across Multiple InputsAs stated earlier, FI is traditionally used to analyze the reliability of GPU applica-tions and is time-consuming as GPU applications can consist of millions of threads.This problem is exacerbated in the case of multiple inputs, as the whole FI process45(a) NW K1 (b) NW K2 (c) Pathfinder(d) BFS K1 (e) BFS K2 (f) Gaussian(g) Hotspot (h) Particlefilter (i) LUD K1(j) LUD K2 (k) LUD K3 (l) SRAD K1(m) SRAD K2 (n) Lulesh (o) Circuit(p) HPC CUDA (q) Perf BMFigure 7.3: Protection curves obtained by FI and GPU-TRIDENT for individ-ual kernels, as protection overhead is varied from 0% and 100%. Bluelines indicate FI, and orange lines represent GPU-TRIDENT.46Kernel ID Input 1 Input 2 Input 3BFS K2 32,768 655,360 12,005,376Hotspot 473,344 9,216 1,893,376Particlefilter 9,216 18,432 27,648LUD K1 64 256 512LUD K2 192 3,840 15,872LUD K3 3,584 317,440 2,666,496SRAD K1 32,768 524,288 2,097,152SRAD K2 32,768 524,288 2,097,152NW K1 48 576 2,176NW K2 16 448 1,920HPCCUDA 236,554 1,182,720 3,548,160Perf BM 8,192,000 16,384,000 24,576,000Table 7.1: Number of threads invoked by different kernels, for multiple in-puts.has to be repeated for each input. Further, the fault space of FI can become huge forbigger inputs. In this thesis, we have presented GPU-TRIDENT, which is an effi-cient alternative to FI. This performance and resource efficiency of GPU-TRIDENTcan be used to characterize the reliability of a GPU kernel across multiple inputs.In this section, we use GPU-TRIDENT to analyze the input-dependence ofthe error resilience of GPU kernels. We select a subset of benchmarks from Sec-tion 6.1.3 for this case study, that give accurate results with GPU-TRIDENT (abso-lute error <5% from Section 6.2.1). We also ensure that the selected benchmarkshave standard input available from the source of the benchmark, to avoid any bi-ased inputs (e.g., all zeros), which do not represent the real-world resilience char-acteristics of an application [47]. Table 7.1 shows the selected benchmarks andthe number of threads launched by different inputs. Input 1, is the input used inSection 6.1.3, while Input 2 and Input 3 are new inputs.For this study, we purposefully select new inputs that invoke a relatively largenumber of threads (up to 24.5 million), as the main benefit of GPU-TRIDENT isthat it has very small overhead even for applications with large numbers of threads,in contrast to FI, whose practicality is limited in case of bigger inputs due to highoverheads associated with it. In this way, we can study the reliability of GPU47applications under bigger inputs.Figure 7.4 shows the SDC probability predicted by GPU-TRIDENT for all thebenchmarks. We observe from our experiments that the SDC probability of a ker-nel exhibits little variation across different inputs of different sizes. The averagevariation across inputs of all benchmarks is 0.56%. 
The maximum variation inpredicted SDC probability is 5.66% between Input 1 and Input 3 for NW K2. LUDK1, LUD K3, HPCCUDA and Perf BM show no variation across all the inputs.This result is in line with previous studies of the variation in the reliability of GPUkernels across multiple inputs that use FI [47].Figure 7.4: SDC Probabilities predicted by GPU-TRIDENT with multiple in-puts.We also perform FI experiments to test that accuracy results from the earlierexperiments (Section 6.2.1) hold for newer inputs as well. FI has a high executiontime for kernels that launch a large number of threads. So we only select ker-nels that launch a limited number of threads for this experiment and perform FIexperiments with Input 2 and Input 3, as we already have the results for Input 1from Section 6.2.1. The four chosen kernels are LUD K1, LUD K2, LUD K3 andParticlefilter.We perform 3,000 random FI experiments with the four kernels. We observe48that the average variation of SDC probabilities predicted by GPU-TRIDENT withthat of FI is 1.07%, while the maximum variation is 2.24% (for Particlefilter be-tween Input 1 and Input 2). The error bars for these SDC predictions from FI rangefrom 1.76% to 1.78% at 95% confident interval. This shows that predictions byGPU-TRIDENT are in line with the FI results.The results from this case study thus show that the reliability of GPU kernelsdoes not vary widely across different standard inputs of varying sizes. Therefore,we can use a small set of valid inputs to evaluate the reliability of GPU applicationsinstead of exploring a large input space.49Chapter 8Conclusion and Future Work8.1 SummaryGPUs are commonly used to accelerate scientific and safety-critical applicationstoday. Transient hardware faults are increasing in frequency due to progressivetechnology scaling, which affects the reliability of GPUs.FI techniques have been traditionally used to assess the reliability of GPU ap-plications, but due to the increasing scale of applications, this has become impracti-cal. This thesis analyzes the reliability characteristics of GPU kernels in a scalableand efficient manner without performing FI.To get the reliability profile of the GPU applications, we introduce GPU-TRIDENT, which models soft error propagation in GPU kernels without performingany FI. Due to the typically large number of threads in a GPU kernel, it is chal-lenging to model error propagation accurately and scalably. We observe that errorpropagation in GPU programs can be modeled by carefully analyzing the execu-tion patterns of GPU threads pertaining to data repetition, memory accesses, loopiterations, and control-flow. We find that insufficient details about different levelsof memory dependencies in a GPU kernel can cause a significant detrimental ef-fect on accurately modeling the error propagation. On the other hand, accuratelygetting this information is often the biggest scalability bottleneck in the model.To address these challenges, we propose several heuristics to prune the er-ror propagation space by selecting only a small subset of threads for analysis of50memory dependencies, based on the characteristics mentioned above, which sig-nificantly improves the time required for creating the model. We implementedGPU-TRIDENT as LLVM compiler passes, and evaluated it on 17 GPU Kernels.Our results show that the accuracy of GPU-TRIDENT is comparable to FI both forthe kernel as a whole (Pearson correlation coefficient is 0.88) and for individualinstructions. 
Further, GPU-TRIDENT is two orders of magnitude faster, and muchmore scalable than FI for finding the SDC probabilities of kernels as a whole, aswell as individual instructions.We evaluate the use of GPU-TRIDENT in two applications. First, we foundthat GPU-TRIDENT can be used to guide selective instruction duplication withcomparable accuracy as FI, reducing the SDC probability by approximately 58%and 85% for 1/3rd and 2/3rd overhead, respectively. Second, we deploy GPU-TRIDENT to study the variance of reliability of GPU applications across multipleinputs and find that their SDC probability is generally insensitive to the change ininput data.In conclusion, this thesis presents an accurate and scalable method for model-ing error propagation GPU kernels, which can be integrated into the developmentcycle of real-world applications.8.2 Future WorkThere are three potential areas in which this work can be extended.8.2.1 Extending GPU-TRIDENTThere are certain sources of inaccuracies in GPU-TRIDENT identified in Sec-tion 6.4, these can be analyzed in detail, and their mitigation can be integrated intothe proposed techniques to improve their accuracy. Moreover, the GPU-TRIDENTcan be extended to analyze the behavior of applications during crashes to predictthe crash probabilities of individual instructions. This updated analysis method canthen guide Checkpoint/restart (C/R), which is one of the most popular methods, torecover from faults that cause a crash in programs [8, 54]. This method involvesidentifying the most crash-prone sites in the program (which can be done usingGPU-TRIDENT) and saving the program state to resume execution from there if51the application crashes.8.2.2 End to End Reliability Analysis of GPU ApplicationsGPU-TRIDENT provides an accurate framework for modeling error propagationwithin a GPU kernel. A typical GPU application can consist of two parts: the hostcode that runs on CPU, and device code consisting of all the GPU kernels present inthe application. During the execution of the application, data is transferred betweenthese two parts, which can result in error propagating from device code to hostcode and vice versa. Due to this, the fault activated in one kernel can propagateto other kernels a well. So, to get a complete overview of the reliability of a GPUapplication, error propagation due to all these interactions needs to be modeled.The techniques proposed in this thesis can be combined with previous ones likeTRIDENT [35] to get an end-to-end resilience estimate.8.2.3 Error Propagation Modeling in Accelerator PlatformsThe recent rise of resource-intensive machine learning applications has boostedthe interest of academia and industry in domain-specific accelerators (DSA) formachine learning as they provide enhanced performance with significantly lessenergy budget compared to general-purpose computers. As these accelerators areusually deployed in safety-critical tasks such as autonomous vehicles and businessanalytics, their reliability warrants an investigation. Although some studies haveused fault Injectors for studying the resilience characteristics of accelerators [34,44], these efforts are bound to run into the same scalability issues of FI, that wehave discussed in this thesis. Extending the framework and heuristics proposed inthis thesis to these accelerator platforms is a potential direction of future work.52Bibliography[1] Parallel circuit solver. https://github.com/glaswep/hpc. [Online; accessedApr. 2016]. 
→ pages xi, 18, 19, 33, 34[2] Benchmarks for CPU, GPU, Memory, Disk Performance.https://github.com/sswarnak77/Performance-Benchmarking. [Online;accessed March 2020]. → page 1[3] A program of blurring picture by multi-threads in CUDA.https://github.com/caofengnian/HPCCUDA. [Online; accessed March 2020].→ pages 33, 34[4] Training AI for Self-Driving Vehicles: the Challenge of Scale.https://devblogs.nvidia.com/training-self-driving-vehicles-challenge-scale/.[Online; accessed April 2020]. → page 1[5] CUDA Refresher: Reviewing the Origins of GPU Computing.https://devblogs.nvidia.com/cuda-refresher-reviewing-the-origins-of-gpu-computing/. [Online; accessedApril 2020]. → pages 33, 34[6] Nvidia tesla v100 gpu architecture. https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf. [Accessed May2020]. → page 9[7] K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubiatowicz,N. Morgan, D. Patterson, K. Sen, J. Wawrzynek, D. Wessel, and K. Yelick.A view of the parallel computing landscape. Commun. ACM, 52(10):56–67,Oct. 2009. ISSN 0001-0782. doi:10.1145/1562764.1562783. URLhttps://doi.org/10.1145/1562764.1562783. → page 34[8] G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Fedak, C. Germain,T. Herault, P. Lemarinier, O. Lodygensky, F. Magniette, V. Neri, and53A. Selikhov. Mpich-v: Toward a scalable fault tolerant mpi for volatilenodes. In SC ’02: Proceedings of the 2002 ACM/IEEE Conference onSupercomputing, pages 29–29, 2002. → page 51[9] C. Chang, S. Lym, N. Kelly, M. B. Sullivan, and M. Erez. Evaluating andaccelerating high-fidelity error injection for hpc. In SC18: InternationalConference for High Performance Computing, Networking, Storage andAnalysis, pages 577–589, 2018. → page 8[10] A. Chatzidimitriou, P. Bodmann, G. Papadimitriou, D. Gizopoulos, andP. Rech. Demystifying soft error assessment strategies on arm cpus:Microarchitectural fault injection vs. neutron beam experiments. In 201949th Annual IEEE/IFIP International Conference on Dependable Systemsand Networks (DSN), pages 26–38, 2019. → page 8[11] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, andK. Skadron. Rodinia: A benchmark suite for heterogeneous computing. InInternational Symposium on Workload Characterization (IISWC 2009),pages 44–54. IEEE, 2009. → page 33[12] G. Z. Chrysos and J. S. Emer. Memory dependence prediction using storesets. In Proceedings. 25th Annual International Symposium on ComputerArchitecture (Cat. No.98CB36235), pages 142–153, 1998. → page 23[13] J. J. Cook and C. Zilles. A characterization of instruction-level errorderating and its implications for error detection. In International Conferenceon Dependable Systems and Networks(DSN), pages 482–491. IEEE, 2008.→ pages 9, 30, 41[14] E. W. Czeck and D. P. Siewiorek. Observations on the effects of faultmanifestation as a function of workload. IEEE Transactions on Computers,41(5):559–566, 1992. → page 16[15] D. Di Leo, F. Ayatolahi, B. Sangchoolie, J. Karlsson, and R. Johansson. Onthe impact of hardware faults–an investigation of the relationship betweenworkload inputs and failure mode distributions. Computer Safety,Reliability, and Security, pages 198–209, 2012. → pages 16, 42[16] B. Fang, K. Pattabiraman, M. Ripeanu, and S. Gurumurthi. Gpu-qin: Amethodology for evaluating the error resilience of gpgpu applications. InInternational Symposium on Performance Analysis of Systems and Software(ISPASS), 2014, pages 221–230. IEEE, 2014. → pages 14, 1754[17] B. Fang, Q. Lu, K. Pattabiraman, M. 
Ripeanu, and S. Gurumurthi. ePVF: Anenhanced program vulnerability factor methodology for cross-layerresilience analysis. In Proceedings of the International Conference onDependable Systems and Networks (DSN), pages 168–179. IEEE, 2016. →pages 15, 17[18] S. Feng, S. Gupta, A. Ansari, and S. Mahlke. Shoestring: Probabilistic softerror reliability on the cheap. In Architectural Support for ProgrammingLanguages and Operating Systems, pages 385–396, 2010. → pages2, 9, 15, 16, 17, 38, 43, 44[19] G. Georgakoudis, I. Laguna, H. Vandierendonck, D. S. Nikolopoulos, andM. Schulz. Safire: Scalable and accurate fault injection for parallelmultithreaded applications. In 2019 IEEE International Parallel andDistributed Processing Symposium (IPDPS), pages 890–899, 2019. → page33[20] L. Guo and D. Li. Moard: Modeling application resilience to transient faultson data objects. In 2019 IEEE International Parallel and DistributedProcessing Symposium (IPDPS), pages 878–889, 2019. → page 9[21] L. Guo, D. Li, I. Laguna, and M. Schulz. Fliptracker: Understanding naturalerror resilience in hpc applications. In SC18: International Conference forHigh Performance Computing, Networking, Storage and Analysis, pages94–107, 2018. → page 15[22] S. K. S. Hari, S. V. Adve, H. Naeimi, and P. Ramachandran. Relyzer:Exploiting application-level fault equivalence to analyze applicationresiliency to transient faults. In Architectural Support for ProgrammingLanguages and Operating Systems, pages 123–134, 2012. → pages 2, 9, 16[23] S. K. S. Hari, R. Venkatagiri, S. V. Adve, and H. Naeimi. Ganges: Gangerror simulation for hardware resiliency evaluation. In ACM/IEEE 41stInternational Symposium on Computer Architecture (ISCA), pages 61–72.IEEE, 2014. → pages 2, 9, 16[24] S. K. S. Hari, T. Tsai, M. Stephenson, S. W. Keckler, and J. Emer.Sassifi:evaluating resilience of gpu applications. In IEEE Workshop ofSilicon Errors in Logic (SELSE), 2014. IEEE, 2015. → pages 14, 17[25] S. K. S. Hari, T. Tsai, M. Stephenson, S. W. Keckler, and J. Emer. Sassifi:An architecture-level fault injection tool for gpu application resilience55evaluation. In 2017 IEEE International Symposium on PerformanceAnalysis of Systems and Software (ISPASS), pages 249–258, 2017. → page 8[26] M. Harris. Unified memory for CUDA beginners. NVIDIA Blog, 2016.[Online; accessed 18-Jan-2018]. → page 8[27] C. Kalra, F. Previlon, X. Li, N. Rubin, and D. Kaeli. Prism: Predictingresilience of gpu applications using statistical methods. In InternationalConference for High Performance Computing, Networking, Storage, andAnalysis (SC), 2018. ACM, 2018. → pages 2, 8, 16, 17, 34, 38, 42[28] I. Karlin. Lulesh programming model and performance ports overview.https://computing.llnl.gov/projects/co-design/lulesh ports1.pdf. [AccessedApr. 2016]. → pages 19, 33, 34[29] I. Laguna, M. Schulz, D. F. Richards, J. Calhoun, and L. Olson. Ipas:Intelligent protection against silent output corruption in scientificapplications. In 2016 IEEE/ACM International Symposium on CodeGeneration and Optimization (CGO), pages 227–238, 2016. → pages9, 16, 43, 44[30] C. Lattner and V. Adve. LLVM: A compilation framework for lifelongprogram analysis & transformation. In International Symposium on CodeGeneration and Optimization, page 75. IEEE, 2004. → pages 4, 10, 11[31] K. M. Lepak and M. H. Lipasti. On the value locality of store instructions.In Proceedings of the 27th Annual International Symposium on ComputerArchitecture, ISCA ’00, page 182–191, New York, NY, USA, 2000.Association for Computing Machinery. 
ISBN 1581132328.doi:10.1145/339647.339678. URL https://doi.org/10.1145/339647.339678.→ page 30[32] G. Li and K. Pattabiraman. Modeling input-dependent error propagation inprograms. In Proceedings of the International Conference on DependableSystems and Networks (DSN), 2018. → pages 2, 15, 16[33] G. Li, K. Pattabiraman, C.-Y. Cher, and P. Bose. Understanding errorpropagation in GPGPU applications. In International Conference for HighPerformance Computing, Networking, Storage and Analysis, pages 240–251.IEEE, 2016. → pages 3, 8, 14, 17, 19, 33, 34, 38, 42[34] G. Li, S. K. S. Hari, M. Sullivan, T. Tsai, K. Pattabiraman, J. Emer, and S. W.Keckler. Understanding error propagation in deep learning neural network56(dnn) accelerators and applications. In Proceedings of the InternationalConference for High Performance Computing, Networking, Storage andAnalysis, SC ’17, New York, NY, USA, 2017. Association for ComputingMachinery. ISBN 9781450351140. doi:10.1145/3126908.3126964. URLhttps://doi.org/10.1145/3126908.3126964. → page 52[35] G. Li, K. Pattabiraman, S. K. S. Hari, M. Sullivan, and T. Tsai. Modelingsoft-error propagation in programs. In Proceedings of the InternationalConference on Dependable Systems and Networks (DSN), 2018. → pagesxi, 2, 3, 4, 7, 10, 11, 15, 30, 35, 44, 52[36] A. Mahmoud, S. K. S. Hari, M. B. Sullivan, T. Tsai, and S. W. Keckler.Optimizing software-directed instruction replication for gpu error detection.In Proceedings of the International Conference for High PerformanceComputing, Networking, Storage, and Analysis, SC ’18. IEEE Press, 2018.→ page 2[37] A. Mahmoud, R. Venkatagiri, K. Ahmed, S. Misailovic, D. Marinov, C. W.Fletcher, and S. V. Adve. Minotaur: Adapting software testing techniquesfor hardware errors. In Proceedings of the Twenty-Fourth InternationalConference on Architectural Support for Programming Languages andOperating Systems, ASPLOS ’19, page 1087–1103, New York, NY, USA,2019. Association for Computing Machinery. ISBN 9781450362405.doi:10.1145/3297858.3304050. URLhttps://doi.org/10.1145/3297858.3304050. → page 16[38] G. B. Mathews. On the partition of numbers. Proceedings of the LondonMathematical Society, 1(1):486–490, 1896. → page 44[39] D. Mikushin. Enabling on-the-fly manipulations with llvm ir code of cudasources. https://github.com/apc-llc/nvcc-llvm-ir. [Accessed Nov. 2018]. →page 32[40] Ming Zhang and N. R. Shanbhag. A CMOS design style for logic circuithardening. In 2005 IEEE International Reliability Physics Symposium,2005. Proceedings. 43rd Annual., pages 223–229, April 2005.doi:10.1109/RELPHY.2005.1493088. → page 1[41] B. Nie, L. Yang, A. Jog, and E. Smirni. Fault site pruning for practicalreliability analysis of gpgpu applications. In International Symposium onMicroarchitecture (MICRO), 2018, pages 749–761. IEEE, 2018. → pages8, 16, 17, 23, 34, 4257[42] NVIDIA. Nvcc. https://developer.nvidia.com/cuda-llvm-compiler. → page32[43] N. Oh, P. P. Shirvani, and E. J. McCluskey. Control-flow checking bysoftware signatures. Transactions on Reliability, 51(1):111–122, 2002. →page 9[44] D. Oliveira, V. Frattin, P. Navaux, I. Koren, and P. Rech. Carol-fi: anefficient fault-injection tool for vulnerability evaluation of modern hpcparallel accelerators. pages 295–298, 05 2017.doi:10.1145/3075564.3075598. → page 52[45] L. Palazzi, G. Li, B. Fang, and K. Pattabiraman. A tale of two injectors:End-to-end comparison of ir-level and assembly-level fault injection. pages151–162, 10 2019. doi:10.1109/ISSRE.2019.00024. → pages 9, 33, 42[46] K. Pattabiraman, N. 
Nakka, Z. Kalbarczyk, and R. Iyer. SymPLFIED:Symbolic program-level fault injection and error detection framework. InProceedings of the International Conference on Dependable Systems andNetworks (DSN), pages 472–481. IEEE, 2008. → page 15[47] F. G. Previlon, C. Kalra, D. R. Kaeli, and P. Rech. Evaluating the impact ofexecution parameters on program vulnerability in gpu applications. In 2018Design, Automation & Test in Europe Conference & Exhibition (DATE),pages 809–814. IEEE, 2018. → pages 17, 47, 48[48] V. J. Reddi, M. S. Gupta, M. D. Smith, G.-y. Wei, D. Brooks, andS. Campanoni. Software-assisted hardware reliability: abstractingcircuit-level challenges to the software stack. In Design AutomationConference, pages 788–793. IEEE, 2009. → pages 16, 43[49] B. Sangchoolie, K. Pattabiraman, and J. Karlsson. One bit is (not) enough:An empirical study of the impact of single and multiple bit-flip errors. In2017 47th Annual IEEE/IFIP International Conference on DependableSystems and Networks (DSN), pages 97–108, 2017. → page 8[50] M. Snir, R. W. Wisniewski, J. A. Abraham, S. V. Adve, S. Bagchi, P. Balaji,J. Belak, P. Bose, F. Cappello, B. Carlson, et al. Addressing failures inexascale computing. Institute for Computing in Science (ICiS). More infor,4:11, 2012. → pages 1, 2[51] V. Sridharan and D. R. Kaeli. Eliminating microarchitectural dependencyfrom architectural vulnerability. In 15th International Symposium on HighPerformance Computer Architecture. → pages 2, 15, 1758[52] Student. The probable error of a mean. Biometrika, pages 1–25, 1908. →page 36[53] J. Tan, N. Goswami, T. Li, and X. Fu. Analyzing soft-error vulnerability ongpgpu micro architecture. In IEEE International Symposium on WorkloadCharacterization (IISWC), 2011 IEEE International Symposium on, pages226–235. IEEE, 2011. → pages 15, 34[54] C. Wang, F. Mueller, C. Engelmann, and S. L. Scott. Hybrid checkpointingfor mpi jobs in hpc environments. In 2010 IEEE 16th InternationalConference on Parallel and Distributed Systems, pages 524–533, 2010. →page 51[55] S. G. L. L. R. Z. J. T. Xiaohui Wei, Hengshan Yue. G-seap: Analyzing andcharacterizing soft-error aware approximation in gpgpus. In FutureGeneration Computer Systems (2020), 2020. → pages 8, 15, 34[56] K. S. Yim, C. Pham, M. Saleheen, Z. Kalbarczyk, and R. Iyer.Hauberk:lightweight silent data corruption error detector for gpgpu. InInternational Parallel & Distributed Processing Symposium (IPDPS), 2011,page 287. IEEE, 2011. → pages 14, 1759
