UBC Theses and Dissertations

Efficient modeling of error propagation in GPU programs

Anwer, Abdul Rehman

Abstract

Graphics Processing Units (GPUs) are popular for reliability-conscious uses in High Performance Computing (HPC), machine learning algorithms, and safety-critical applications. Fault injection (FI) techniques are generally used to determine the reliability profiles of programs in the presence of soft errors. However, these techniques are highly resource- and time-intensive. GPU applications are highly multi-threaded and typically execute hundreds of thousands of threads, which makes it challenging to apply FI techniques. Prior research developed a model called TRIDENT to analytically predict the probability of Silent Data Corruption (SDC), i.e., incorrect output without any indication, in single-threaded CPU applications, without requiring any FIs. Unfortunately, TRIDENT is incompatible with GPU programs, due to their high degree of parallelism and different memory architectures compared to CPU programs. The main challenge is that modeling error propagation across thousands of threads in a GPU kernel requires enormous amounts of data to be profiled and analyzed, posing a major scalability bottleneck for HPC applications. Further, there are GPU-specific behaviors that must be modeled for accuracy. In this thesis, we propose GPU-TRIDENT, an accurate and scalable technique for modeling error propagation in GPU programs. Our key insight is that error propagation across threads can be modeled based on program execution patterns. These can be characterized by the control-flow, loop iteration, data, and thread block patterns of the GPU program. We also identify two major sources of inaccuracy in building analytical models of error propagation and mitigate them to improve accuracy. We find that GPU-TRIDENT can accurately predict the SDC probabilities of both overall GPU programs and individual instructions, and is two orders of magnitude faster than FI-based approaches.
We also demonstrate that GPU-TRIDENT can guide selective instruction duplication to protect GPU programs, similar to FI. Finally, we deploy GPU-TRIDENT to assess the input dependence of the reliability of GPU kernels and find that the SDC probability of kernels is generally insensitive to variation in inputs.

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International