Open Collections

UBC Theses and Dissertations

Understanding and modeling error propagation in programs Li, Guanpeng 2019

Full Text

Understanding and Modeling Error Propagation in Programs

by

Guanpeng Li

Bachelor of Applied Science, University of British Columbia, 2014

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Doctor of Philosophy

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Electrical and Computer Engineering)

The University of British Columbia
(Vancouver)

February 2019

© Guanpeng Li, 2019

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Understanding and Modeling Error Propagation in Programs

submitted by Guanpeng Li in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical and Computer Engineering.

Examining Committee:

Karthik Pattabiraman, Electrical and Computer Engineering
Supervisor

Matei Ripeanu, Electrical and Computer Engineering
Supervisory Committee Member

Ali Mesbah, Electrical and Computer Engineering
Supervisory Committee Member

Ronald Garcia, Computer Science
University Examiner

Sudip Shekhar, Electrical and Computer Engineering
University Examiner

Abstract

Hardware errors are projected to increase in modern computer systems due to shrinking feature sizes and increasing manufacturing variations. The impact of hardware faults on programs can be catastrophic, and can lead to substantial financial and societal consequences. Error propagation is often the leading cause of catastrophic system failures, and hence must be mitigated. Traditional hardware-only techniques to avoid error propagation are energy hungry, and hence not suitable for modern computer systems (i.e., commodity systems). Researchers have proposed selective software-based protection techniques to prevent error propagation at lower costs. However, these techniques use expensive fault injection simulations to determine which parts of a program must be protected. Fault injection simulation artificially introduces a fault into a program's execution and observes failures (if any) upon the completion of the execution. Thousands of such simulations need to be performed in order to achieve statistical significance, which is time-consuming, as even a single execution of a common application may take a long time. In this dissertation, I first characterize the error propagation that leads to different types of failures, and then propose both empirical and analytical approaches to identify and mitigate error propagation without expensive fault injections. The key observation is that only a small fraction of program states are responsible for almost all error propagation, and that the propagation falls into identifiable patterns which can be modeled efficiently. The proposed techniques measure the failure rates of programs nearly as accurately as fault injection, while being orders of magnitude faster. This allows developers to build low-cost fault-tolerant applications in an extremely efficient manner.

Lay Summary

Transient hardware faults are becoming more and more prevalent. They often lead to error propagation in programs, which may have serious societal and financial impact. Protection techniques such as hardware duplication were used in the past, but they incur huge overheads in performance and energy consumption. Researchers therefore expect modern software to tolerate hardware faults in a low-cost and flexible manner. Studies have found that only a small fraction of program states is responsible for almost all error propagation.
If those states can be identified and protected, we can improve the reliability of computer systems at low cost. In this thesis, we characterize error propagation in programs, and propose models to identify the vulnerability of program states using static and dynamic analysis techniques.

Preface

This thesis is the result of work carried out by me, in collaboration with my advisor (Dr. Karthik Pattabiraman), my colleague (Qining Lu), and other research scientists from IBM and NVIDIA. Chapters 4, 5, 6, 7, and 8 are based on work published in the DSN and SC conferences. In each work, I was responsible for writing the paper, designing approaches (where applicable), conducting experiments, analyzing data, and evaluating the results. Other collaborators were responsible for editing and writing portions of the manuscripts, and for providing guidance and feedback.

Below are the publication details for each chapter:

• Chapter 4
— Guanpeng Li, Qining Lu and Karthik Pattabiraman, "Fine-grained Characterization of Long Latency Causing Crashes in Programs", IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2015, 250-261. [88]

• Chapter 5
— Guanpeng Li, Karthik Pattabiraman, Siva Kumar Sastry Hari, Michael Sullivan and Timothy Tsai, "Modeling Soft-Error Propagation in Programs", IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2018, 27-38. [59]

• Chapter 6
— Guanpeng Li and Karthik Pattabiraman, "Modeling Input-Dependent Error Propagation in Programs", IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2018, 279-290. [87]

• Chapter 7
— Guanpeng Li, Karthik Pattabiraman, Chen-Yong Cher and Pradip Bose, "Understanding Error Propagation in GPGPU Applications", IEEE International Conference for High-Performance Computing, Storage and Networking (SC), 2016, 240-251. [90]

• Chapter 8
— Guanpeng Li, Siva Kumar Sastry Hari, Michael Sullivan, Timothy Tsai, Karthik Pattabiraman, Joel Emer, and Stephen W. Keckler, "Understanding Error Propagation in Deep-Learning Neural Networks (DNN) Accelerators and Applications", ACM International Conference for High-Performance Computing, Storage and Networking (SC), 2017, 8:1-8:12. [91]

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
1.1 Contributions
2 Related Work
2.1 Error Propagation in CPU Programs
2.1.1 Long-Latency Crash (LLC)
2.1.2 Silent Data Corruption (SDC)
2.2 Error Propagation in Hardware Accelerator Applications
2.2.1 GPU Applications
2.2.2 DNN Accelerators and Applications
3 Background
3.1 Fault Model
3.2 Terms and Definitions
3.3 LLVM Compiler
3.4 GPU Architecture and Programming Model
3.5 DNN Accelerators and Applications
3.5.1 Deep Learning Neural Networks
3.5.2 DNN Accelerator
3.5.3 Consequences of Soft Errors
4 Fine-Grained Characterization of Faults Causing Long Latency Crashes in Programs
4.1 Introduction
4.2 Why bound the crash latency?
4.3 Initial Fault Injection Study
4.3.1 Fault Injection Experiment
4.3.2 Fault Injection Results
4.3.3 Code Patterns that Lead to LLCs
4.4 Approach
4.4.1 Phase 1: Static Analysis (CRASHFINDER STATIC)
4.4.2 Phase 2: Dynamic Analysis (CRASHFINDER DYNAMIC)
4.4.3 Phase 3: Selective Fault Injections
4.5 Implementation
4.6 Experimental Setup
4.6.1 Benchmarks
4.6.2 Research Questions
4.6.3 Experimental Methodology
4.7 Results
4.7.1 Performance (RQ1)
4.7.2 Precision (RQ2)
4.7.3 Recall (RQ3)
4.7.4 Efficacy of Heuristics (RQ4)
4.8 Discussion
4.8.1 Implication for Selective Protection
4.8.2 Implication for Checkpointing Techniques
4.8.3 Limitations and Improvements
4.9 Summary
5 Modeling Soft-Error Propagation in Programs
5.1 Introduction
5.2 The Challenge
5.3 TRIDENT
5.3.1 Inputs and Outputs
5.3.2 Overview and Insights
5.3.3 Details: Static-Instruction Sub-Model (fs)
5.3.4 Details: Control-Flow Sub-Model (fc)
5.3.5 Details: Memory Sub-Model (fm)
5.4 Evaluation
5.4.1 Experimental Setup
5.4.2 Accuracy
5.4.3 Scalability
5.5 Use Case: Selective Instruction Duplication
5.6 Discussion
5.6.1 Sources of Inaccuracy
5.6.2 Threats to Validity
5.6.3 Comparison with ePVF and PVF
5.7 Summary
6 Modeling Input-Dependent Error Propagation in Programs
6.1 Introduction
6.2 Volatilities and SDC
6.3 Initial FI Study
6.3.1 Experiment Setup
6.3.2 Results
6.4 Modeling INSTRUCTION-SDC-VOLATILITY
6.4.1 Drawbacks of TRIDENT
6.4.2 VTRIDENT
6.5 Evaluation of VTRIDENT
6.5.1 Accuracy
6.5.2 Performance
6.6 Bounding Overall SDC Probabilities with VTRIDENT
6.7 Discussion
6.7.1 Sources of Inaccuracy
6.7.2 Implication for Mitigation Techniques
6.8 Summary
7 Understanding Error Propagation in GPGPU Applications
7.1 Introduction
7.2 GPU Fault Injector
7.2.1 Design Overview
7.2.2 Error Propagation Analysis (EPA)
7.2.3 Limitations
7.3 Metrics for Error Propagation
7.3.1 Execution Time
7.3.2 Memory States
7.4 Experimental Setup
7.4.1 Benchmarks Used
7.4.2 Fault Injection Method
7.4.3 Hardware and Software Platform
7.4.4 Research Questions (RQs)
7.5 Results
7.5.1 Aggregate Fault Injections
7.5.2 Error Propagation
7.5.3 Error Spreading
7.5.4 Fault Masking
7.5.5 Crash-causing Faults
7.5.6 Platform Differences
7.6 Implications
7.7 Summary
8 Understanding Error Propagation in Deep Learning Neural Network (DNN) Accelerators and Applications
8.1 Introduction
8.2 Exploration of Design Space
8.3 Experimental Setup
8.3.1 Networks
8.3.2 DNN Accelerators
8.3.3 Fault Model
8.3.4 Fault Injection Simulation
8.3.5 Data Types
8.3.6 Silent Data Corruption (SDC)
8.3.7 FIT Rate Calculation
8.4 Characterization Results
8.4.1 Datapath Faults
8.4.2 Buffer Faults: A Case Study on Eyeriss
8.5 Mitigation of Error Propagation
8.5.1 Implications to Resilient DNN Systems
8.5.2 Symptom-based Error Detectors (SED)
8.5.3 Selective Latch Hardening (SLH)
8.6 Summary
9 Conclusion
9.1 Summary
9.2 Expected Impact
9.3 Future Work
Bibliography

List of Tables

Table 3.1 Data Reuses in DNN Accelerators
Table 4.1 Characteristics of Benchmark Programs
Table 4.2 Comparison of Instructions Given by CRASHFINDER and CRASHFINDER STATIC
Table 5.1 Characteristics of Benchmarks
Table 5.2 p-values of T-test Experiments in the Prediction of Individual Instruction SDC Probability Values (p > 0.05 indicates that we are not able to reject our null hypothesis; the counter-cases are shown in bold)
Table 6.1 Characteristics of Benchmarks
Table 7.1 Characteristics of GPGPU Programs in our Study
Table 7.2 SDCs that occur in the different memory types
Table 7.3 Percentage of Benign Faults Measured at the First Kernel Invocation after Fault Injection
Table 7.4 Size of OM, RM and TM
Table 8.1 Networks Used
Table 8.2 Data Reuses in DNN Accelerators
Table 8.3 Data types used
Table 8.4 Value range for each layer in different networks in the error-free execution
Table 8.5 Percentage of bit-wise SDC across layers in AlexNet using FLOAT16 (Error bar is from 0.2% to 0.63%)
Table 8.6 Datapath FIT rate in each data type and network
Table 8.7 Parameters of microarchitectures in Eyeriss (Assuming 16-bit data width, and a scaling factor of 2 for each technology generation)
Table 8.8 SDC probability and FIT rate for each buffer component in Eyeriss (SDC probability / FIT Rate)
Table 8.9 Hardened latches used in design space exploration

List of Figures

Figure 1.1 Types of Failures
Figure 2.1 High-Level Organization of Chapter 2
Figure 3.1 Architecture of general DNN accelerator
Figure 3.2 Example of SDC that could lead to collision in self-driving cars due to soft errors
Figure 4.1 Long Latency Crash and Checkpointing
Figure 4.2 Aggregate Fault Injection Results across Benchmarks
Figure 4.3 Latency Distribution of Crash-Causing Errors in Programs: The purple bars represent the LLCs as they have a crash latency of more than 1000 instructions. The number shown at the top of each bar shows the percentage of crashes that resulted in LLCs. The error bars for LLCs range from 0% (cutcp) to 1.85% (sjeng).
Figure 4.4 Distribution of LLC Categories across 5 Benchmarks (sjeng, libquantum, hmmer, h264ref and mcf). Three dominant categories account for 95% of the LLCs.
Figure 4.5 Code examples showing the three kinds of LLCs that occurred in our experiments.
Figure 4.6 Workflow of CRASHFINDER
Figure 4.7 Dynamic sampling heuristic. (a) Example source code (ocean program), (b) Execution trace and sample candidates.
Figure 4.8 Orders of Magnitude of Time Reduction by CRASHFINDER STATIC and CRASHFINDER compared to exhaustive fault injections
Figure 4.9 Precision of CRASHFINDER STATIC and CRASHFINDER for finding LLCs in the program
Figure 4.10 Recall of CRASHFINDER STATIC and CRASHFINDER
Figure 5.1 Development of Fault-Tolerant Applications
Figure 5.2 Running Example
Figure 5.3 NLT and LT Examples of the CFG
Figure 5.4 Examples for Memory Sub-model
Figure 5.5 Overall SDC Probabilities Measured by FI and Predicted by the Three Models (Margin of Error for FI: ±0.07% to ±1.76% at 95% Confidence)
Figure 5.6 Computation Spent to Predict SDC Probability
Figure 5.7 Time Taken to Derive the SDC Probabilities of Individual Instructions in Each Benchmark
Figure 5.8 SDC Probability Reduction with Selective Instruction Duplication at 11.78% and 23.31% Overhead Bounds (Margin of Error: ±0.07% to ±1.76% at 95% Confidence)
Figure 5.9 Overall SDC Probabilities Measured by FI and Predicted by TRIDENT, ePVF and PVF (Margin of Error: ±0.07% to ±1.76% at 95% Confidence)
Figure 6.1 OVERALL-SDC-VOLATILITY Calculated by INSTRUCTION-EXECUTION-VOLATILITY Alone (Y-axis: OVERALL-SDC-VOLATILITY, Error Bar: 0.03% to 0.55% at 95% Confidence)
Figure 6.2 Patterns Leading to INSTRUCTION-SDC-VOLATILITY
Figure 6.3 Workflow of VTRIDENT
Figure 6.4 Example of Memory Pruning
Figure 6.5 Memory Dependency Pruning in TRIDENT
Figure 6.6 Distribution of INSTRUCTION-SDC-VOLATILITY Predictions by VTRIDENT Versus Fault Injection Results (Y-axis: Percentage of instructions, Error Bar: 0.03% to 0.55% at 95% Confidence)
Figure 6.7 Accuracy of VTRIDENT in Predicting INSTRUCTION-SDC-VOLATILITY Versus FI (Y-axis: Accuracy)
Figure 6.8 OVERALL-SDC-VOLATILITY Measured by FI and Predicted by VTRIDENT, and by INSTRUCTION-EXECUTION-VOLATILITY Alone (Y-axis: OVERALL-SDC-VOLATILITY, Error Bar: 0.03% to 0.55% at 95% Confidence)
Figure 6.9 Speedup Achieved by VTRIDENT over TRIDENT. Higher numbers are better.
Figure 6.10 Bounds of the Overall SDC Probabilities of Programs (Y-axis: SDC Probability; X-axis: Program Input; Solid Lines: Bounds derived by VTRIDENT; Dashed Lines: Bounds derived by INSTRUCTION-EXECUTION-VOLATILITY alone, Error Bars: 0.03% to 0.55% at 95% Confidence). Triangles represent FI results.
Figure 7.1 Example of the error propagation code inserted by LLFI-GPU
Figure 7.2 Code Example of a Kernel Cycle from Benchmark bfs
Figure 7.3 Memory State Layout for CUDA Programming Model (K2 and K3 are kernel invocations)
Figure 7.4 Aggregate Fault Injection Results across the 12 Programs
Figure 7.5 Detection Latency of Faults that Result in SDCRM
Figure 7.6 Percentage of RM Updated by Each Kernel Invocation. Y-axis is the percentage of RM locations that are updated during each kernel invocation. X-axis represents timeline in terms of kernel invocations.
Figure 7.7 Percentage of TM and RM Contaminated at Each Kernel Invocation. Y-axis is the percentage of contaminated memory locations, X-axis is timeline in terms of kernel invocations. Blue lines indicate TM, and red lines represent RM.
Figure 7.8 Code Structure Leading to Extensive Error Spread in lud
Figure 7.9 Examples of Fault Masking. (a) Comparison, (b) Truncation
Figure 8.1 Architecture of general DNN accelerator
Figure 8.2 SDC probability for different data types in different networks (for faults in PE latches)
Figure 8.3 SDC probability variation based on bit position corrupted; bit positions not shown have zero SDC probability (Y-axis is SDC probability and X-axis is bit position)
Figure 8.4 Values before and after error occurrence in AlexNet using FLOAT16
Figure 8.5 SDC probability per layer using FLOAT16 (Y-axis is SDC probability and X-axis is layer position)
Figure 8.6 Euclidean distance between the erroneous values and correct values of all ACTS at each layer of networks using DOUBLE (Y-axis is Euclidean distance and X-axis is layer position; faults are injected at layer 1)
Figure 8.7 Precision and recall of the symptom-based detectors across networks (Error bar is from 0.03% to 0.04%)
Figure 8.8 Selective Latch Hardening for the Eyeriss Accelerator running AlexNet

Acknowledgments

First of all, I would like to express my sincere gratitude to my doctoral advisor, Dr. Karthik Pattabiraman, for his support over the past years. If it were not for his continuous investment in me, I would not have been able to complete this thesis. Karthik's advice on my career has been invaluable and has helped me grow as a researcher.

Besides my advisor, I would like to thank my thesis committee members and examiners, who gave me valuable suggestions and helped me improve my thesis.
I would also like to give my special thanks to Prof. Matei Ripeanu, who has provided me with feedback and advice over the past years, and sat through many practice talks I have given.

In the end, thanks to all my friends and lab mates for making these years as joyful as possible. I also want to thank my family, who have supported me unconditionally in my study and career.

Chapter 1

Introduction

Transient hardware faults (i.e., soft errors) are predicted to increase in future computer systems [20, 40]. These faults are caused by high-energy particles passing through transistors, causing them to accumulate or lose charge, eventually leading to single-bit flips in logic values. As a result, such faults are also known as Single Event Upsets (SEUs). Due to growing system scale, progressive technology scaling, and lowering operating voltages, even a small amount of charge accumulated or lost in a circuit can cause an SEU in modern processors [20, 40]. Consequently, computer systems have become more and more vulnerable to hardware faults. In the past, only highly reliable systems (e.g., aerospace applications) required protection from SEUs, but nowadays even commodity systems (e.g., consumer products) need to be protected due to the increasing error rates [21, 95, 120].

SEUs in hardware often result in error propagation in programs, which may lead to catastrophic outcomes. For example, Amazon's S3 service went down for a few hours in 2008, causing a financial loss of millions of dollars for the company [10]. As reported by Amazon, the incident was very likely caused by an SEU in their hardware that propagated as an error through the software [10]. Until a few years ago, error propagation was mostly masked through hardware-only solutions such as dual or triple modular redundancy and circuit hardening. However, these techniques are becoming increasingly challenging to deploy as they consume significant amounts of energy, and energy is becoming a first-class constraint in processor design, especially in commodity systems [21]. On the other hand, software-implemented protection techniques are more flexible and cost-effective, as they can selectively protect the program states of interest and do not require any expensive hardware modifications [95, 102, 120]. Therefore, researchers have postulated that future processors will expose hardware faults to the software and expect the software to tolerate the faults at low cost [95, 102, 120].

Studies have shown that only a small fraction of program states are responsible for almost all the error propagation leading to catastrophic failures [52, 124], and so one can selectively protect these states to meet the reliability target of the application while incurring lower energy and performance overheads than full duplication techniques [64, 97]. Therefore, in the development of fault-tolerant applications, it is important to understand which states in the program are vulnerable to which kinds of failures caused by SEUs, and to identify those vulnerable states so that they can be protected at low cost.

Fault Injection (FI) has been commonly employed to characterize error propagation and identify which program states are vulnerable to failures. FI involves perturbing the program state to emulate the effect of a hardware fault and executing the program to completion to determine if the fault caused a certain failure [64, 147].
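The injection tooling used in this dissertation operates at the LLVM IR level (Section 3.3). Purely as a rough, self-contained illustration of the FI idea, and not of that tooling, the following C++ sketch flips one randomly chosen bit in one intermediate value of a toy computation, runs the computation to completion, and classifies each trial as benign or SDC by comparing against the fault-free (golden) output. The program under test, the injection site, and the trial count are hypothetical; classifying crashes would additionally require running each trial in a separate process and catching hardware exceptions.

```cpp
// Rough illustration of a fault-injection (FI) campaign: flip a single bit
// in one intermediate value and compare the final output with a golden run.
// This is a toy sketch, not the LLVM-based injectors used in this thesis.
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <random>

// The "program under test": sum of squares of the inputs.
static double golden_run(const double* in, int n) {
  double acc = 0.0;
  for (int i = 0; i < n; ++i) acc += in[i] * in[i];
  return acc;
}

// Same computation, but with a single-bit flip injected into the intermediate
// product of dynamic iteration `inject_at` (emulating an SEU in a register).
static double faulty_run(const double* in, int n, int inject_at, int bit) {
  double acc = 0.0;
  for (int i = 0; i < n; ++i) {
    double prod = in[i] * in[i];
    if (i == inject_at) {                  // fault activation point
      uint64_t raw;
      std::memcpy(&raw, &prod, sizeof raw);
      raw ^= (1ULL << bit);                // the single-event upset
      std::memcpy(&prod, &raw, sizeof raw);
    }
    acc += prod;
  }
  return acc;
}

int main() {
  const int n = 1024, trials = 10000;      // hypothetical campaign size
  double in[n];
  for (int i = 0; i < n; ++i) in[i] = 0.001 * i;
  const double golden = golden_run(in, n);

  std::mt19937 rng(42);
  std::uniform_int_distribution<int> pick_site(0, n - 1), pick_bit(0, 63);
  int sdc = 0, benign = 0;
  for (int t = 0; t < trials; ++t) {
    double out = faulty_run(in, n, pick_site(rng), pick_bit(rng));
    if (out == golden) ++benign;           // fault was masked
    else ++sdc;                            // output mismatch: SDC
  }
  std::printf("Benign: %d  SDC: %d  (SDC probability ~ %.1f%%)\n",
              benign, sdc, 100.0 * sdc / trials);
}
```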
However, real-world programs may consist of billions of dynamic instructions, and even a single execution of the program may take a long time. Performing thousands of FIs to obtain statistically meaningful results for each instruction takes too much time to be practical [65, 124]. As a result, FI is mostly performed offline to characterize application resilience; it is too slow to be deployed as an online evaluation method when developing fault-tolerant applications in a fast software development cycle. In this dissertation, we leverage FI as a basic approach to characterize error propagation in programs, and use the results to understand how different errors propagate in the program and why. Based on the empirical observations, we propose models and heuristics that analyze error propagation and guide the protection of programs with few to no FIs.

As Figure 1.1 shows, once a fault occurs and is read by the executing program, the fault is activated and becomes an error. The error then starts propagating in the program, leading to one of three outcomes: (1) Benign: the error is masked during propagation, and the program finishes its execution without any anomaly. (2) Crash: the error triggers a hardware exception (e.g., an illegal memory access) and causes the program to terminate before it finishes its execution. (3) Silent Data Corruption (SDC): the error propagates to the program output, modifying it from the correct output. Among the three outcomes, benign, as its name suggests, is not considered harmful, as the program successfully completes its execution without any anomaly. Crashes and SDCs are both regarded as failures, depending on the reliability target and the application. Therefore, they are the failure types we focus on in this dissertation.

Figure 1.1: Types of Failures

Crashes are usually considered detectable failures because of the early termination of the program execution and the warnings issued by hardware exceptions. The user can then simply restart the program for a correct execution. As a result, researchers have not paid much attention to crashes. However, the underlying assumption is that failures are reported as soon as faults occur, which is also known as the fail-stop assumption [150]. While this assumption holds most of the time, we find that a small number of errors can actually propagate for a long time before causing crashes, and hence violate the fail-stop assumption. We call these long-latency crashes, or LLCs. With an LLC, the error may propagate to the program's state elements, such as checkpoints, and corrupt them before causing any crash, leading to failures in recovery. Studies show that LLCs may drastically reduce the availability of systems, and hence must be identified and mitigated in systems that require high availability [150]. The main challenge in identifying LLCs is that they are typically confined to a small fraction of the program's data, and finding the LLC-causing data through FI often causes search-space explosion. The key observation in this study is that most LLCs are caused by faults occurring in a few code structures that can be identified by static and dynamic program analysis. Based on this observation, we propose heuristics to identify those code structures in order to protect the program from LLCs.

SDCs, on the other hand, are considered the most insidious type of failure because there is no indication that the program output is incorrect.
Studies show that program SDC probabilities are highly application-dependent, varying from less than 1% to more than 50% depending on the program [97, 147]. Therefore, given different reliability targets and applications, SDCs must be evaluated and mitigated on a per-program and per-target basis. As a result, it is critical to accurately estimate the SDC probability of a program, as well as of its individual instructions, for evaluation and mitigation. Unfortunately, current approaches for identifying SDCs, such as FI, are too slow to integrate into fast software development cycles. Other existing approaches for resilience estimation, such as analytical models, suffer from serious inaccuracies because error propagation is complicated and has not been comprehensively understood [51, 100]. The dynamic nature of program execution makes building an accurate model extremely challenging. In this dissertation, we find that almost all the error propagation that leads to SDCs can be abstracted into three levels. We use this insight to build TRIDENT, a compiler-based model that is able to predict the SDC probabilities of both the program and its individual instructions without any FIs.

While the first part of this dissertation investigates error propagation in general applications (e.g., CPU programs), the second part discusses the error resilience of applications that run on hardware accelerators. Hardware accelerators such as graphics processing units (GPUs) and deep learning neural network (DNN) accelerators have found wide adoption due to their massively parallel capability and efficient data-flow. Recently, they have been deployed for running reliability- and safety-critical applications (e.g., scientific applications and self-driving cars), which require strict correctness. Therefore, it is important to understand the resilience of applications running on these accelerators. Unfortunately, most studies focus on the performance aspects of the accelerators, and hence their resilience is not well investigated. Having explored error propagation in general applications, we are interested in error propagation in the applications running on these accelerators, which have a very different structure from general applications. One of the challenges in studying such applications is the lack of capable tools to perform fault injection experiments and analysis. Therefore, we first built tools that are able to perform fault injections for these applications, and used them to characterize their error propagation. We find that the characteristics of their error propagation are very different from what we have seen in CPU programs, because of the different architectures and programming models used in accelerators. Finally, we quantify the resilience of the applications and propose efficient protection techniques based on their unique error propagation characteristics.

1.1 Contributions

Our research contributions are as follows:

• Characterized LLCs in programs, and identified the code patterns that lead to LLCs. Based on the characterization, we developed an automated compiler-based technique, CRASHFINDER, to identify these code patterns in programs. We showed that CRASHFINDER is able to identify more than 90% of the locations that cause LLCs without extensive FIs, and that by protecting the identified locations, most of the LLCs in programs can be eliminated.

• Built an automated model, TRIDENT, that analyzes the SDC probabilities of both whole programs and individual instructions in the program.
Given a program, the model requires no fault injections to estimate the SDC probabilities. We showed that most of the error propagation that leads to SDCs can be described by a combination of three probabilistic modules, which synergistically build on top of each other to model error propagation. We evaluated TRIDENT experimentally, and found that it is able to predict SDC probabilities almost as accurately as FI, while taking only a fraction of the time that FI does, thereby making it possible to integrate it into the software development cycle. We also extended the model to support multiple program inputs, and showed that the extended model is able to bound the SDC probabilities of programs across those inputs.

• Investigated the characteristics of error propagation in applications running on GPUs and DNN accelerators. We first built FI tools and simulators for these applications, and evaluated their error resilience. Based on the observations, we proposed mitigation techniques that are cost-effective for the applications.

The investigation of error propagation in this dissertation allows developers to design more cost-effective error detectors and protection techniques based on the characteristics of different errors, programming models, and hardware platforms. The studies also explain why certain vulnerabilities are observed in certain program structures. Our proposed analytical models allow developers to estimate programs' resilience and selectively protect programs in a fast and accurate way, which represents a fundamental advance in the way fault-tolerant applications are designed. Moreover, the analytical nature of our proposed models explains the relationship between program structure and resilience, providing insights to researchers for designing better fault-tolerant applications in the future.

Chapter 2

Related Work

There have been many papers that investigate error propagation in programs and tolerate hardware faults through software techniques. We organize this chapter based on the structure shown in Figure 2.1. First, we discuss the related work that studies error propagation in CPU programs; more specifically, we are interested in the work that investigates LLCs (Section 2.1.1) and SDCs (Section 2.1.2). Second, we discuss the studies that characterize the resilience of GPU applications (Section 2.2.1) and DNN accelerator applications (Section 2.2.2).

Figure 2.1: High-Level Organization of Chapter 2

2.1 Error Propagation in CPU Programs

2.1.1 Long-Latency Crash (LLC)

In Chapter 4, we develop a technique to automatically find the program locations where LLC-causing faults originate, so that these locations can be protected to bound the program's crash latency. In the past, a few studies have characterized the latency of error propagation in programs. We consider the related work below.

The first, by Gu et al. [9], injected faults into the Linux kernel and found that less than 1% of the errors resulted in LLCs. They further found that many of the severe failures that result in extended downtimes in the system were caused by these LLCs, due to error propagation. The authors give examples of faults that resulted in LLCs, but they do not attempt to categorize the code patterns that were responsible for them. The second study is by Yim et al.
[150], who studied the correlation between LLC-causing errors and the fault location in the program. However, they perform a coarse-grained categorization of the fault locations based on where the data resides (e.g., stack, heap, etc.). Such a coarse-grained categorization is unfortunately not very useful when one wants to protect specific variables or program locations, as protecting the entire stack or heap segment is too expensive. Although they provide some insights into the characteristics of possible LLC-causing errors, they do not develop an automated way to predict which faults would lead to an LLC and which would not. It is also worth noting that neither of the above papers achieves an exhaustive characterization of LLC-causing faults. Rashid et al. [20] built an analytical trace-based model to predict the propagation of intermittent hardware errors in a program. The model can be used to predict the latency of crash-causing faults in the program, and hence find the LLC locations. Using the model, they find that less than 0.5% of faults cause LLCs. While useful, their model requires building the program's Dynamic Dependence Graph (DDG), which can be prohibitively expensive for large programs, as its size is directly proportional to the number of instructions the program executes. Further, they make many simplifying assumptions in their model which may not hold in the real world. Similarly, Lanzaro et al. [84] built an automated tool that analyzes arbitrary memory corruptions based on an execution trace when faults are present in the system. While their technique is useful for analyzing error propagation, it incurs prohibitive overheads, as it requires the entire trace to be captured at runtime. Further, they focus on software faults as opposed to hardware faults. Finally, they do not make any attempt to identify LLC-causing faults. Chandra et al. [29] study program errors that violate the fail-stop model and result in corrupting the data written to permanent storage or communicated to other processes. They find that between 2% and 7% of faults cause such violations, and propose using a transaction-based mechanism to prevent the propagation of these faults. While transaction-based techniques are useful, they require significant software engineering effort, as the application needs to be rewritten to use transactions. This is very difficult for most commodity systems. In contrast to the above techniques, our technique identifies the specific program locations that result in LLCs, and can hence support fine-grained protection. Further, it uses predominantly static analysis coupled with dynamic analysis and a selective fault injection experiment, making it highly scalable and efficient compared to the above approaches. Finally, our technique, CRASHFINDER, does not require any programmer intervention or application rewriting, and is hence practical to deploy on existing software.

2.1.2 Silent Data Corruption (SDC)

In Chapter 5, we construct a three-level model, TRIDENT, that captures error propagation at the static data dependency, control-flow, and memory levels, based on observations of error propagation in programs.
TRIDENT is implemented as part of a compiler, and can predict both the overall SDC probability of a given program and the SDC probabilities of its individual instructions, without fault injections.

There is a significant body of work on identifying error propagation that leads to SDCs, either through FI [42, 58, 65, 66, 90, 147] or through modeling error propagation in programs [51, 52, 130]. The main advantage of FI is that it is simple, but it has limited predictive power. Further, its long running time often prevents the FI approach from deriving program vulnerabilities at finer granularity (e.g., SDC probabilities of individual instructions). The main advantage of modeling techniques is that they have predictive power and are significantly faster, but existing techniques suffer from poor accuracy due to important gaps in the models. The main question we answer in the TRIDENT project is whether we can combine the advantages of the two approaches by constructing a model that is both accurate and scalable. Shoestring [52] was one of the first papers to attempt to model the resilience of instructions without using fault injection. Shoestring stops tracing error propagation after control-flow divergence, and assumes that any fault that propagates to a store instruction leads to an SDC. Hence, it is similar to removing fc and fm in TRIDENT and considering only fs, which we show is not very accurate. Further, Shoestring does not quantify the SDC probabilities of programs and instructions, unlike TRIDENT. Gupta et al. [61] investigate the resilience characteristics of different failures in large-scale systems. However, they do not propose automated techniques to predict failure rates. Lu et al. [97] and Li et al. [88] identify vulnerable instructions by characterizing different features of instructions in programs. While they develop efficient heuristics for finding vulnerable instructions in programs, their techniques do not quantify error propagation, and hence cannot accurately pinpoint the SDC probabilities of individual instructions. Sridharan et al. [100] introduce PVF, an analytical model which eliminates the microarchitectural dependency from architectural vulnerability to approximate the SDC probabilities of programs. While the model requires no FIs and is hence fast, it has poor accuracy in determining SDC probabilities, as it does not distinguish between crashes and SDCs. Fang et al. [51] introduce ePVF, which derives tighter bounds on SDC probabilities than PVF by omitting crash-causing faults from the prediction of SDCs. However, both techniques focus on modeling the static data dependencies of instructions, and do not consider error propagation beyond control-flow divergence, which leads to large gaps in their predictions of SDCs (as we show in Chapter 5).

Furthermore, we study the variation of SDC probabilities across different inputs of a program, and identify the reasons for the variations in Chapter 6. Based on the observations, we propose a model, VTRIDENT, which predicts the variations in programs' SDC probabilities without any FIs, for a given set of inputs.

There has been little work investigating error propagation behaviours across different inputs of a program. Czek et al. [43] were among the first to model the variability of failure rates across program inputs.
They decompose program executions into smaller unit blocks (i.e., instruction mixes), and use the volatility of their dynamic footprints to predict the variation of failure rates, treating the error propagation probabilities as constants in their unit blocks across different inputs. Their assumption is that similar executions (of the unit blocks) result in similar error propagation, so the propagation probabilities within the unit blocks do not change across inputs. Thus, their model is equivalent to considering just the execution volatility of the program (Section 6.2), which is not very accurate, as we show in Section 6.3.

Folkesson et al. [54] investigate the variability of the failure rates of a single program (Quicksort) with different inputs. They decompose the variability into the execution profile and the data usage profile. The latter requires the identification of critical data and its usage within the program, and it is not clear how this is done. They consider limited propagation of errors across basic blocks, but not within a single block. This results in their model significantly underpredicting the variation of error propagation. Finally, it is difficult to generalize their results, as they consider only one (small) program.

Di Leo et al. [46] investigate the distribution of failure types under hardware faults when the program is executed with different inputs. However, their study focuses on measuring the volatility in SDC probabilities, rather than on predicting it. They also attempt to cluster the variations and correlate the clusters with the program's execution profile. However, they do not propose a model to predict the variations, nor do they consider sources of variation beyond the execution profile; again, this is similar to using only the execution volatility to explain the variation of SDC probabilities. Tao et al. [139] propose efficient detection and recovery mechanisms for iterative methods across different inputs. Mahmoud et al. [98] leverage software testing techniques to explore input dependence for approximate computing. However, neither of these works focuses on hardware faults in generic programs. Gupta et al. [61] measure the failure rate in large-scale systems with multiple program inputs over a long period, but they do not propose techniques to bound the failure rates. In contrast, our work investigates the root causes behind SDC volatility under hardware faults, and proposes a model to bound it in an accurate and scalable fashion.

Other papers that investigate error propagation confine their studies to a single input of each program. For example, Hari et al. [65, 66] group similar executions and choose representative ones for FI to predict SDC probabilities given a single input of each program. Li et al. [88] find patterns of executions to prune the FI space when computing the probability of long-latency propagating crashes. Lu et al. [97] characterize the error resilience of different code patterns in applications, and provide configurable protection based on the evaluation of instruction SDC probabilities. Feng et al. [52] propose a modeling technique to identify likely SDC-causing instructions. Our prior work, TRIDENT [59], which VTRIDENT is based on, also restricts itself to single inputs. These papers all investigate program error resilience characteristics based on static and dynamic analysis, without large-scale FI.
However, their characterizations are based on observations derived from a single input of each program, and hence their results may be inaccurate for other inputs.

2.2 Error Propagation in Hardware Accelerator Applications

2.2.1 GPU Applications

In Chapter 7, we perform an empirical study to understand and characterize error propagation in GPU applications. We build a compiler-based fault-injection tool for GPU applications to track error propagation, and define metrics to characterize propagation in GPU applications. We consider the following related work in the area of GPU fault injection and fault-tolerance techniques.

Yim et al. [151] built one of the first fault injectors for GPGPU applications. However, their injector operates at the source-code level, and only considers faults that are visible at the source level. Fang et al. [50] designed a GPGPU fault injector, GPU-Qin, that operates on the CUDA assembly code (SASS). They use the CUDA debugger (CUDA-gdb) to inject faults, which takes significantly longer than performing fault injections by code instrumentation (see Chapter 7). In follow-up work, Hari et al. [63] built SASSIFI, a GPGPU fault injector that transforms the SASS code of the program to inject faults, similar to LLFI-GPU. Both GPU-Qin and SASSIFI operate at the SASS assembly-code level, which makes it difficult to map the faults back to the source code. In contrast, LLFI-GPU operates at the LLVM IR level, which is much closer to the program's source code. This makes it possible to perform program analysis and to map the fault injection results back to the source code, thereby helping programmers make their code error resilient.

There have been a few papers on building fault-tolerance techniques for GPGPU platforms. Jeon et al. [76] duplicated kernel threads that have the same input to detect errors. Dimitrov et al. [48] leverage both instruction-level and thread-level parallelism to duplicate application code. Tan et al. [137] proposed an analytical method to evaluate the error resilience of GPU platforms. Pena et al. [110] explored low-cost data protection and recovery mechanisms for GPGPU platforms based on API interception. Finally, Yim et al. [151] proposed a technique to detect errors by duplicating code within the loop bodies of GPGPU programs. While these are useful, none of the above papers study error propagation in GPGPU programs, which is our focus.

2.2.2 DNN Accelerators and Applications

In Chapter 8, we experimentally evaluate the resilience characteristics of DNN systems (i.e., DNN software running on specialized accelerators). We find that the error resilience of a DNN system depends on the data types, values, data reuses, and types of layers in the design. Based on our observations, we propose two efficient protection techniques for DNN systems. We consider the following related work.

There were a few early studies on the fault tolerance of neural networks [9, 17, 111]. While these studies performed fault injection experiments on neural networks and analyzed the results, the networks they considered had very different topologies and many fewer operations than today's DNNs. Further, these studies were neither conducted in the context of safety-critical platforms such as self-driving cars, nor did they consider DNN accelerators for executing the applications. Hence, they do not provide much insight into error propagation in modern DNN systems. Reis et al. [115] and Oh et al. [106] proposed software techniques that duplicate all instructions to detect soft errors.
Due to the high overheads of these approaches, researchers have investigated selectively targeting the errors that are important to an application. For example, Hari et al. analyzed popular benchmarks and designed application-specific error detectors. Feng et al. [52] proposed a static analysis method to identify vulnerable parts of applications for protection. Lu et al. [97] identified SDC-prone instructions to protect using a decision tree classifier, while Laguna et al. [32] used a machine learning approach to identify sensitive instructions in scientific applications. While these techniques are useful for mitigating soft errors in general-purpose applications, they cannot be easily applied to DNN systems, which implement a different instruction set and operate on a specialized microarchitecture. Many studies [34, 35, 62] proposed novel microarchitectures for accelerating DNN applications. These studies only consider the performance and energy overheads of their designs and do not consider reliability. In recent work, Fernandes et al. [53] evaluated the resilience of histogram-of-oriented-gradients applications for self-driving cars, but they did not consider DNNs. Reagen et al. [113] and Chatterjee et al. [32] explored the energy-reliability limits of DNN systems, but they focused on different networks and fault models. The closest related work is by Temam et al. [140], who investigated the error resilience of neural network accelerators at the transistor level. Our work differs in three aspects: (1) Their work assumed a simple neural network prototype, whereas we investigate modern DNN topologies. (2) Their work does not include any sensitivity study and is limited to the hardware datapath of a single neural network accelerator; our work explores the resilience sensitivity to different hardware and software parameters. (3) Their work focused on permanent faults, rather than soft errors. Soft errors are separately regulated by ISO 26262, and hence we focus on them.

Chapter 3

Background

In this chapter, we first introduce our fault model for studying error propagation in CPU programs. This fault model applies to Chapters 4, 5, 6, and 7; for Chapter 8, we present the fault model in that chapter. We then define the terms used in this thesis and introduce the LLVM compiler, which we use to study error propagation in both CPU and GPU programs. Finally, we summarize the GPU architecture and programming model before discussing DNN accelerators and applications.

3.1 Fault Model

We consider transient hardware faults that occur in the computational elements of the processor, including pipeline registers and functional units. We do not consider faults in the memory or caches, as we assume that these are protected with error correction code (ECC). Likewise, we do not consider faults in the processor's control logic, as we assume that it is protected. Neither do we consider faults in the instructions' encodings. Finally, we assume that the program does not jump to arbitrary illegal addresses due to faults during the execution, as this can be detected by control-flow checking techniques [105]. However, the program may take a faulty legal branch (the execution path is legal, but the branch direction can be wrong due to faults propagating to it). Our fault model is in line with other work in the area [42, 52, 65, 97].

3.2 Terms and Definitions

We use the following terms in this thesis.

• Fault occurrence: The event corresponding to the occurrence of the hardware fault.
The fault may or may not result in an error.

• Fault activation: The event corresponding to the manifestation of the fault to the software, i.e., the fault becomes an error and corrupts some portion of the software state (e.g., a register or memory location). The error may or may not result in a crash.

• Crash: The raising of a hardware trap or exception due to the error, because the program attempted to perform an action it should not have (e.g., read outside its memory segments).

• Crash latency: The number of dynamic instructions executed by the program from fault activation to the crash. This definition is slightly different from prior work, which has used CPU cycles to measure the crash latency. The main reason we use dynamic instructions rather than CPU cycles is that we wish to obtain a platform-independent characterization of long-latency crashes.

• Long-latency crashes (LLCs): Crashes that have a crash latency greater than 1,000 dynamic instructions. Prior work has used a wide range of values for long-latency crashes, ranging from 10,000 CPU cycles [112] to as many as 10 million CPU cycles [150]. We use 1,000 instructions as our threshold because (1) each instruction corresponds to multiple CPU cycles in our system, and (2) we found that in our benchmarks the lengths of the static data-dependency sequences are far smaller, and hence setting 1,000 instructions as the threshold already filters out 99% of the crash-causing faults (Section 4.3), showing that 1,000 instructions is a reasonable threshold.

• Silent Data Corruption (SDC): A mismatch between the output of a faulty program run and that of an error-free execution of the program.

• Benign Faults: The program output matches that of the error-free execution even though a fault occurred during its execution. This means the fault was either masked or overwritten by the program.

• Error propagation: Error propagation means that the fault was activated and has affected some other portion of the program state, say X. In this case, we say the fault has propagated to state X. We focus on the faults that affect the program state, and therefore consider error propagation at the application level.

• SDC Probability: We define the SDC probability as the probability of an SDC given that the fault was activated; other work uses a similar definition [52, 66, 88, 147].

3.3 LLVM Compiler

There are many FI frameworks in the literature [7, 26, 78, 132], which focus on different platforms, components, and faults. We use the LLVM compiler [85] to perform our program analysis and FI experiments, and to implement our model. Our choice of LLVM is motivated by three reasons. First, LLVM uses a typed intermediate representation (IR) that can easily represent source-level constructs. In particular, it preserves the names of variables and functions, which makes source mapping feasible. This allows us to perform a fine-grained analysis of which program locations cause which failures, and to map them to the source code. Second, LLVM IR is a platform-neutral representation that abstracts out many low-level details of the hardware and assembly language. This greatly aids the portability of our analysis to different architectures and simplifies the handling of special cases in different assembly language formats. Finally, LLVM IR has been shown to be accurate for FI studies [147], and there are many fault injectors developed for LLVM [12, 90, 121, 147]. Many of the papers we compare our techniques with in this thesis also use the LLVM infrastructure [51, 52].
3.4 GPU Architecture and Programming Model

We focus on CUDA (Compute Unified Device Architecture) GPU programs in this thesis, as CUDA is the most popular programming model used by GPU developers [2]. CUDA is a parallel computing platform and application programming interface model created by Nvidia [2]. CUDA kernels, which are the parts of the code that run on GPU hardware, adopt the single-instruction multiple-threads (SIMT) model to exploit the massive parallelism of GPU devices. From a software perspective, the CUDA programming model abstracts the SIMT model into a hierarchy of kernels, blocks and threads. CUDA kernels consist of blocks, which consist of threads. This hierarchy allows various levels of parallelism, such as fine-grained data parallelism, coarse-grained data parallelism and task parallelism. From a hardware perspective, blocks of threads run on streaming multiprocessors (SMs), which have on-chip shared memory for the threads inside the same block. Within a block, threads are launched in fixed groups of 32 threads, called warps. Threads in a warp execute the same sequence of instructions but with different data values.

The GPU has its own memory space that is distinct from, and not synchronized with, the host CPU's memory. In the CUDA programming model, there are four kinds of memory: (1) global, (2) constant, (3) texture, and (4) shared. Global, constant, and texture memory reside in the large device memory and are generally slower to access. The shared memory space, which can be software managed, is much smaller and built on chip, and is hence much faster to access. CUDA applications need to be aware of the memory hierarchy to access GPU memory efficiently.
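The minimal CUDA C++ kernel below illustrates the kernel/block/thread hierarchy and the software-managed shared memory described above. It is a generic example of our own for illustration; the kernel name and launch configuration are not taken from this thesis, and it requires the nvcc compiler.

    #include <cuda_runtime.h>

    // Each thread squares one element; the per-block shared-memory tile
    // illustrates the on-chip memory that is managed by software.
    __global__ void squareKernel(const float *in, float *out, int n) {
        __shared__ float tile[256];                      // on-chip, per block
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;      // global -> shared
        __syncthreads();                                 // block-wide barrier
        if (i < n)
            out[i] = tile[threadIdx.x] * tile[threadIdx.x];
    }

    // Host-side launch: a grid of blocks, each with 256 threads (8 warps).
    //   squareKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);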
3.5 DNN Accelerators and Applications

3.5.1 Deep Learning Neural Networks

A deep learning neural network is a directed acyclic graph consisting of multiple computation layers [86]. In each layer, a higher-level abstraction of the input data, called a feature map (fmap), is extracted to preserve the information that is unique and important. There is a very deep hierarchy of layers in modern DNNs, and hence their name.

We consider convolutional DNNs, as they are used in a broad range of DNN applications and deployed in self-driving cars, which are our focus. In such DNNs, the primary computation occurs in the convolutional layers (CONV), which perform multi-dimensional convolution calculations. The number of convolutional layers can range from three to a few tens of layers [67, 81]. Each convolutional layer applies a kernel (or filter) to the input fmaps (ifmaps) to extract underlying visual characteristics and generate the corresponding output fmaps (ofmaps).

Each computation result is saved in an activation (ACT) after being processed by an activation function (e.g., ReLU), which is, in turn, the input of the next layer. An activation is also known as a neuron or a synapse; in this work, we use ACT to represent an activation. ACTs are connected based on the topology of the network. For example, if two ACTs, A and B, are both connected to an ACT C in the next layer, then ACT C is calculated using ACTs A and B as the inputs. In convolutional layers, ACTs in each fmap are usually fully connected to each other, whereas the connections of ACTs between fmaps in other layers are usually sparse. In some DNNs, a small number (usually fewer than three) of fully-connected layers (FC) are typically stacked behind the convolutional layers for classification purposes. In between the convolutional and fully-connected layers, additional layers can be added, such as the pooling (POOL) and normalization (NORM) layers. POOL selects the ACT with the local maximum in an area to be forwarded to the next layer and discards the rest of the area, so the size of the fmaps becomes smaller after each POOL. NORM averages ACT values based on the surrounding ACTs, so ACT values are also modified after the NORM layer.

Once a DNN topology is constructed, the network can be fed with training input data, and the associated weights, abstracted as connections between ACTs, will be learned through a back-propagation process. This is referred to as the training phase of the network. The training is usually done once, as it is very time-consuming, and then the DNN is ready for image classification with testing input data. This is referred to as the inferencing phase of the network and is carried out many times for each input data set. The input of the inferencing phase is often a digitized image, and the output is a list of output candidates of possible matches, such as car, pedestrian or animal, each with a confidence score. Self-driving cars deploy DNN applications for inferencing, and hence we focus on the inferencing phase of DNNs.

3.5.2 DNN Accelerators

Many specialized accelerators [34, 35, 62] have been proposed for DNN inferencing, each with different features to cater to DNN algorithms. However, there are two properties common to all DNN algorithms that are used in the design of all DNN accelerators: (1) MAC operations in each feature map have very sparse dependencies, which can be computed in parallel, and (2) there are strong temporal and spatial localities in the data within and across each feature map, which allow the data to be strategically cached and reused. To leverage the first property, DNN accelerators adopt spatial architectures [143], which consist of massively parallel processing engines (PEs), each of which computes MACs. Figure 3.1 shows the architecture of a general DNN accelerator. A DNN accelerator consists of a global buffer and an array of PEs. The accelerator is connected to DRAM, from which data is transferred. A CPU is usually used to off-load tasks to the accelerator. The overall architecture is shown in Figure 3.1A. The ALU of each PE consists of a multiplier and an adder as execution units to perform MACs; this is where the majority of computations happen in DNNs. The general structure of the ALU in each PE is shown in Figure 3.1B.

Figure 3.1: Architecture of a general DNN accelerator

To leverage the second property of DNN algorithms, special buffers are added on each PE as local scratchpads to cache data for reuse. Each DNN accelerator may implement its own dataflow to exploit these data localities.
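To make the MAC computation performed by the PEs concrete, the sketch below computes one output activation of a convolutional layer as a loop of multiply-accumulate operations over a kernel window, followed by a ReLU activation; the weight, image and output values in this loop nest are exactly the data that the reuse categories discussed next try to keep on chip. This is a simplified, single-channel illustration written by us, not an accelerator's actual dataflow, and it assumes the kernel window fits inside the ifmap at the given position.

    #include <algorithm>
    #include <vector>

    // One output activation of a single-channel, stride-1 convolution: a window
    // of the input fmap is multiplied element-wise with the kernel and
    // accumulated. Each loop iteration is one MAC, the operation a PE performs.
    float convOutputActivation(const std::vector<std::vector<float>> &ifmap,
                               const std::vector<std::vector<float>> &kernel,
                               int outRow, int outCol) {
        float acc = 0.0f;
        for (size_t kr = 0; kr < kernel.size(); ++kr)
            for (size_t kc = 0; kc < kernel[kr].size(); ++kc)
                acc += ifmap[outRow + kr][outCol + kc] * kernel[kr][kc];  // MAC
        return std::max(acc, 0.0f);  // ReLU; the result is the ACT
    }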
We classify the localities in DNNs into three major categories:

• Weight Reuse: Weight data of each kernel can be reused within each fmap, as the convolutions involving the kernel data are applied many times to the same ifmap.

• Image Reuse: Image data of each fmap can be reused in all convolutions where the ifmap is involved, because different kernels operate on the same set of ifmaps in each layer.

• Output Reuse: Computation results of MACs can be buffered and consumed on-PE without being transferred off the PEs.

Table 3.1 lists nine DNN accelerators that have been proposed in prior work and the corresponding data localities they exploit in their dataflows. As can be seen, each accelerator exploits one or more localities in its dataflow. Eyeriss [35] considers all three data localities in its dataflow.

Table 3.1: Data Reuses in DNN Accelerators

Accelerator | Weight Reuse | Image Reuse | Output Reuse
Zhang et al. [153], Diannao [34], Dadiannao [36] | N | N | N
Chakradhar et al. [28], Sriram et al. [131], Sankaradas et al. [122], nn-X [57], K-Brain [107], Origami [27] | Y | N | N
Gupta et al. [60], Shidiannao [49], Peemen et al. [109] | N | N | Y
Eyeriss [35] | Y | Y | Y

We separate faults that originate in the datapath (i.e., latches in execution units) from those that originate in buffers (both on- and off-PE), because they propagate differently: faults in the datapath are read only once, whereas faults in buffers may be read multiple times due to reuse, and hence the same fault can spread to multiple locations within a short time window.

3.5.3 Consequences of Soft Errors

The consequences of soft errors that occur in DNN systems can be catastrophic, as many of these systems are safety-critical, and error mitigation is required to meet certain reliability targets. For example, in self-driving cars, a soft error can lead to the misclassification of objects, resulting in a wrong action taken by the car. In our fault injection experiments, we found many cases where a truck was misclassified under a soft error. We illustrate this in Figure 3.2. The DNN in the car should classify the oncoming object as a transport truck in a fault-free execution and apply the brakes in time to avoid a collision (Figure 3.2A). However, due to a soft error in the DNN, the truck is misclassified as a bird (Figure 3.2B), and the braking action may not be applied in time to avoid the collision, especially when the car is operating at high speed. This is an important concern, as it often results in the violation of standards such as ISO 26262, dealing with the functional safety of road vehicles [119], which requires the System on Chip (SoC) carrying DNN inferencing hardware in self-driving cars to have a soft-error FIT rate of less than 10 FIT [13], regardless of the underlying DNN algorithm and accuracy. Since a DNN accelerator is only a fraction of the total area of the SoC, the FIT allowance of a DNN accelerator should be much lower than 10 in self-driving cars. However, we find that a DNN accelerator alone may exceed the total required FIT rate of the SoC without protection (Section 8.4.2).

Figure 3.2: Example of an SDC that could lead to a collision in self-driving cars due to soft errors

Chapter 4

Fine-Grained Characterization of Faults Causing Long-Latency Crashes in Programs

In this chapter, we investigate an important but neglected problem in the design of dependable software systems, namely identifying faults that propagate for a long time before causing crashes, or long-latency crashes (LLCs).
We first define the problem, then characterize the code patterns that lead to LLCs through an empirical fault injection study. Based on the observations, we propose efficient heuristics to identify these code patterns for protection, in order to eliminate LLCs in programs. The identification is done through static and dynamic analyses of a given program, without requiring extensive fault injections. Hence, the proposed technique is much faster than traditional fault injection methods. Finally, we evaluate the proposed technique and present the results.

4.1 Introduction

A hardware fault can have many possible effects on a program. First, it may be masked or be benign; in other words, the fault may have no effect on the program's final output. Second, it may cause a crash (i.e., a hardware exception) or a hang (i.e., a program time-out). Finally, it may cause a Silent Data Corruption (SDC), i.e., the program producing incorrect output. Of the above outcomes, SDCs are considered the most severe, as there is no visible indication that the application has done something wrong. Therefore, a number of prior studies have focused on detecting SDC-causing program errors by selectively identifying and protecting the elements of program state that are likely to cause SDCs [52, 64, 97, 141].

Compared to SDCs, crashes have received relatively less attention from the perspective of error detection. This is because crashes are considered to be the detection themselves, as the program can be recovered from a checkpoint (if one exists) or restarted after a crash. However, all of these studies make an important assumption, namely that the crash occurs soon after the fault is manifested in the program. This is important to ensure that the program is prevented from writing corrupted state to the file system (e.g., a checkpoint), or sending wrong messages to other processes [15]. While this assumption is true for a large number of faults, studies have shown that a small but non-negligible fraction of faults persist for a long time in the program before causing a crash, and that these faults can cause significant reliability problems such as extended downtimes [58, 146, 150]. We call these long-latency crashes (LLCs). Therefore, there is a compelling need to develop techniques for protecting programs from LLC-causing faults.

Prior work has experimentally assessed LLCs through fault injection experiments [58]. However, it does not provide much insight into why some faults cause LLCs. This is important because (1) fault injection experiments require a lot of computation time, especially to identify relatively rare events such as LLCs, and (2) fault injection cannot guarantee completeness in identifying all or even most LLC-causing locations. The latter is important in order to ensure that crash latency is bounded in the program by protecting LLC-causing program locations. Yim et al. [150] analyze error propagation latency in the program and develop a coarse-grained categorization of program locations based on whether a fault in the location can cause LLCs. The categorization is based on where the program data resides, such as the text segment, stack segment or heap segment.
While this is useful, it does not help programmers decide which parts of the program need to be protected, as protecting all parts of the program that manipulate heap data or stack data can lead to prohibitive performance overheads.

In contrast to the above work, we present a technique to perform fine-grained classification of a program's data at the level of individual variables and program statements, based on whether a fault in the data item causes an LLC. The main insight underlying our work is that very few program locations are responsible for LLCs, and that these locations conform to a few dominant code patterns. Our technique performs static analysis of the program to identify the LLC-causing code patterns. However, not every instance of an LLC-causing code pattern leads to an LLC. Our technique therefore uses dynamic analysis, coupled with a very selective fault injection experiment, to filter the false positives and isolate the few instances of the patterns that lead to LLCs. We have implemented our technique in a completely automated tool called CRASHFINDER, which is integrated with the LLVM compiler infrastructure [85]. To the best of our knowledge, we are the first to propose an automated and efficient method to systematically identify LLC-causing program locations for protection in a fine-grained fashion.

We make the following contributions in this work:

• Identify the dominant code patterns that can cause LLCs in programs through a large-scale fault injection experiment conducted on a total of ten benchmark applications.

• Develop an automated static analysis technique to identify the LLC-causing code patterns in programs, based on the fault injection study.

• Propose a dynamic analysis and selective fault-injection-based approach to filter out the false positives identified by the static analysis technique and identify LLCs.

• Implement the static and dynamic analysis techniques in an automated tool, which we call CRASHFINDER.

• Evaluate CRASHFINDER on benchmark applications from the SPEC [68], PARBOIL [133], PARSEC [18] and SPLASH-2 [148] benchmark suites. We find that CRASHFINDER can accurately identify over 90% of the LLC-causing locations in the program, with no false positives, and is about nine orders of magnitude faster than performing exhaustive fault injections to identify all LLCs in a program.

Figure 4.1: Long-latency crash and checkpointing

4.2 Why bound the crash latency?

We now explain our rationale for studying LLCs and why it is important to bound the crash latency in programs. We note that similar observations have been made in prior work [58, 150], and that studies have shown that unbounded crash latency can result in severe failures. We consider one example.

Assume that the program is being checkpointed every 8,000 instructions so that it can be recovered in the case of a failure (we set aside the practicality of performing such fine-grained checkpointing for now). We assume that the checkpoints are gathered in an application-independent manner, i.e., the entire state of the program is captured in the checkpoint. If the program encounters an LLC of more than 10,000 instructions, it is highly likely that one or more checkpoints will be corrupted (by the fault causing the LLC). This situation is shown in Figure 4.1.
However, if the crash latency is bounded to 1,000 instructions (say), then it is highly unlikely for the fault to corrupt more than one checkpoint. Note that the latency between the fault occurrence and the fault activation does not matter in this case, as the checkpoint is corrupted only when the fault actually gets activated. Therefore, we focus on the crash latency in this chapter, i.e., the number of dynamic instructions from the fault activation to the crash.

Identifying program locations that are prone to LLCs is critical to improving system reliability, as one can bound the crash latency by selectively protecting LLC-prone locations with minimal performance overheads. For example, one can duplicate the backward slices of the LLC-prone locations, or use low-cost detectors for these locations, as prior work has done [123]. In this chapter, we focus on identifying such LLC-causing program locations, and defer the problem of protecting the locations to future work.

4.3 Initial Fault Injection Study

In this section, we perform an initial fault injection study to characterize the LLC-causing locations in a program. The goal of this study is to measure the frequency of LLCs and to understand the reasons for them in terms of the program's code. In turn, this will allow us to formulate heuristics for identifying the LLC-causing code patterns in Section 4.4. We first explain our experimental setup for this study, and then discuss the results.

4.3.1 Fault Injection Experiment

To perform the fault injection study, we use LLFI [147], an open-source fault injector that operates at the LLVM compiler's IR level. We inject faults into the destination registers of LLVM IR instructions, as per our fault model in Chapter 3. We first profile each program to get the total number of dynamic instructions. We then inject a single bit flip in the destination register of a single dynamic instruction, chosen at random from the set of all dynamic instructions executed by the program. A recent study [31] has shown that this fault injection method is representative. Our benchmarks are chosen from the SPEC [68], PARBOIL [133], PARSEC [18] and SPLASH-2 [148] suites. We choose ten programs at random from these suites and inject a total of 1,000 faults in each, for a total of 10,000 fault injection experiments. The details of the benchmarks are explained in Section 4.6.1.

Note that our way of injecting faults using LLFI ensures that the fault is activated right away, as it directly corrupts the program's state during the injection. Therefore, we do not measure activation, as the set of activated faults is the same as the set of injected faults. We categorize the results into crashes, SDCs, hangs and benign faults in our experiment. Because our focus in this chapter is on LLCs, we record the crash latency for crash-causing faults in terms of the number of dynamic LLVM IR instructions between the fault injection and the crash. However, when the program crashes, its state will be lost, and hence we periodically write to permanent storage the number of dynamic instructions executed by the program after the fault injection. The counting of the dynamic instructions is done using the tracing feature of LLFI, which we have enabled in our experiments.
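The sketch below illustrates one way such a crash-surviving latency count could be realized: instrumentation increments a counter after every dynamic instruction executed after the injection point and periodically flushes it to a file, so the last flushed value approximates the crash latency even after the process is killed by a trap. This is our own illustration of the idea, not the actual LLFI tracing code; the file name and flush interval are arbitrary.

    #include <cstdio>

    // Incremented by an instrumentation hook after every dynamic IR
    // instruction that executes once the fault has been injected.
    static unsigned long long instrsSinceInjection = 0;

    // Flushes the running count to disk every 1,000 instructions so that the
    // value survives a crash of the instrumented program.
    extern "C" void countInstruction() {
        ++instrsSinceInjection;
        if (instrsSinceInjection % 1000 == 0) {
            if (FILE *f = std::fopen("crash_latency.txt", "w")) {
                std::fprintf(f, "%llu\n", instrsSinceInjection);
                std::fclose(f);   // closed (and flushed) on every update
            }
        }
    }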
4.3.2 Fault Injection Results

We classify the results of the fault injection experiments into SDCs, crashes and benign faults. Hangs were negligible in our experiment and are not reported. Figure 4.2 shows the aggregated fault injection results across the benchmarks. We find that, on average, crashes constitute about 35% of the faults, SDCs constitute 4.2%, and the remaining faults (about 60%) are benign. We focus on crashes in the rest of this chapter, as our focus is on LLCs.

Figure 4.2: Aggregate fault injection results across benchmarks

Figure 4.3 shows the distribution of crash latencies for all the faults that led to crashes in the injections. On average, the percentage of LLCs is about 0.38% across the ten benchmarks. Recall that we set 1,000 dynamic instructions as the threshold for determining whether a crash is an LLC. Therefore, LLCs constitute a relatively small fraction of the total crashes in programs. This is why it is important to devise fine-grained techniques to identify them, as even a relatively large fault injection experiment such as ours exposes very few LLCs in the program (38 in absolute numbers). The percentages of LLCs among all the crash-causing faults vary from 0% to 3.6% across programs due to benchmark-specific characteristics. The reasons for these variations are discussed in Section 4.7.

Figure 4.3: Latency distribution of crash-causing errors in programs. The purple bars represent the LLCs, as they have a crash latency of more than 1,000 instructions. The number shown at the top of each bar is the percentage of crashes that resulted in LLCs. The error bars for LLCs range from 0% (cutcp) to 1.85% (sjeng).

We also categorized the LLCs based on the code patterns in which the LLC locations occurred. In other words, we study the kinds of program constructs which, when fault injected, are likely to cause LLCs. We choose the largest five applications from the ten benchmarks for studying the code characteristics, since the larger the programs, the more code patterns they may reveal. Thus, we choose sjeng, hmmer, h264ref, libquantum and mcf for our detailed investigation.

Figure 4.4 shows the distribution of the LLC-causing code patterns we found in our experiments. The patterns themselves are explained in Section 4.3.3. We find that about 95% of the LLC-causing code falls into three dominant patterns, namely (1) Pointer Corruption (20%), (2) Loop Corruption (56%), and (3) State Corruption (19%). Therefore, we focus on these three patterns in the rest of this chapter.

4.3.3 Code Patterns that Lead to LLCs

As mentioned in the previous section, we find that LLCs fall into three dominant patterns, namely pointer-related LLCs, loop-related LLCs and state-related LLCs.
Figure 4.4: Distribution of LLC categories across five benchmarks (sjeng, libquantum, hmmer, h264ref and mcf). Three dominant categories account for 95% of the LLCs.

We explain each category with code examples in the following subsections. Although these observations were made at the LLVM IR level, we use C code for clarity when explaining them.

Pointer Corruption LLC

This pattern occurs when a fault is injected into a pointer that is written to memory. An erroneous pointer value is stored in memory, and this value can be used in a memory operation later on to cause a crash. Because the pointer may not be read for a long time, this pattern has the potential to cause an LLC. Figure 4.5A shows a case we observed in sjeng in our fault injection experiment. In the function reloadMT, *p0 and next are assigned to a global static variable, state, at lines 7 and 8, respectively. The fault is injected on the pointer *p0 at line 10. As a result, an erroneous pointer value is saved in memory, and it is used in a memory operation in the function randomMT at line 18 after a long time. This leads to an LLC.

Figure 4.5: Code examples showing the three kinds of LLCs that occurred in our experiments.

Loop Corruption LLC

When faults are injected into loop conditions or array indices inside a loop, the array manipulated by the loop (if any) may extensively corrupt the stack and cause an LLC. We categorize this as a Loop Corruption LLC. There are two cases in which this LLC can occur.

The first case is when a fault occurs in the array index of an array written within the loop. This fault can corrupt a large area of the stack, since an erroneous array index is used for array address offset calculations in every iteration of the loop. This large-scale corruption of the stack significantly increases the chance of corrupting address values (i.e., pointers, return addresses, etc.) on the stack, which in turn can result in a crash much later. For example, in Figure 4.5B, when a fault is injected into next, making a corrupted value be saved back to it at line 5, the struct array perm[] at line 9 corrupts values on the stack. When the corrupted value is used for memory operations later in the program, an LLC is observed.

The second case occurs when faults are injected into the termination conditions of the loop, causing a stack overflow. This is shown in Figure 4.5C. Assume that a fault is injected into piece_count at line 3 and makes it a large value. This will cause the for loop at line 5 to execute for a much larger number of iterations, thereby corrupting the stack and eventually leading to an LLC.

State Corruption LLC

This pattern occurs when faults are injected into state variables or lock (synchronization) variables in state machine structures. These variables are declared as static or global variables and are used to allocate or deallocate particular pieces of memory. If these states are corrupted, crashes may happen between states, thus causing LLCs. In the code shown in Figure 4.5D, when we inject a fault into opstatus at line 7, the variable opstatus becomes a non-zero value (from zero) when the state goes to quantum_objcode_stop. Later, in the function quantum_objcode_put, when the state is updated to quantum_objcode_stop, the opstatus variable is examined to decide whether a particular memory area should be accessed (line 23). Due to the injected fault, we observed that objcode is accessed at line 28 while in the state quantum_objcode_stop. This leads to an LLC, as it accesses the unallocated memory area objcode, which is illegal.
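The distilled C++ fragment below revisits the second loop-corruption case in the spirit of Figure 4.5C: if a bit flip inflates the loop bound, the loop writes far past the end of a stack-allocated array, silently overwriting neighbouring stack data long before the eventual crash. The function and variable names are ours, not the benchmark's.

    // Illustrative only: a corrupted loop bound turns a bounded initialization
    // loop into a large out-of-bounds write over the stack frame.
    void initScores(int pieceCount /* a bit flip may make this huge */) {
        int scores[16];                       // stack-allocated array
        for (int i = 0; i < pieceCount; i++)  // bound is <= 16 when fault-free
            scores[i] = 0;                    // i >= 16 overwrites adjacent stack data
        // The corrupted stack values (saved pointers, return addresses) may only
        // be dereferenced much later, producing a long-latency crash rather than
        // an immediate one.
    }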
4.4 Approach

In this section, we describe our proposed technique, CRASHFINDER, to find all the LLCs in a program. CRASHFINDER consists of three phases, as Figure 4.6 shows. In the first phase, it performs a static analysis of the program's source code to determine the potential locations that can cause LLCs. The analysis is based on the code patterns in Section 4.3.3. We refer to this phase of CRASHFINDER as CRASHFINDER STATIC. In the second phase, it performs a dynamic analysis of the program (under a given set of inputs) to determine which dynamic instances of the static locations are likely to result in LLCs. We call this phase CRASHFINDER DYNAMIC. In the last phase, it injects a selected few faults into the dynamic instances chosen by CRASHFINDER DYNAMIC. We refer to this phase of CRASHFINDER as selective fault injection. We describe the three phases in the following subsections.

Figure 4.6: Workflow of CRASHFINDER

4.4.1 Phase 1: Static Analysis (CRASHFINDER STATIC)

CRASHFINDER STATIC is the static analysis portion of our technique that statically searches the program's code for the three patterns identified in Section 4.3.3. We found that these three patterns are responsible for almost all the LLCs in the program, and hence it suffices to look for these patterns to cover the LLCs. However, not every instance of these patterns may lead to an LLC, and hence we may get false positives in this phase. False positives are those locations that conform to the LLC-causing patterns but do not lead to an LLC; they are addressed in the next phase.

The algorithm of CRASHFINDER STATIC takes the program's source code, compiled to the LLVM IR, as input and outputs the list of potential LLC-causing locations. Specifically, CRASHFINDER STATIC looks for the following patterns in the program:

Pointer Corruption LLC

CRASHFINDER STATIC finds pointers that are written to memory in the program. More specifically, it examines the static data dependency sequences of all pointers, and considers only the ones that end with a store instruction.

Loop Corruption LLC

In this category, CRASHFINDER STATIC finds loop termination variables in loop headers and array index assignment operations. For loop termination variables, it looks for the variables that are compared with the loop index variable in loop headers. For array index assignments, CRASHFINDER STATIC first locates binary operations with a variable and a constant as operands, and then checks whether the stored result is used as an offset in an array address calculation. If so, we can infer that the variable being updated will be used as the address offset of an array. In LLVM, offset calculations are done through a special instruction and are hence easy to identify statically.

State Corruption LLC

CRASHFINDER STATIC finds static and global variables used to store state or locks. Because these may depend on the application's semantics, we devise a heuristic to find such variables: if a static variable is loaded and directly used in comparisons and branches, we assume that it is likely to be a state variable or a lock variable. We find that this heuristic allows us to cover most of these cases without semantic knowledge of the application.
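The sketch below shows the kind of LLVM IR walk that the first two patterns imply: flagging stores whose stored value is itself a pointer, and stores whose address is computed by a getelementptr instruction with a non-constant (i.e., variable) index. This is our own illustrative sketch against the LLVM C++ API, not CRASHFINDER STATIC's actual implementation, and it omits the loop-header and state-variable heuristics.

    #include "llvm/ADT/SmallVector.h"
    #include "llvm/IR/Constants.h"
    #include "llvm/IR/Function.h"
    #include "llvm/IR/Instructions.h"

    using namespace llvm;

    // Collect instructions matching two of the LLC-prone patterns:
    // (1) a pointer value being stored to memory, and
    // (2) a store whose address is computed from a variable array index.
    SmallVector<Instruction *, 16> collectLLCCandidates(Function &F) {
      SmallVector<Instruction *, 16> candidates;
      for (BasicBlock &BB : F) {
        for (Instruction &I : BB) {
          auto *SI = dyn_cast<StoreInst>(&I);
          if (!SI)
            continue;
          // Pattern 1: the value written to memory is itself a pointer.
          if (SI->getValueOperand()->getType()->isPointerTy()) {
            candidates.push_back(SI);
            continue;
          }
          // Pattern 2: the target address is a GEP with a non-constant index,
          // i.e., a variable array offset computed at run time.
          if (auto *GEP = dyn_cast<GetElementPtrInst>(SI->getPointerOperand()))
            for (Value *Idx : GEP->indices())
              if (!isa<ConstantInt>(Idx)) {
                candidates.push_back(SI);
                break;
              }
        }
      }
      return candidates;
    }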
4.4.2 Phase 2: Dynamic Analysis (CRASHFINDER DYNAMIC)

In this phase, our technique attempts to eliminate the false positives from the static locations identified in Phase 1. One straw-man approach for doing so is to inject faults into every dynamic instance of the static locations to determine if it leads to an LLC. However, a single static instruction may correspond to hundreds of thousands of dynamic instances in a typical program, especially if it is within a loop. Further, each of these dynamic instances needs to be fault injected multiple times to determine if it will lead to an LLC, and hence a large number of fault injections would need to be performed. All this adds up to considerable performance overhead, and hence the above straw-man approach does not scale.

We propose an alternative approach to cut down the number of fault injection locations needed to filter out the false positives. Our approach uses dynamic analysis to identify a few dynamic instances to consider for injection among the set of all the identified static locations. The main insight we leverage is that there are repeated control-flow sequences in which the dynamic instances occur, and it is sufficient to sample dynamic instances in each unique control-flow sequence to obtain a representative set of dynamic instances for fault injection. This is because the crash latency predominantly depends on the control-flow sequence executed by the program after the injection at a given program location. Therefore, it suffices to obtain one sample from each unique control-flow pattern in which the dynamic instance occurs. We determine the control-flow sequences at the level of function calls. That is, we sample the dynamic instances with different function call sequences, and ignore the ones that have the same function call sequences. We show in Section 4.7 that this sampling heuristic works well in practice.

We consider an example to illustrate how the sampling heuristic determines which dynamic instances to choose. Figure 4.7(b) shows the dynamic execution trace generated by the code in Figure 4.7(a). Suppose we want to sample the dynamic instances corresponding to the variable t1a at line 17 in Figure 4.7(a). First, because t1a is within a loop in the function relax, it corresponds to multiple dynamic instances in the trace. We consider only one of them as a candidate for choosing samples (we call it a sample candidate), since they have the same function call sequences (no function calls) in between. Second, the function relax is called within a loop in the function multig at lines 5 and 7. As can be seen in Figure 4.7, there are two recurring function call sequences circumscribing the execution of the static location corresponding to the sample candidates, namely relax() copy_red() and relax() copy_black(). We collect one sample of each sequence regardless of how many times they occur. In this case, only sample candidates 1 and 2 are selected for later fault injections. We find that this dramatically reduces the fault injection space, thereby saving considerable time.

Figure 4.7: Dynamic sampling heuristic. (a) Example source code (ocean program), (b) execution trace and sample candidates.
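One way to realize this sampling heuristic is sketched below: given a trace of function-call events and occurrences of the static location under study, keep one sample candidate per distinct call sequence observed since the previous candidate. The trace format and names are assumptions of ours, not CRASHFINDER DYNAMIC's actual data structures.

    #include <set>
    #include <string>
    #include <vector>

    // A trace event is either a function call ("call") or an occurrence of the
    // static location under study ("CAND").
    struct TraceEvent { std::string kind; std::string func; };

    // Keep one sample candidate per distinct function-call sequence observed
    // since the previous candidate.
    std::vector<size_t> pickSampleCandidates(const std::vector<TraceEvent> &trace) {
        std::vector<size_t> samples;
        std::set<std::string> seenSequences;
        std::string currentSeq;                  // calls since the last candidate
        for (size_t i = 0; i < trace.size(); ++i) {
            if (trace[i].kind == "call") {
                currentSeq += trace[i].func + ";";
            } else if (trace[i].kind == "CAND") {
                if (seenSequences.insert(currentSeq).second)
                    samples.push_back(i);        // first time this context is seen
                currentSeq.clear();              // start a new inter-candidate window
            }
        }
        return samples;
    }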
4.4.3 Phase 3: Selective Fault Injections

The goal of this phase is to filter out all the false positives identified in the previous phases through fault injections. Once we have isolated a set of dynamic instances from CRASHFINDER DYNAMIC to inject for a static location, we configure our fault injector to inject two faults into each dynamic instance, one fault at a time. We choose one high-order bit and one low-order bit at random to inject into, as we found experimentally that LLCs predominantly occur either in the high-order bits or the low-order bits, and hence one needs to sample both.

We then classify the location as an LLC location (i.e., not a false positive) if any one of the injected faults results in an LLC. Otherwise, we consider it a false positive and remove it from the list of LLC locations. Note that this approach is conservative, as performing more injections can potentially increase the likelihood of finding an LLC, and hence it is possible that we miss some LLCs. However, as we show in Section 4.7, our approach finds most LLCs even with only two fault injections per dynamic instance. We also show that increasing the number of fault injections beyond two for each dynamic instance does not yield substantial benefits, and hence we stick to two injections per instance.

4.5 Implementation

We implemented CRASHFINDER STATIC as a pass in the LLVM compiler [85] to analyze the IR code and extract the code patterns. We implemented CRASHFINDER DYNAMIC also as an LLVM pass that instruments the program to obtain its control flow. CRASHFINDER DYNAMIC then analyzes the control-flow patterns and determines which instances to choose for selective fault injection. We use the LLFI fault injection framework [147] to perform the fault injections. Finally, we used our crash latency measurement library to determine the crash latencies after injection.

To use CRASHFINDER (the tool and its source code can be freely downloaded from https://github.com/DependableSystemsLab/Crashfinder), all the user needs to do is compile the application code with the LLVM compiler using our module. No annotations are needed. The user also needs to provide representative inputs so that CRASHFINDER can execute the application, collect the control-flow patterns and choose the dynamic instances into which to inject faults.

4.6 Experimental Setup

We empirically evaluate CRASHFINDER in terms of accuracy and performance. We use a fault injection experiment to measure the accuracy, and use the execution time of the technique to measure its performance. We evaluate both CRASHFINDER and CRASHFINDER STATIC separately to understand the effect of the different parts of the technique (CRASHFINDER includes CRASHFINDER STATIC, CRASHFINDER DYNAMIC and the selective fault injection). We compare both the accuracy and the performance of the two techniques to those of the exhaustive fault injections that would be needed to find all the LLCs in a program (our goal is to find all LLC-causing locations in the program so that we can selectively protect them and bound the crash latency). Our experiments are all carried out on an Intel Xeon E5 machine with 32 GB RAM, running Ubuntu Linux 12.04.

We first present the benchmarks used (Section 4.6.1), followed by the research questions (Section 4.6.2). We then present an overview of the methodology we used to answer each of the research questions (Section 4.6.3).

4.6.1 Benchmarks

We choose a total of ten benchmarks from various domains for evaluating CRASHFINDER. The benchmark applications are from SPEC [68], PARBOIL [133], PARSEC [18] and SPLASH-2 [148]. All the benchmark applications are compiled and linked into native executables using LLVM, with standard optimizations enabled.
We show the detailed information of the benchmarks in Table 4.1.

Table 4.1: Characteristics of Benchmark Programs

Benchmark | Benchmark Suite | Description
libquantum | SPEC | A library for the simulation of a quantum computer
h264ref | SPEC | A reference implementation of H.264/AVC (Advanced Video Coding)
blackscholes | PARSEC | Option pricing with the Black-Scholes Partial Differential Equation (PDE)
hmmer | SPEC | Uses a statistical description of a sequence family's consensus to do sensitive database searching
mcf | SPEC | Solves single-depot vehicle scheduling problems in transportation planning
ocean | SPLASH-2 | Large-scale ocean movement simulation based on eddy and boundary currents
sad | PARBOIL | Sum-of-absolute-differences kernel, used in MPEG video encoders
sjeng | SPEC | A program that plays chess and several chess variants
cutcp | PARBOIL | Computes the short-range component of Coulombic potential at each grid point
stencil | PARBOIL | An iterative Jacobi stencil operation on a regular 3-D grid

4.6.2 Research Questions

We answer the following research questions (RQs) in our experiments.

RQ1: How much speedup do CRASHFINDER STATIC and CRASHFINDER achieve over exhaustive injection?

RQ2: What is the precision of CRASHFINDER STATIC and CRASHFINDER?

RQ3: What is the recall of CRASHFINDER STATIC and CRASHFINDER?

RQ4: How well do the sampling heuristics used in CRASHFINDER work in practice?

4.6.3 Experimental Methodology

We describe our methodology for answering each of the RQs below. We perform fault injections using the LLFI fault injector [147], as described earlier.

Performance

In order to answer RQ1, we measure the total time taken to execute CRASHFINDER STATIC, CRASHFINDER and the exhaustive fault injections. More specifically, for each benchmark, we measure the total time used for (1) CRASHFINDER STATIC, (2) CRASHFINDER, which includes CRASHFINDER STATIC, CRASHFINDER DYNAMIC and the selective fault injections to identify LLCs, and (3) exhaustive fault injections to find LLCs.

Precision

The precision is an indication of the false positives produced by CRASHFINDER STATIC and CRASHFINDER. To measure the precision, we inject 200 faults randomly at each static program location identified by CRASHFINDER STATIC or CRASHFINDER, and measure the latency. If none of the injections at the location results in an LLC, we declare it to be a false positive. Note that we choose 200 fault injections per location to balance time and comprehensiveness. If we increased the number of faults, we might find more LLC-causing locations, thus decreasing the false positives. Thus, this method gives us a conservative upper bound on the false positives of the technique.

Recall

The recall is an indication of the false negatives produced by CRASHFINDER STATIC and CRASHFINDER. To measure the recall of CRASHFINDER STATIC and CRASHFINDER, we randomly inject 3,000 faults for each benchmark and calculate the fraction of the observed LLCs that were covered by CRASHFINDER STATIC and CRASHFINDER, respectively. Thus, 30,000 faults in total are injected over the ten benchmark applications for this experiment. Note that this is in addition to the 1,000 fault injection experiments per benchmark performed in the initial study, which were used to develop the two techniques. We do not include the initial injections in the recall measurement to avoid biasing the results.

Heuristics for Sampling

As mentioned in Section 4.4, there are two heuristics used by CRASHFINDER to reduce the space of fault injections it has to perform.
The first is to limit the chosen instances to unique dynamic instances of the control-flow patterns in which the static instructions appear, and the second is to limit the number of faults injected into the dynamic instances to two faults per instance. These heuristics may lead to a loss in coverage. We investigate the efficacy of these heuristics by varying the parameters used in them and measuring the resulting recall.

4.7 Results

This section presents the results of our experiments for evaluating CRASHFINDER STATIC and CRASHFINDER. Each subsection corresponds to a research question (RQ).

4.7.1 Performance (RQ1)

We first present the results of running CRASHFINDER and CRASHFINDER STATIC in terms of the number of instructions in each benchmark, and then examine how much speedup CRASHFINDER achieves over exhaustive fault injections.

Table 4.2 shows the numbers of instructions for each benchmark. In the table, columns Total S.I. and Total D.I. show the total numbers of static instructions and dynamic instructions of each benchmark. Columns CF Static S.I. and CF Static D.I. indicate the percentages of static instructions and dynamic instructions corresponding to the static instructions that were found by CRASHFINDER STATIC to be LLC-causing locations. Columns CF S.I. and CF D.I. show the percentages of static instructions and dynamic instances of the static locations that CRASHFINDER identified as LLC-causing locations. As can be seen from the table, on average, CRASHFINDER STATIC identified 2.99% of static instructions as LLC-causing, which corresponds to about 5.87% of dynamic instructions. In comparison, CRASHFINDER further winnowed the number of static LLC-causing locations down to 0.89%, and the number of dynamic instructions to just 0.385%, thereby achieving a significant reduction in the dynamic instructions. The implications of this reduction are further investigated in Section 4.8.

Table 4.2: Comparison of Instructions Given by CRASHFINDER and CRASHFINDER STATIC

Benchmark | Total S.I. | Total D.I. (millions) | CF Static S.I. (%) | CF Static D.I. (%) | CF S.I. (%) | CF D.I. (%)
libquantum | 15319 | 870 | 1.85% | 9.27% | 0.18% | 0.011%
h264ref | 189157 | 116 | 0.85% | 3.92% | 0.14% | 0.150%
blackscholes | 758 | 0.13 | 3.29% | 1.81% | 0.66% | 0.004%
hmmer | 92287 | 4774 | 0.51% | 3.53% | 0.13% | 0.437%
mcf | 4086 | 6737 | 6.29% | 8.75% | 2.62% | 1.383%
ocean | 21300 | 1061 | 3.46% | 3.11% | 0.53% | 0.003%
sad | 3176 | 1982 | 4.47% | 5.56% | 0.69% | 0.473%
sjeng | 33931 | 137 | 1.70% | 15.55% | 0.16% | 0.567%
cutcp | 3868 | 11389 | 3.13% | 6.35% | 0.39% | 0.001%
stencil | 2193 | 7168 | 4.38% | 0.84% | 0.41% | 0.819%
Average | 36608 | 3423 | 2.99% | 5.87% | 0.89% | 0.385%

Figure 4.8 shows the orders of magnitude of time reduction achieved by using CRASHFINDER STATIC and CRASHFINDER to find LLCs, compared to exhaustive fault injections, for each benchmark. In the figure, CRASHFINDER STATIC refers to the time taken to run CRASHFINDER STATIC, while CRASHFINDER refers to the time taken to run all three components of CRASHFINDER, namely CRASHFINDER STATIC, CRASHFINDER DYNAMIC and the selective fault injection phase. Note that the exhaustive fault injection times are an estimate based on the number of dynamic instructions that need to be injected and the time taken to perform a single injection. We emphasize that the numbers shown represent the orders of magnitude of speedup. For example, a value of 12 in the graph means that the corresponding technique was 10^12 times faster than performing exhaustive fault injections.
In summary, on average, CRASHFINDER STATIC achieves 13.47 orders of magnitude of time reduction, whereas CRASHFINDER achieves 9.29 orders of magnitude of time reduction, over exhaustive fault injection to find LLCs.

We also measured the wall-clock time of the different phases of CRASHFINDER. The geometric mean of the time taken by CRASHFINDER STATIC is 23 seconds; CRASHFINDER DYNAMIC takes 3.1 hours, while the selective fault injection phase takes about 3.9 days. Overall, it takes about 4 days on average for CRASHFINDER to complete the entire process. While this may seem large, note that both the CRASHFINDER DYNAMIC and selective fault injection phases can be parallelized to reduce the time. We did not, however, do this in our experiments.

Figure 4.8: Orders of magnitude of time reduction by CRASHFINDER STATIC and CRASHFINDER compared to exhaustive fault injections

4.7.2 Precision (RQ2)

Figure 4.9 shows the precision of CRASHFINDER STATIC and CRASHFINDER for each benchmark. The average precision of CRASHFINDER STATIC and CRASHFINDER is 25.42% and 100%, respectively. The reason CRASHFINDER has a precision of 100% is that all the false positives produced by the static analysis phase (CRASHFINDER STATIC) are filtered out by the latter two phases, namely CRASHFINDER DYNAMIC and selective fault injection. The main reason CRASHFINDER STATIC has low precision is that it cannot statically determine the exact runtime behavior of variables. For example, a pointer can be saved to and loaded from memory in very short intervals, and would not result in an LLC. This behavior is determined by the runtime control flow and cannot be determined at compile time, thus resulting in false positives for CRASHFINDER STATIC. CRASHFINDER does not have this problem, as it uses dynamic analysis and selective fault injection.

Figure 4.9: Precision of CRASHFINDER STATIC and CRASHFINDER for finding LLCs in the program

4.7.3 Recall (RQ3)

Figure 4.10 shows the recall of CRASHFINDER STATIC and CRASHFINDER. The average recall of CRASHFINDER STATIC and CRASHFINDER is 92.47% and 90.14%, respectively. Based on the results, we can conclude that (1) CRASHFINDER STATIC is able to find most of the LLC-causing locations, showing that the code patterns we identified are comprehensive, and (2) the heuristics used in CRASHFINDER DYNAMIC and selective fault injection do not filter out many legitimate LLC locations, since there is only a 2.33% difference between the recalls of CRASHFINDER STATIC and CRASHFINDER (however, they filter out most of the false positives, as evidenced by the high precision of CRASHFINDER compared to CRASHFINDER STATIC). We discuss this further in the next subsection.

Figure 4.10: Recall of CRASHFINDER STATIC and CRASHFINDER

There are two reasons why CRASHFINDER STATIC does not achieve 100% recall: (1) there are a few cases, as mentioned in Section 4.3, that do not fall into the three dominant patterns, and (2) while CRASHFINDER STATIC is able to find most of the common cases of LLCs, it does not find some cases where the dependency chain spans multiple function calls. For example, the return value of an array index calculation can be propagated through complex function calls and finally used in the address offset operations in a loop. This makes the pointer analysis in LLVM return too many candidates for the pointer target, and so we truncate the dependence chain. However, there is no fundamental reason why we cannot handle these cases.
Even without handling these cases, CRASHFINDER STATIC finds 92.47% of the cases leading to LLCs in the program.

Note that we did not observe any LLCs in the two benchmark programs stencil and cutcp. This may be because they have fewer LLC-causing locations, and/or they have a small range of bits that may result in LLCs. This was also the case in the initial study (Section 4.3).

4.7.4 Efficacy of Heuristics (RQ4)

As mentioned earlier, there are two heuristics used by CRASHFINDER DYNAMIC to speed up the injections. First, in the dynamic analysis phase (CRASHFINDER DYNAMIC), only a few instruction instances are chosen for injection. Second, in the selective fault injection phase, only a few bits in each of the chosen locations are injected. We examine the effectiveness of these heuristics in practice.

In order to understand the LLC-causing errors that are covered by CRASHFINDER STATIC but not by CRASHFINDER, we manually inspected these injections. We found that all of the missed errors are due to the second heuristic, used by the selective fault injection phase. None of the missed errors were due to the first heuristic, employed by CRASHFINDER DYNAMIC.

The heuristic for choosing bit positions for selective injections picks two random positions in the word to inject faults into, one from the high-order bits and one from the low-order bits. Unfortunately, this may miss other positions that lead to LLCs. We evaluated the effect of increasing the number of sampled bits to 3 and 5, but even this did not considerably increase the number of LLCs found by CRASHFINDER. This is because most of the missed errors can only be reproduced by injecting into very specific bit positions, and finding these positions would require near-exhaustive injections on the words found by CRASHFINDER DYNAMIC, which would prohibitively increase the time taken to complete the selective fault injection phase. Therefore, we choose to retain the heuristic as it is, especially because the difference in recall between CRASHFINDER STATIC and CRASHFINDER is only 2.33%.

With that said, the heuristic-based approach used here is an approximation, and hence there may be multiple sources of inaccuracy in these heuristics. We will further quantify the limits of the heuristic-based approach in future work.

4.8 Discussion

In this section, we discuss some of the implications of CRASHFINDER for selective protection and checkpointing. We also discuss some of the limitations of CRASHFINDER and possible improvements.

4.8.1 Implication for Selective Protection

One of the main results from evaluating CRASHFINDER is that a very small number of instructions are responsible for most of the LLCs in the program. As per Table 4.2, only 0.89% of static instructions are responsible for more than 90% of the LLC-causing errors in the program (based on the recall of CRASHFINDER). Further, CRASHFINDER is able to precisely pinpoint these instructions, thereby allowing them to be selectively protected.

An example of a selective protection technique is value range checking in software [64]. A range check is typically inserted after the instruction that produces the data item to be checked. For example, the assertion(ptr_address < 0x001b, true) inserted after the static instruction producing ptr_address will check the value of the variable whenever the instruction is executed. Since the total number of executions of all such LLC-causing instructions is only 0.385% of dynamic instructions (Table 4.2), the overhead of these checks is likely to be extremely low.
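As an illustration of such a range check, the fragment below shows a low-cost detector that could be emitted after an LLC-prone, pointer-producing instruction; if the value falls outside the range expected in fault-free runs, the program stops immediately instead of crashing much later. The macro name, bounds and recovery action are our own assumptions, not CRASHFINDER's or the cited work's output format.

    #include <cstdio>
    #include <cstdlib>

    // Illustrative range-check detector: stop (or trigger rollback to a clean
    // checkpoint) as soon as a protected value leaves its expected range,
    // bounding the crash latency instead of letting the corruption propagate.
    #define RANGE_CHECK(val, lo, hi)                                        \
        do {                                                                \
            if ((val) < (lo) || (val) > (hi)) {                             \
                std::fprintf(stderr, "range check failed: %s\n", #val);     \
                std::abort(); /* or roll back to the last clean checkpoint */ \
            }                                                               \
        } while (0)

    // Example use after an instruction that produces an LLC-prone value
    // (heap_lo and heap_hi are hypothetical bounds learned for the program):
    //   RANGE_CHECK(ptr_address, heap_lo, heap_hi);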
We will explore this direction in future work.

4.8.2 Implication for Checkpointing Techniques

Our study also has implications for the feasibility of fine-grained checkpointing techniques for programs, as such checkpointing techniques would incur frequent state corruptions in the presence of LLCs. For example, Chandra et al. [30] found that the frequency of checkpoint corruption when using a fine-grained checkpointing technique ranges between 25% and 40% due to LLCs. They therefore conclude that one should not use such fine-grained checkpointing techniques, and should instead use application-specific coarse-grained checkpointing, in which the corresponding probability of checkpoint corruption is 1% to 19%. However, by deploying our technique and selectively protecting the LLC-causing locations in the program, one could restrict the crash latency, thus minimizing the chances of checkpoint corruption. Based on the 90% recall of CRASHFINDER, we can achieve a 10-fold reduction in the number of LLC-causing locations, thus bringing down the checkpoint corruption probability of fine-grained checkpointing. This would make fine-grained checkpointing feasible, thus allowing faster recovery from errors. This is also a direction we plan to explore in the future.

4.8.3 Limitations and Improvements

One of the main limitations of CRASHFINDER is that it takes a long time (on average, 4 days) to find the LLC-causing errors in a program. The bulk of this time is taken by the selective fault injection phase, which has to inject faults into thousands of dynamic instances found by CRASHFINDER DYNAMIC to determine if they are LLCs. While this is still orders of magnitude faster than performing exhaustive fault injections, it is a relatively high one-time cost to protect the program. One way to speed this up would be to parallelize it, but that comes at the cost of increased computational resources.

An alternative way to speed up the technique is to improve the precision of CRASHFINDER STATIC. As it stands, CRASHFINDER STATIC takes only a few seconds to analyze even large programs and find LLC-causing locations in them. The main problem, however, is that CRASHFINDER STATIC has a very low precision (25.4%). This may be acceptable in some cases, where we can protect a few more locations and incur higher overheads in doing so. Even with this overprotection, we would still only protect less than 6% of the program's dynamic instructions (Table 4.2). However, one can improve the precision further by finding all possible aliases and control-flow paths at compile time [117], and filtering out the patterns that are unlikely to cause LLCs.

Another limitation is that the recall of CRASHFINDER is only about 90%. Although this is still a significant recall, one can improve it further by (1) building a more comprehensive static analyzer to cover the cases that do not belong to the dominant LLC-causing patterns, and (2) improving the heuristic used in the selective fault injection phase by increasing the number of fault injections, albeit at the cost of increased performance overheads (as we found in RQ4, this heuristic was responsible for most of the difference in recall between CRASHFINDER and CRASHFINDER STATIC).

Finally, though the benchmark applications are chosen from a variety of domains such as scientific computing, multimedia, statistics and games, there are other domains that are not covered, such as database programs or system software. Further, they are all single-node applications.
We defer the extension of CRASHFINDER to distributed applications to future work.

4.9 Summary

In this chapter, we identified an important but neglected problem in the design of dependable software systems, namely identifying faults that propagate for a long time before causing crashes, or LLCs. Unlike prior work, which has only performed a coarse-grained analysis of such faults, we performed a fine-grained characterization of LLCs. Interestingly, we find that there are only three code patterns in the program that are responsible for almost all LLCs, and that these patterns can be identified efficiently through static analysis. We built a static analysis technique to find these patterns, and augmented it with a dynamic analysis and selective fault-injection-based technique to filter out the false positives. We implemented our technique in a completely automated tool called CRASHFINDER. We find that CRASHFINDER is able to achieve nine orders of magnitude of speedup over exhaustive fault injections to identify LLCs, has no false positives, and successfully identifies over 90% of the LLC-causing locations in ten benchmark programs.

Chapter 5

Modeling Soft-Error Propagation in Programs

This chapter describes a fast and accurate modeling technique that quantitatively estimates the Silent Data Corruption (SDC) probabilities of a given program and of its individual instructions without any fault injections. We name our model TRIDENT. Different from the heuristic-based technique discussed in Chapter 4, the technique proposed in this chapter is an analytical model which systematically tracks error propagation over the entire propagation space of program executions. We first describe the challenges in identifying SDCs in programs before presenting the details of the model. We then design experiments to evaluate the accuracy and performance of the model. Finally, we discuss a use case in which developers can use the model to guide selective protection in programs.

5.1 Introduction

One consequence of transient hardware errors is incorrect program output, or silent data corruption (SDC), which is very difficult to detect and can hence have severe consequences [129]. Studies have shown that a small fraction of the program states are responsible for almost all the error propagation resulting in SDCs, and so one can selectively protect these states to meet the target SDC probability while incurring lower energy and performance costs than full duplication techniques [52, 130]. Therefore, in the development of fault-tolerant applications (Figure 5.1A), it is important to estimate the SDC probability of a program, both in the aggregate and on an individual-instruction basis, to decide whether protection is required, and if so, to selectively protect the SDC-causing states of the program. This is the goal of our work.

Fault injection (FI) has been commonly employed to estimate the SDC probabilities of programs. FI involves perturbing the program state to emulate the effect of a hardware fault and executing the program to completion to determine whether the fault caused an SDC. However, real-world programs may consist of billions of dynamic instructions, and even a single execution of the program may take a long time. Performing thousands of FIs to get statistically meaningful results for each instruction takes too much time to be practical [65, 66]. As a result, researchers have attempted to analytically model error propagation to identify vulnerable instructions [52, 97, 130].
The main advantage of these analytical models is scalability, as the models usually do not require FI, and they are fast to execute. However, most existing models suffer from a lack of accuracy, as they are limited to modeling faults on the normal (i.e., fault-free) control-flow path of the program. Since program execution is dynamic in nature, a fault can propagate not only to the data dependencies of an instruction, but also to the subsequent branches (i.e., control flow) and memory locations that depend on it. This causes deviation from the predicted propagation, leading to inaccuracies. Unfortunately, tracking the deviation in control flow and memory locations due to a fault often leads to state-space explosion.

This chapter proposes a model, TRIDENT, for tracking error propagation in programs that addresses the above two challenges. The key insight in TRIDENT is that error propagation in dynamic execution can be decomposed into a combination of individual modules, each of which can be abstracted into probabilistic events. TRIDENT can predict both the overall SDC probability of a program and the SDC probability of individual instructions based on dynamic and static analysis of the program, without performing FI. We implement TRIDENT in the LLVM compiler [85] and evaluate its accuracy and scalability vis-a-vis FI. To the best of our knowledge, we are the first to propose a model to estimate the SDC probability of individual instructions and of the entire program without performing any FIs.

Figure 5.1: Development of fault-tolerant applications

Our main contributions in this chapter are as follows:

• Propose TRIDENT, a three-level model for tracking error propagation in programs. The levels are the static-instruction, control-flow and memory levels, and they build on each other. The three-level model abstracts the data flow of programs in the presence of faults.

• Compare the accuracy and scalability of TRIDENT with FI in predicting the SDC probability of individual instructions and that of the entire program.

• Demonstrate the use of TRIDENT to guide selective instruction duplication for configurable protection of programs from SDCs under a given performance overhead.

The results of our experimental evaluation are as follows:

• The predictions of SDC probabilities using TRIDENT are statistically indistinguishable from those obtained through FI, both for the overall program and for individual instructions. On average, the overall SDC probability predicted by TRIDENT is 14.83%, while the FI-measured value is 13.59%, across 11 programs.

• We also create two simpler models to show the importance of modeling control-flow divergence and memory dependencies: the first model considers neither, while the second considers control-flow divergence but not memory dependencies. The two simpler models predict the average SDC probabilities across programs as 33.85% and 23.76%, respectively, which is much higher than the FI results.
For example, TRIDENT takes about 16 minutes tocalculate the individual SDC probabilities of about 1,000 static instructions,which is significantly faster than the corresponding FI experiments (whichoften take hours or even days).• Using TRIDENT to guide selective instruction duplication reduces the over-all SDC probability by 65% and 90% at 11.78% and 23.31% performanceoverheads, respectively (these represent 1/3rd and 2/3rd of the full-duplicationoverhead for the programs respectively). These reductions are higher thanthe corresponding ones obtained using the simpler models.5.2 The ChallengeWe use the code example in Figure 5.2A to explain the main challenge of modelingerror propagation in programs. The code is from Pathfinder [33], though we makeminor modifications for clarity and remove some irrelevant parts. The figure showsthe control-flow graphs (CFGs) of two functions: init() and run(). There is a loopin each function: the one in init() updates an array, and the one in run() readsthe array for processing. The two functions init() and run() are called in order atruntime. In the CFGs, each box is a basic block and each arrow indicates a possibleexecution path. In each basic block, there is a sequence of statically data-dependentinstructions, or a static data-dependent instruction sequence.Assume that a fault occurs at the instruction writing to $1 in the first basic blockin init(). The fault propagates along its static data-dependent instruction sequence(from load to cmp). At the end of the sequence, if the fault propagates to theresult of the comparison instruction, it will go beyond the static data dependencyand cause the control-flow of the program to deviate from the fault-free execution.For example, in the fault-free execution, the T branch is supposed to be taken, but52Figure 5.2: Running Exampledue to the fault, the F branch is taken. Consequently, the basic blocks under theT branch including the store instruction will not be executed, whereas subsequentbasic blocks dominated by the F branch will be executed. This will load the wrongvalue in run(), and hence the fault will continue to propagate and it may reach theprogram’s output resulting in an SDC.We identify the following three challenges in modeling error propagation: (1)Statically modeling error propagation in dynamic program execution requires amodel that abstracts the program data-flow in the presence of faults. (2) Due to therandom nature of soft errors, a fault may be activated at any dynamic branch andcause control-flow divergence in execution from the fault-free execution. In any di-vergence, there are numerous possible execution paths the program may take, andtracking all of these paths is challenging. One can emulate all possible paths amongthe dynamic executions at every dynamic branch and figure out which fault prop-agates where in each case. However, this rapidly leads to state space explosion.(3) Faults may corrupt memory locations and hence continue to propagate throughmemory operations. 
Faulty memory values can be read by (multiple) load instruc-tions at runtime and written to other memory locations as execution progresses.There are enormous numbers of store and load instructions in a typical programexecution, and tracing error propagations among these memory dependencies re-quires constructing a huge data dependency graph, which is very expensive.As we can see in the above example, if we do not track error propagations be-53yond the static data dependencies and instead stop at the comparison instruction,we may not identify all the cases that could lead to SDCs. Moreover, if control-flowdivergence is ignored when modeling, tracking errors in memory is almost impos-sible, as memory corruptions often hide behind control-flow divergence, as shownin the above example. Existing modeling techniques capture neither of these im-portant cases, and their SDC prediction accuracies suffer accordingly. In contrast,TRIDENT captures both the control-flow divergences and the memory corruptionsthat potentially arise as a result of the divergence.5.3 TRIDENTIn this section, we first introduce the inputs and outputs of our proposed model,TRIDENT, and then present the overall structure of the model and the key in-sights it leverages. Finally we present the details of TRIDENT using the runningexample.5.3.1 Inputs and OutputsThe workflow of TRIDENT is shown in Figure 5.1B. We require the user to supplythree inputs: (1) The program code compiled to the LLVM IR, (2) a program inputto execute the program and obtain its execution profile (similar to FI methods,we also require a single input to obtain runtime information), and (3) the outputinstruction(s) in the program that are used for determining if a fault resulted inan SDC. For example, the user can specify printf instructions that are responsiblefor the program’s output and used to determine SDCs. On the other hand, printfsthat log debugging information or statistics about the program execution can beexcluded as they do not typically determine SDCs. Without this information, allthe output instructions are assumed to determine SDCs by default.TRIDENT consists of two phases: (1) Profiling and (2) inferencing. In theprofiling phase, TRIDENT executes the program, performing dynamic analysisof the program to gather information such as the count and data dependency ofinstructions. After collecting all the information, TRIDENT starts the inferencingphase which is based on static analysis of the program. In this phase, TRIDENTautomatically computes (1) the SDC probabilities of individual instructions, and54(2) the overall SDC probability of the program. In the latter case, the user needsto specify the number of sampled instructions when calculating the overall SDCprobability of the program, in order to balance the time for analysis with accuracy.5.3.2 Overview and InsightsBecause error propagation follows program data-flow at runtime, we need to modelprogram data-flow in the presence of faults at three levels: (1) Static-instructionlevel, which corresponds to the execution of a static data-dependent instruction se-quence and the transfer of results between registers. (2) Control-flow level, whenexecution jumps to another program location. (3) Memory level, when the re-sults need to be transferred back to memory. TRIDENT is divided into threesub-models to abstract the three levels, respectively, and we use fs , fc and fm torepresent them. 
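Before presenting the details, the way the three sub-models compose can be illustrated with a short sketch. This is only an illustration of the structure formalized in Algorithm 1 below; the function and field names are ours, not TRIDENT's actual interfaces, and fs, fc and fm are assumed to be callables produced by the profiling phase.

def sdc_probability(fault_site, fs, fc, fm):
    # fs: propagation along the static data-dependent instruction sequence.
    p_s, sequence_end = fs(fault_site)
    if sequence_end.kind == "branch":
        # fc: which stores are corrupted, and with what probability, if the
        # branch direction is flipped; fm: from each corrupted store to the output.
        p = sum(p_c * fm(store) for store, p_c in fc(sequence_end))
        return min(1.0, p_s * p)
    elif sequence_end.kind == "store":
        # The fault reaches memory directly through the store.
        return p_s * fm(sequence_end)
    else:
        # The sequence ends at a program output instruction.
        return p_s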
The main algorithm of TRIDENT tracking error propagation froma given location to the program output is summarized in Algorithm 1.Static-Instruction Sub-Model ( fs ): First, fs is used to trace error propagation ofan arbitrary fault activated on a static data-dependent instruction sequence. It de-termines the propagation probability of the fault from where it was activated to theend of the sequence. For example, in Figure 5.2B, the model computes the proba-bility of the fault propagating to the result of the comparison instruction given thatthe fault is activated at the load instruction (Line 4 in Algorithm 1). Previous mod-els trace error propagation in data dependant instructions based on the dynamicdata dependency graph (DDG) which records the output and operand values ofeach dynamic instruction in the sequence [51, 130]. However, such detailed DDGsare very expensive to generate and process, and hence the models do not scale. fsavoids generating detailed dynamic traces and instead computes the propagationprobability of each static instruction based on its average case at runtime to deter-mine the error propagation in a static data-dependent instruction sequence. Sinceeach static instruction is designed to manipulate target bits in a pre-defined way,the propagation probability of each static instruction can be derived. We can thenaggregate the probabilities to calculate the probability of a fault propagating froma given instruction to another instruction within the same static data-dependentinstruction sequence.55Control-Flow Sub-Model ( fc ): As explained, a fault may propagate to branchesand cause the execution path of the program to diverge from its fault-free execu-tion. We divide the propagation into two phases after divergence: The first phase,modeled by fc , attempts to figure out which dynamic store instructions will becorrupted at what probabilities if a conditional branch is corrupted (Lines 3-5 inAlgorithm 1). The second phase traces what happens if the fault propagates tomemory, and is modeled by fm . The key observation is that error propagation tomemory through a conditional branch that leads to control-flow divergence can beabstracted into a few probabilistic events based on branch directions. This is be-cause the probabilities of the incorrect executions of store instructions are decidedby their execution paths and the corresponding branch probabilities. For exam-ple, in the function init() in Figure 5.2A, if the comparison instruction takes theF branch, the store instruction is not supposed to be executed, but if a fault mod-ifies the direction of the branch to the T branch, then it will be executed and leadto memory corruption. A similar case occurs where the comparison instruction issupposed to take the T branch. Thus, the store instruction is corrupted in eithercase.Memory Sub-Model ( fm ): fm tracks the propagation from corrupted store in-structions to the program output, by tracking memory dependencies of erroneousvalues until the output of the program is reached. During the tracking, other sub-models are recursively invoked where appropriate. fm then computes the propaga-tion probability from the corrupted store instruction to the program output (Lines7-9 in Algorithm 1). A memory data-dependency graph needs to be generated fortracing propagations at the memory level because we have to know which dynamicload instruction reloads the faulty data previously written by an erroneous store in-struction (if any). 
This graph can be expensive to construct and traverse due to the huge number of dynamic store and load instructions in the program. However, we find that the graph can be pruned by removing redundant dependencies between symmetric loops, if there are any. Consider as an example the two loops in init() and run() in Figure 5.2A. The first loop updates an array, and the second one reads from the same array. Thus, there is a memory dependence between every pair of iterations of the two loops. In this case, instead of tracking every dependency between dynamic instructions, we only track the aggregate dependencies between the two loops. As a result, the memory dependence graph needs only two nodes to capture the dependencies between the stores and loads in their iterations.

Algorithm 1: The Core Algorithm in TRIDENT
1 sub-models fs, fc, and fm;
Input: I: Instruction where the fault occurs
Output: PSDC: SDC probability
2 ps = fs(I);
3 if inst. sequence containing I ends with branch Ib then
4     // Get the list of stores corrupted and their prob.
5     [<Ic, pc>, ...] = fc(Ib);
6     // Maximum propagation prob. is 1
7     Foreach(<Ic, pc>): PSDC += ps * pc * fm(Ic);
8 else if inst. sequence containing I ends with store Is then
9     PSDC = ps * fm(Is);

5.3.3 Details: Static-Instruction Sub-Model (fs)

Once a fault is activated at an executed instruction, it starts propagating along its static data-dependent instruction sequence. Each sequence ends with a store, a comparison, or a program output instruction. In these sequences, the probability that each instruction masks the fault during the propagation can be determined by analyzing the mechanism and operand values of the instruction. This is because instructions often manipulate target bits in predefined ways.

Given a fault that occurs and is activated on an instruction, fs computes the probability of error propagation when the execution reaches the end of the static computation sequence of the instruction. We use the code example in Figure 5.2B to explain the idea. The code is from Pathfinder [33], and shows a counter being incremented until a positive value is reached. In Figure 5.2B, INDEX 1-3 form a static data-dependent instruction sequence, along which an error may propagate. Assuming a fault is activated at INDEX 1 and affects $1, the goal of fs is to determine the probabilities of propagation, masking and crash after the execution of INDEX 3, which is the last instruction in the sequence. fs traces the error propagation from INDEX 1 to INDEX 3 by aggregating the propagation probability of each instruction in the sequence. We use a tuple for each instruction to represent its probabilities, shown in the brackets to the right of each instruction in Figure 5.2B. The three numbers in each tuple are the probabilities of propagation, masking and crash, respectively, given that an operand of the instruction is erroneous (we explain how to compute these later). For example, for INDEX 3, (0.03, 0.97, 0) means that the probability of the error continuing to propagate when INDEX 3 is corrupted is 0.03, whereas 0.97 is the probability that the error will be masked and not propagate beyond INDEX 3. Finally, the probability of a crash at INDEX 3, in this case, is 0. Note that the probabilities in each tuple should sum to 1.

After calculating the individual probabilities, fs aggregates the propagation probability in each tuple of INDEX 1, 2 and 3 to calculate the propagation probability from INDEX 1 to INDEX 3.
That is given by 1*1*0.03=3% for the probabilityof propagation, and the probabilities of masking and crash are 97% and 0% respec-tively. Thus, if a fault is activated at INDEX 1, there is a 3% of probability that thebranch controlled by INDEX 3 will be flipped, causing a control-flow divergence.We now explain how to obtain the tuple for each instruction. Each tuple is ap-proximated based on the mechanism of the instruction and/or the profiled values ofthe instruction’s operands. We observe that there are only a few types of instruc-tions that have non-negligible masking probabilities: they are comparisons (e.g.,CMP), logic operators (e.g., XOR) and casts (e.g., TRUNC). We assume the rest ofinstructions neither move nor discard corrupted bits - this is a heuristic we use forsimplicity (we discuss its pros and cons in Section 6.7.1).In the example in Figure 5.2B, the branch direction will be modified basedon whether INDEX 3 computes a positive or negative value. In either case, onlya flip of the sign bit of $1 will modify the branch direction. Hence, the errorpropagation probability in the tuple of INDEX 3 is 1/32 = 0.03, assuming a 32-bitdata width. We derive crash probabilities in the tuples for instructions accessingmemory (i.e., load and store instructions). We consider crashes that are caused byprogram reading or writing out-of-bound memory addresses. Their probabilitiescan be approximated by profiling memory size allocated for the program (this isfound in the /proc/ filesystem in Linux). Prior work [51] has shown that these arethe dominant causes of crashes in programs due to soft errors.58Figure 5.3: NLT and LT Examples of the CFG5.3.4 Details: Control-Flow Sub-Model ( fc )Recall that the goal of fc is to figure out which dynamic store instructions willbe corrupted and at what probabilities, if a conditional branch is corrupted. Weclassify all comparison instructions that are used in branch conditions into twotypes based on whether they terminate a loop. The two types are (1) Non-Loop-Terminating cmp (NLT), and (2) Loop-Terminating cmp (LT). Figure 5.3 showstwo Control Flow Graphs (CFGs), one for each case. We also profile the branchprobability of each branch and mark it beside each corresponding branch for ouranalysis purpose. For example, if a branch probability is 0.2, it means duringthe execution there is 20% probability the branch is taken. We will use the twoexamples in Figure 5.3 to explain fc in each case.Non-Loop-Terminating CMP (NLT)If a comparison instruction does not control the termination of a loop, it is NLT.In Figure 5.3A, INDEX 1 is a NLT, dominating a store instruction in bb4. Thereare two cases for the store considered as being corrupted in fc : (1) The store isnot executed while it should be executed in a fault-free execution. (2) The store isexecuted while it should not be executed in a fault-free execution. Combining thesecases, the probability of the store instruction being corrupted can be represented by59Equation 5.1.Pc = Pe /Pd (5.1)In the equation, Pc is the probability of the store being corrupted, Pe is theexecution probability of the store instruction in fault-free execution, and Pd is thebranch probability of which direction dominates the store.We illustrate how to derive the above equation using the example in Figure 5.3A.There are two legal directions a branch can take. In the first case, the branch ofINDEX 1 is supposed to take the T branch at the fault-free execution (20% proba-bility), but the F branch is taken instead due to the corrupted INDEX 1. 
The storeinstruction in bb4 will be executed when it is not supposed to be executed and willhence be corrupted. The probability that the store instruction is executed in thiscase is calculated as 0.2 ∗ 0.9 ∗ 0.7 = 0.126 based on the probabilities on its exe-cution path (bb0-bb1-bb3-bb4). In the second case, if the F branch is supposedto be taken in a fault-free execution (80% probability), but the T branch is takeninstead due to the fault, the store instruction in bb4 will not be executed, while itis supposed to have been executed in some execution path in the fault-free execu-tion under the F branch. For example, in the fault-free execution, path bb0-bb1-bb3-bb4 will trigger the execution of the store. Therefore, the probability of thestore instruction being corrupted in this case is 0.8∗0.9∗0.7 = 0.504. Therefore,adding the two cases together, we get fc in this example as 0.126+0.504 = 0.63.The Equation 5.1 is simplified by integrating the terms in the calculations. In thisexample, in Equation 5.1, Pe is 0.8∗0.9∗0.7 (bb0-bb1-bb3-bb4), Pd is 0.8 (bb0-bb1), thus Pc is 0.8 ∗ 0.9 ∗ 0.7/0.8 = 0.63. Note that if the branch immediatelydominates the store instruction, then the probability of the store being corrupted is1, as shown by the example in Figure 5.2.Loop-Terminating CMP (LT)If a comparison instruction controls the termination of a loop, it is LT. For example,in Figure 5.3B, the back-edge of bb0 forms a loop, which can be terminated by thecondition computed by INDEX 2. Hence, INDEX 2 is a LT. We find that the proba-bility of the store instruction being corrupted can be represented by Equation. 5.2.Pc = Pb ∗Pe (5.2)60Pc is the probability that a dynamic store instruction is corrupted if the branchis modified, Pb is the execution probability of the back-edge of the branch, and Peis the execution probability of the store instruction dominated by the back-edge.We show the derivation of the above equation using the example in Figure 5.3B.In the first case, if the T branch (the loop back-edge) is supposed to be taken in afault-free execution (99% probability), the store instruction in bb4 may or maynot execute, depending on the branch in bb2. But if a fault modifies the branchof INDEX 2, the store will certainly not execute. So we need to omit the prob-abilities that the store is not executed in the fault-free execution to calculate thecorruption probability of the store. They are 0.99 ∗ 0.9 ∗ 0.3 = 0.27 for the pathbb0-bb1-bb2-bb3 and 0.99 ∗ 0.1 = 0.099 for bb0-bb1-bb0. Hence, the probabil-ity of a corrupted store in this case is 0.99− 0.27− 0.099 = 0.62. In the secondcase where the F branch should be taken in a fault-free execution (1% probability),if the fault modifies the branch, the probability of a corrupted store instruction is0.01∗0.9∗0.7 = 0.0063. Note that this is usually a very small value which can beignored. This is because the branch probabilities of a loop-terminating branch areusually highly biased due to the multiple iterations of the loop. So the total prob-ability in this example is approximated to be 0.62, which is what we calculatedabove. Equation 5.2 is simplified by integrating and cancelling out the terms in thecalculations. In this example, Pb is 0.99 (bb0-bb1), Pe is 0.7*0.9 (bb1-bb2-bb4),and thus Pc is 0.99∗0.7∗0.9 = 0.62.5.3.5 Details: Memory Sub-Model ( fm )Recap that fm reports the probability for the error to propagate from the corruptedmemory locations to the program output. 
The idea is to represent memory datadependencies between the load and store instructions in an execution, so that themodel can trace the error propagation in the memory.We use the code example in Figure 5.4A to show how we prune the size of thememory dependency graph in fm by removing redundant dependencies (if any).There are two inner loops in the program. The first one executes first, storing datato an array in memory (INDEX 1). The second loop executes later, loading the datafrom the memory (INDEX 2). Then the program makes some decision (INDEX 3)and decides whether the data should be printed (INDEX 4) to the program output.61Figure 5.4: Examples for Memory Sub-modelNote that the iterations between loops are symmetric in the example, as both ma-nipulate the same array (one updates, and the other one reloads). This is often seenin programs because they tend to manipulate data in blocks due to spatial local-ity. In this example, if one of the dynamic instructions of INDEX 1 is corrupted,one of the dynamic instructions of INDEX 2 must be corrupted too. Therefore,instead of having one node for every dynamic load and store in the iterations of theloop executions, we need only two nodes in the graph to represent the dependen-cies. The rest of the dependencies in the iterations are redundant, and hence canbe removed from the graph as they share the same propagation. The dependenciesbetween dynamic loads and stores are tracked at runtime with their static indicesand operand memory addresses recorded. The redundant dependencies are prunedwhen repeated static load and store pairs are detected.We show the memory data dependency graph of fm for the code example inFigure 5.4B. Assume each loop is invoked once with many iterations. We cre-ate a node for the store (INDEX 1), load (INDEX 2) and printf (INDEX 3, asprogram output) in the graph. We draw an edge between nodes to present theirdependencies. Because INDEX 3 may cause divergence of the dependencies andhence error propagation, we weight the propagation probability based on its ex-62ecution probability. We place a NULL node as a placeholder indicating maskingif F branch is taken in INDEX 3. Note that an edge between nodes may alsorepresent a static data-dependent instruction sequence, e.g., the edge between IN-DEX 2 and INDEX 4. Therefore, fs is recursively called every time a static data-dependent instruction sequence is encountered. We then aggregate the propaga-tion probabilities starting from the node of INDEX 1 to each leaf node in thegraph. Each edge may have different propagation probabilities to aggregate –it depends on what fs outputs if a static data-dependent instruction sequence ispresent on the edge. In this example, assume that fs always outputs 1 as the prop-agation probability for each edge. Then, the propagation probability to the pro-gram output (INDEX 4), if one of the store (INDEX 1) in the loop is corrupted, is1∗1∗1∗0.6/(0.4+0.6)+1∗1∗0∗0.4/(0.4+0.6) = 0.6. The zero in the secondterm represents the masking of the NULL node. As an optimization, we memoizethe propagation results calculated for store instructions to speed up the algorithm.For example, if later the algorithm encounters INDEX 1, we can use the memo-ized results, instead of recomputing them. 
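The traversal of the pruned memory dependency graph, with memoization over store nodes, can be sketched as follows for the Figure 5.4B example. The node names, edge probabilities and the flat representation of the graph are illustrative only; in the real model, the per-edge propagation probabilities come from recursive calls into fs.

from functools import lru_cache

# Each edge carries (successor, execution probability, propagation probability
# along the static data-dependent sequence on that edge).
graph = {
    "store_INDEX1": [("load_INDEX2", 1.0, 1.0)],
    "load_INDEX2": [("printf_INDEX4", 0.6, 1.0),   # T branch of INDEX 3 reaches the output
                    ("NULL", 0.4, 1.0)],           # F branch: the error is masked
    "printf_INDEX4": [],                           # program output: counts as an SDC
    "NULL": [],                                    # masking placeholder
}

@lru_cache(maxsize=None)
def fm(node):
    if node == "printf_INDEX4":
        return 1.0
    if node == "NULL":
        return 0.0
    return sum(p_exec * p_seq * fm(succ) for succ, p_exec, p_seq in graph[node])

# fm("store_INDEX1") evaluates to 1.0 * 1.0 * (0.6 * 1.0 + 0.4 * 0.0) = 0.6,
# matching the propagation probability computed in the example above.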
We will evaluate the effectiveness of thepruning in Section 5.4.3.Floating Point: When we encounter any floating point data type, we apply anadditional masking probability based on the output format of the floating point data.For example, in benchmarks such as Hotspot, the float data type is used. By default,Float carries 7-digit precision, but in (many) programs’ output, a “%g” parameteris specified in printf which prints numbers with only 2-digit precision. Based onthe specification of IEEE-754 [6], we assume that only the mantissa bits (23 bitsin Float) may affect the 5 digits that are cut off in the precision. This is becausebit-flips in exponential bits likely cause large deviations in values, and so cutting-off the 5 digits in the precision is unlikely to mask the errors in the exponent. Wealso assume that each mantissa bit has equal probability to affect the missing 5digits of precision. In that way, we approximate the propagation probability to be((32-23)+23*(2/7))/32 = 48.66%. We apply this masking probability on top thepropagation probabilities, for Float data types used with the non-default format ofprintf.635.4 EvaluationIn this section, we evaluate TRIDENT in terms of its accuracy and scalability.To evaluate accuracy, we use TRIDENT to predict overall SDC probabilities ofprograms as well as the SDC probabilities for individual instructions, and comparethem with those obtained using FI and the simpler models. To evaluate scalability,we measure the time for executing TRIDENT, and compare it with the time takenby FI. We first present the experimental setup and then the results. We also makeTRIDENT and the experimental data publicly available1.5.4.1 Experimental SetupBenchmarksWe choose eleven benchmarks from common benchmark suites [18, 33, 68], andpublicly available scientific programs [8, 79, 136] — they are listed in Table 7.1.Our benchmark selection is based on three criteria: (1) Diversity of domains andbenchmark suites, (2) whether we can compile with our LLVM infrastructure, and(3) whether fault injection experiments of the programs can finish within a rea-sonable amount of time. We compiled each benchmark with LLVM with standardoptimizations (-O2).FI MethodWe use LLFI [147] which is a publicly available open-source fault injector to per-form FIs at the LLVM IR level on these benchmarks. LLFI has been shown to beaccurate in evaluating SDC probabilities of programs compared to assembly codelevel injections [147]. We inject faults into the destination registers of the executedinstructions to simulate faults in the computational elements of the processor as perour fault model. Further, we inject single bit flips as these are the de-facto modelfor emulating soft errors at the program level, and have been found to be accuratefor SDCs [121]. There is only one fault injected in each run, as soft errors are rareevents with respect to the time of execution of a program. Our FI method ensuresthat all faults are activated, i.e., read by an instruction of the program, as we define1https://github.com/DependableSystemsLab/Trident64Table 5.1: Characteristics of BenchmarksBenchmark Suite/Author Area Program InputLibquantum SPEC Quantum comput-ing33 5Blackscholes Parsec Finance in 4.txtSad Parboil Video encoding. 
reference.binframe.binBfs Parboil Graph traversal graph input.datHercules Carnegie Mel-lon UniversityEarthquake simula-tionscan sim-ple case.eLulesh Lawrence Liv-ermore NationalLaboratoryHydrodynamicsmodeling-s 1 -pPuReMD Purdue Univer-sityReactive molec-ular dynamicssimulationgeo ffield con-trolNw Rodinia DNA sequence op-timization2048 10 1Pathfinder Rodinia Dynamic program-ming1000 10Hotspot Rodinia Temperature andpower simulation64 64 1 1temp 64power 64Bfs Rodinia Graph traversal graph4096.txtSDC probabilities based on the activated instructions (Section 3.2). The FI methodis in line with other papers in the area [12, 12, 51, 64, 82].5.4.2 AccuracyWe design two experiments to evaluate the accuracy of TRIDENT. The first ex-periment examines the prediction of overall SDC probabilities of programs, andthe second examines predicted SDC probabilities of individual instructions. In theexperiments, we compare the results derived from TRIDENT with those from thetwo simpler models and FI. As described earlier, TRIDENT consists of three sub-models in order: fs , fc and fm . We create two simpler models to (1) understandthe accuracy gained by enabling each sub-model and (2) as a proxy to investigate65other models, which often lack modeling beyond static data dependencies (Sec-tion 5.6.3 performs a more detailed comparison with prior work). We first disablefm in TRIDENT, leaving the two sub-models fs and fc enabled, to create a model:fs + fc . We then further remove fc to create the second simplified model whichonly has fs enabled, which we represent as fs .Overall SDC probabilityFigure 5.5: Overall SDC Probabilities Measured by FI and Predicted by theThree Models (Margin of Error for FI:±0.07% to±1.76% at 95% Con-fidence)To evaluate the overall SDC probability of a given program, we use statisticalFI. We measure error bars for statistical significance at the 95% confidence level.We randomly sample 3,000 dynamic instructions for FIs (one fault per run) as theseyield tight error bars at the 95% confidence level (±0.07% to ±1.76%) - this is inline with other work that uses FI. We calculate SDC probability of each programbased on how many injected faults result in SDC. We then use TRIDENT, aswell as the two simpler models, to predict the SDC probability of each program,and compare the results with those from FI. To ensure fair comparison, we sample3,000 instructions in our models as well (Section 5.3.1).The results are shown in Figure 5.5. We use FI to represent the FI method,TRIDENT for our three-level model, and fs+fc and fs for the two simpler mod-els. We find TRIDENT prediction matches the overall SDC probabilities obtainedthrough FI, with a maximum difference of 14.26% in Sad, and a minimum dif-ference of 0.11% in Blackscholes, both in percentage points. This gives a meanabsolute error of 4.75% in overall SDC prediction. On the other hand, fs + fc andfs have a mean absolute error of 19.56% and 15.13% respectively compared to FI –more than 4 and 3 times higher than those obtained using the complete three-levelmodel. On average, fs + fc and fs predict the overall SDC probability as 33.85%66and 23.76% across the different programs, whereas TRIDENT predicts it to be14.83%. The SDC probability obtained from FI is 13.59%, which is much more inline with the predictions of TRIDENT.We observe that in Sad, Lulesh and Pathfinder, TRIDENT encounters rela-tively larger differences between the prediction and the FI results (14.26%, 7.48%and 8.87% respectively). 
The inaccuracies are due to a combination of gaps in theimplementation, assumptions, and heuristics we used in TRIDENT. We discussthem in Section 5.6.1.To compare the results more rigorously, we use a paired T-test experiment [134]to determine how similar the predictions of the overall SDC probabilities by TRI-DENT are to the FI results.2 Since we have 11 benchmarks, we have 11 sets ofpaired data with one side being FI results and the other side being the predictionvalues of TRIDENT. The null hypothesis is that there is no statistically signifi-cant difference between the results from FIs and the predicted SDC probabilitiesby TRIDENT in the 11 benchmarks. We calculate the p-value in the T-test as0.764. By the conventional criteria (p-value>0.05), we fail to reject the null hy-pothesis, indicating that the predicted overall SDC probabilities by TRIDENT arenot statistically different from those obtained by FI.We find that the model fs + fc always over-predicts SDCs compared with TRI-DENT. This is because an SDC is assumed once an error propagates to store in-structions, which is not always the case, as it may not propagate to the programoutput. On the other hand, fs may either over-predict SDCs (e.g., Libquantum,Hercules) because an SDC is assumed once an error directly hits any static data-dependent instruction sequence ending with a store, or under-predict them (e.g.,Bfs, Blackscholes) because error propagation is not tracked after control-flow di-vergence.SDC Probability of Individual InstructionsWe now examine the SDC probabilities of individual instructions predicted byTRIDENT and compare them to the FI results. The number of static instruc-2We have verified visually that the differences between the two sides of every pair are approxi-mately normally distributed in all the T-test experiments we conduct, which is the requirement forvalidity of the T-test.67tions per benchmark varies from 76 to 4,704, with an average of 944 instructions.Because performing FIs into each individual instruction is very time-consuming,we choose to inject 100 random faults per instruction to bound our experimentaltime. We then input each static instruction to TRIDENT, as well as the two sim-pler models ( fs + fc and fs ), to compare their predictions with the FI results. Asbefore, we conduct paired T-test experiments [134] to measure the similarity (ornot) of the predictions to the FI results. The null hypothesis for each of the threemodels in each benchmark is that there is no difference between the FI results andthe predicted SDC probability values in each instruction.Table 5.2: p-values of T-test Experiments in the Prediction of Individual In-struction SDC Probability Values (p > 0.05 indicates that we are not ableto reject our null hypothesis – the counter-cases are shown in bold)Benchmark TRIDENT fs+fc fsLibquantum 0.602 0.000 0.000Blackscholes 0.392 0.173 0.832Sad 0.000 0.003 0.000Bfs (Parboil) 0.893 0.000 0.261Hercules 0.163 0.000 0.003Lulesh 0.000 0.000 0.000PureMD 0.277 0.000 0.000Nw 0.059 0.000 0.000Pathfinder 0.033 0.130 0.178Hotspot 0.166 0.000 0.000Bfs (Rodinia) 0.497 0.001 0.126No. of rejections 3/11 9/11 7/11The p-values of the experiments are listed in the Table 5.2. At the 95% confi-dence level, using the standard criteria (p > 0.05), we are not able to reject the nullhypothesis in 8 out of the 11 benchmarks using TRIDENT in the predictions. Thisindicates that the predictions of TRIDENT are shown to be statistically indistin-guishable from the FI results in most of the benchmarks we used. 
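For reference, the paired test used in Table 5.2 can be reproduced with a few lines of SciPy. The sketch below is a simplified stand-in for our analysis scripts; it assumes the FI-measured and predicted SDC probabilities of the sampled instructions of one benchmark are available as two equal-length lists.

from scipy import stats

def compare_to_fi(fi_values, predicted_values, alpha=0.05):
    # Paired (dependent-samples) t-test over per-instruction SDC probabilities.
    t_stat, p_value = stats.ttest_rel(fi_values, predicted_values)
    # p > alpha: we cannot reject the null hypothesis that the mean difference
    # between the model's predictions and the FI results is zero.
    return p_value, p_value > alpha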
The three outliersfor TRIDENT again are Sad, Lulesh and Pathfinder. Again, even though the in-dividual instructions’ SDC probabilities predicted are statistically distinguishablefrom the FI results, these predicted values are still reasonably close to the FI results.68In contrast, when using fs + fc and fs to predict SDC probabilities for each individ-ual instruction, there are only 2 and 4 out of the 11 benchmarks having p-valuesgreater than 0.05, indicating that the null hypotheses cannot be rejected for mostof the benchmarks. In other words, the predictions from the simpler models forindividual instructions are (statistically) significantly different from the FI results.5.4.3 ScalabilityIn this section, we evaluate the scalability of TRIDENT to predict the overallSDC probabilities of programs and the SDC probabilities of individual instructions,and compare it to FI. By scalability, we mean the ability of the model to handlelarge numbers of instruction samples in order to obtain tighter bounds on the SDCprobabilities. In general, the higher the number of sampled instructions, the higherthe accuracy and hence the tighter are the bounds on SDC probabilities for a givenconfidence level (e.g., 95% confidence). This is true for both TRIDENT and forFI. The number of instructions sampled for FI in prior work varies from 1,000 [147]to a few thousands [51, 52, 90]. We vary the number of samples from 500 to 7,000.The number of samples is equal to the number of FI trials as one fault is injectedper trial.Note that the total computation is proportional to both the time and powerrequired to run each approach. Parallelization will reduce the time spent, but notthe power consumed. We assume there is no parallelization for the purpose ofcomparison in the case of TRIDENT and FI, though both TRIDENT and FI canbe parallelized. Therefore, the computation can be measured by the wall-clocktime.Overall SDC ProbabilityThe results of the time spent to predict the overall SDC probability of programare shown in Figure 5.6A. The time taken in the figure is projected based on themeasurement of one FI trial (averaged over 30 FI runs). As seen, the curve of FItime versus number of samples is much steeper than that of TRIDENT, which isalmost flat. TRIDENT is 2.37 times faster than the FI method at 1,000 samples,it is 6.7 times faster at 3,000 samples and 15.13 times faster at 7,000 samples.69Figure 5.6: Computation Spent to Predict SDC ProbabilityFrom 500 to 7,000 samples, the time taken by TRIDENT increases only 1.06times (0.2453 to 0.2588), whereas it increases 14 times (0.2453 to 3.9164) for FI- an exact linear increase. The profiling phase of TRIDENT takes 0.24 hours(or about 15 minutes) on average. This is a fixed cost incurred by TRIDENTregardless of the number of sampled instructions. However, once the model isbuilt, the incremental cost of calculating the SDC probability of a new instructionis minimal (we only calculate the SDC probabilities on demand to save time). FIdoes not incur a noticeable fixed cost, but its time rapidly increases as the numberof sampled instructions increase. This is because FI has to run the application fromscratch on each trial, and hence ends up being much slower than TRIDENT as thenumber of samples increase.Individual InstructionsFigure 5.6B compares the average time taken by TRIDENT to predict SDC prob-abilities of individual instructions with FI, for different numbers of static instruc-tions. 
We consider different numbers of samples for each static instruction chosenfor FI: 100, 500 and 1,000 (as mentioned in Section 5.3.1, TRIDENT does notneed samples for individual instructions’ SDC probabilities). We denote the num-ber of samples as a suffix for the FI technique. For example, FI-100 indicates 100samples are chosen for performing FI on individual instructions. We also vary the70number of static instructions from 50 to 7,000 (this is the X-axis). As seen fromthe curves, the time taken by TRIDENT as the number of static instructions varyremains almost flat. On average, it takes 0.2416 hours at 50 static instructions, and0.5009 hours at 7,000 static instructions, which is only about a 2X increase. Incomparison, the corresponding increases for FI-100 is 140X, which is linear withthe number of instructions. Other FI curves experience even steeper increases asthey gather more samples per instruction.Figure 5.7: Time Taken to Derive the SDC Probabilities of Individual Instruc-tions in Each BenchmarkFigure 5.7 shows the time taken by TRIDENT and FI-100 to derive the SDCprobabilities of individual instructions in each benchmark (due to space constraints,we do not show the other FI values, but the trends were similar). As can be seen,there is wide variation in the times taken by TRIDENT depending on the bench-mark program. For example, the time taken in PureMD is 2.893 hours, whereas itis 2.8 seconds in Pathfinder. This is because the time taken by TRIDENT dependson factors such as (1) the total number of static instructions, (2) the length of staticdata-dependent instruction sequence, (3) the number of dynamic branches that re-quire profiling, and (4) the number of redundant dependencies that can be pruned.The main reason for the drastic difference between PureMD and Pathfinder is thatwe can prune only 0.08% of the redundant dependencies in the former, while wecan prune 99.83% of the dependencies in the latter. On average, 61.87% of dy-namic load and store instructions are redundant and hence removed from the mem-ory dependency graph.71Figure 5.8: SDC Probability Reduction with Selective Instruction Dupli-cation at 11.78% and 23.31% Overhead Bounds (Margin of Error:±0.07% to ±1.76% at 95% Confidence)5.5 Use Case: Selective Instruction DuplicationIn this section, we demonstrate the utility of TRIDENT by considering a use-case of selectively protecting a program from SDC causing errors. The idea is toprotect only the most SDC-prone instructions in a program so as to achieve highcoverage while bounding performance costs. We consider instruction duplicationas the protection technique, as it has been used in prior work [51, 52, 97]. Theproblem setting is as follows: given a certain performance overhead P, what staticinstructions should be duplicated in order to maximize the coverage for SDCs whilekeeping the overhead below P.Solving the above problem involves finding the SDC probability of each in-struction in the program in order to decide which set of instructions should beduplicated. It also involves calculating the performance overhead of duplicatingthe instructions. We use TRIDENT for the former, namely, to estimate the SDCprobability of each instruction, without using FI. For the latter, we use the dy-namic execution count of each instruction as a proxy for the performance over-head incurred by it. 
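The selection step itself is formalized as a 0-1 knapsack in the next paragraph; the sketch below is one way to illustrate that formulation. The arrays sdc_prob (profits, estimated by TRIDENT) and dyn_count (costs, from the execution profile), as well as the bucketing of counts, are assumptions of this illustration rather than part of the tool.

def select_instructions(sdc_prob, dyn_count, budget):
    # Classic 0-1 knapsack by dynamic programming. dyn_count values are assumed
    # to be bucketed/scaled integers so that the DP table stays small.
    n = len(sdc_prob)
    dp = [0.0] * (budget + 1)                # dp[c]: best total profit with cost <= c
    taken = [[False] * (budget + 1) for _ in range(n)]
    for i in range(n):
        for c in range(budget, dyn_count[i] - 1, -1):
            candidate = dp[c - dyn_count[i]] + sdc_prob[i]
            if candidate > dp[c]:
                dp[c] = candidate
                taken[i][c] = True
    # Backtrack to recover which instructions to duplicate.
    chosen, c = [], budget
    for i in range(n - 1, -1, -1):
        if taken[i][c]:
            chosen.append(i)
            c -= dyn_count[i]
    return chosen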
We then formulate the problem as a classical 0-1 knapsackproblem [99], where the objects are the instructions and the knapsack capacity isrepresented by P, the maximum allowable performance overhead. Further, objectprofits are represented by the estimated SDC probability (and hence selecting theinstruction means obtaining the coverage), and object costs are represented by thedynamic execution count of the instruction. Note that we assume that the SDCprobability estimates of the instructions are independent of each other – while thisis not necessarily true in practice, it keeps the model tractable, and in the worst72case leads to conservative protection (i.e., over-protection). We use the dynamicprogramming algorithm for the 0-1 knapsack problem - this is similar to what priorwork did [97].For the maximum performance overhead P, we first measure the overhead ofduplicating all the instructions in the program (i.e., full duplication) and set this asthe baseline as it represents the worst-case overhead. The overheads are measuredbased on the wall-clock time of the actual execution of the duplicated programs(averaged on 3 executions each). We find that full duplication incurs an overheadof 36.18% across benchmarks. We consider 2 overhead bound levels, namely the1/3rd and 2/3rd of the full duplication overheads, which are (1) 11.78% and (2)23.31% respectively.For each overhead level, our algorithm chooses the instructions to protect us-ing the knapsack algorithm. The chosen instructions are then duplicated using aspecial pass in LLVM we wrote, and the duplication occurs at the LLVM IR level.Our pass also places a comparison instruction after each instruction protected todetect any deviations of the original computations and duplicated computations.If protected instructions are data dependent on the same static data-dependent in-struction sequence, we only place one comparison instruction at the latter protectedinstruction to reduce performance overhead. This is similar to what other relatedwork did [51, 97]. For comparison purposes, we repeat the above process using thetwo simpler models ( fs + fc and fs ). We then use FI to obtain the SDC probabilitiesof the programs protected using the different models at different overhead levels.Note that FI is used only for the evaluation and not for any of the models.Figure 5.8 shows the results of the SDC probability reduction at different pro-tection levels. Without protection, the average SDC probability of the programs is13.59%. At the 11.78% overhead level, after protection based on TRIDENT, fs+ fc and fs the corresponding SDC probabilities are 5.50%, 5.53%, 9.29% respec-tively. On average, the protections provided by the three models reduce the SDCprobabilities by 64%, 64% and 40% respectively. At the 23.31% overhead level,after the protections based on TRIDENT, fs + fc and fs respectively, the averageSDC probabilities are 1.55%, 2.00% and 4.04%. This corresponds to a reductionof 90%, 87% and 74% of the SDC probability in the baseline respectively. Thus,on average, TRIDENT provides a higher SDC probability reduction for the same73Figure 5.9: Overall SDC Probabilities Measured by FI and Predicted by TRI-DENT, ePVF and PVF (Margin of Error: ±0.07% to ±1.76% at 95%Confidence)overhead level compared with the two simpler models.Taking a closer look, the protection based on fs + fc achieves comparable SDCprobability reductions with TRIDENT. 
This is because the relative ranking ofSDC probabilities between instructions plays a more dominant role in the selectiveprotection than the absolute SDC probabilities. The ranking of the SDC proba-bilities of individual instructions derived by fs + fc is similar to that derived byTRIDENT. Adding fm boosts the overall accuracy of the model in predicting theabsolute SDC probabilities (Figure 5.5), but not the relative SDC probabilities –the only exception is Libquantum. This shows the importance of modeling control-flow divergence, which is missing in other existing techniques [51, 52, 130].5.6 DiscussionWe first investigate the sources of inaccuracy in TRIDENT based on the experi-mental results (Section 5.4). We then examine some of the threats to the validityof our evaluation. Finally, we compare TRIDENT with two closely related priortechniques, namely PVF and ePVF.5.6.1 Sources of InaccuracyErrors in Store Address: If a fault modifies the address of a store instruction,in most cases, an immediate crash would occur because the instruction accessesmemory that is out of bounds. However, if the fault does not cause a crash, itcan corrupt an arbitrary memory location, and may eventually lead to SDC. It isdifficult to analyze which memory locations may be corrupted as a result of suchfaults, leading to inaccuracy in the case. In our fault injection experiments, weobserve that on average about 5.05% of faults affect addresses in store instructions74and survive from crashes.Memory Copy: Another source of inaccuracy in TRIDENT is that we donot handle bulk memory operations such as memmove and memcpy, which arerepresented by special instructions in the LLVM IR. We find such operations inbenchmark such as Sad, Lulesh, Hercules and PureMD, which makes our techniquesomewhat inaccurate for these programs.Manipulation of Corrupted Bits: As mentioned in Section 5.3.3, we assumeonly instructions such as comparisons, logical operators and casts have maskingeffects to simplify our calculations, and that none of the other instructions maskthe corrupted bits. However, this is not always the case as other instructions mayalso cause masking. For example, division operations such as fdiv may also averageout corrupted bits in the mantissa of floating point numbers, and hence mask errors.We find that 1% of the faults affect fdiv in program such as Lulesh, thereby leadingto inaccuracies.Conservatism in Determining Memory Corruption: Recall that when control-flow divergence happens, we assume all the store instructions that are dominatedby the faulty branch are corrupted (Section 5.3). This is a conservative assumption,as some stores may end up being coincidentally correct. For example, if a store in-struction is supposed to write a zero to its memory location, but is not executed dueto the faulty branch, the location will still be correct if there was a zero already inthat location. These are called lucky loads [42, 51].5.6.2 Threats to ValidityBenchmarks: As mentioned in Section 5.4.1, we choose 11 programs to encom-pass a wide variety of domains rather than sticking to just one benchmark suite(unlike performance evaluation, there is no standard benchmark suite for reliabil-ity evaluation). Our results may be specific to our choice of benchmarks, thoughwe have not observed this to be the case. Other work in this domain makes similardecisions [51, 97].Platforms: In this work, we focus on CPU programs for TRIDENT. GraphicProcessing Units (GPU) are another important platform for reliability studies. 
Wehave attempted to run TRIDENT on GPU programs, but were crippled by the lack75of automated tools for code analysis and fault injection on GPUs. Our preliminaryresults in this domain using small CUDA kernels (instrumented manually) confirmthe accuracy of TRIDENT. However, more rigorous evaluation is needed.Program Input: As the high-fidelity fault injection experiments take a longtime (Section 5.4.3), we run each program only under 1 input. This is also thecase for almost all other studies we are aware of in this space [51, 52]. Di Leoet at. [46] have found SDC probabilities of programs may change under differentprogram inputs. We plan to consider multiple inputs in our future work.Fault Injection Methodology: We use LLFI, a fault injector that works atthe LLVM IR level, to inject single bit flips. While this method is accurate forestimating SDC probabilities of programs [121, 147], it remains an open questionas to how accurate it is for other failure types. That said, our focus in this chapteris SDCs, and so this is an appropriate choice for us.5.6.3 Comparison with ePVF and PVFePVF (enhanced PVF) is a recent modeling technique for error propagation in pro-grams [51]. It shares the same goal with TRIDENT in predicting the SDC proba-bility of a program, both at the aggregate level and instruction level. ePVF is basedon PVF [130], which stands for Program Vulnerability Factor. The main differ-ence is that PVF does not distinguish between crash-causing faults and SDCs, andhence its accuracy of SDC prediction is poor [51]. ePVF improves the accuracyof PVF by removing most crashes from the SDC prediction. Unfortunately, ePVFcannot distinguish between benign faults and SDCs, and hence its accuracy suffersaccordingly [51]. This is because ePVF only models error propagation in staticdata-dependent instruction sequence and in memory if the static data-dependentinstruction sequence ends with a store instruction, ignoring error propagation tocontrol-flow and other parts of memory. Both ePVF and PVF, like TRIDENT,require no FI in their prediction of SDC, and can be implemented at the LLVMIR level3. We implement both techniques using LLVM, and compare their resultswith TRIDENT’s results.Since crashes and SDCs are mutually exclusive, by removing the crash-causing3ePVF was originally implemented using LLVM, but not PVF.76faults, ePVF computes a relatively closer result to SDC probability than PVF [51].However, the crash propagation model proposed by ePVF in identifying crashesrequires a detailed DDG of the entire program’s execution, which is extremelytime-consuming and resource hungry. As a result, ePVF can be only executed inprograms with a maximum of a million dynamic instructions in practice [51]. Toaddress this issue and reproduce ePVF on our benchmarks and workloads (aver-age 109 million dynamic instructions), we modify ePVF by replacing its crashpropagation model with the measured results from FI. In other words, we assumeePVF identifies 100% of the crashes accurately, which is higher than the accuracyof the ePVF model. Hence, this comparison is conservative as it overestimates theaccuracy of ePVF.We use TRIDENT, ePVF and PVF to compute the SDC probabilities of thesame benchmarks and workloads, and then compare them with FI which serves asour ground truth. The number of randomly sampled faults are 3,000. The resultsare shown in Figure 5.9. As shown, ePVF consistently overestimates the SDCprobabilities of the programs with a mean absolute error of 36.78% whereas it is4.75% in TRIDENT. 
PVF results in an even larger mean absolute error of 75.19%as it does not identify crashes. The observations are consistent with those reportedby Fang et al. [51]. The average SDC probability measured by FI is 13.59%. ePVFand PVF predict it as 52.55% and 90.62% respectively, while TRIDENT predictsit as 14.83% and is significantly more accurate as a result.5.7 SummaryIn this chapter, we proposed TRIDENT, a three-level model for soft error propaga-tion in programs. TRIDENT abstracts error propagation at static instruction level,control-flow level and memory level, and does not need any fault injection (FI). Weimplemented TRIDENT in the LLVM compiler, and evaluated it on 11 programs.We found that TRIDENT achieves comparable accuracy as FI, but is much fasterand scalable both for predicting the overall SDC probabilities of programs, and theSDC probabilities of individual instructions in a program. We also demonstratedthat TRIDENT can be used to guide selective instruction duplication techniques,and is significantly more accurate than simpler models.77Chapter 6Modeling Input-Dependent ErrorPropagation in ProgramsIn this chapter, we discuss how program inputs could affect error propagation inprograms as real-world applications are executed with arbitrarily different inputsin production environment. We first explain the common misunderstandings foundin the literature about error propagation versus program inputs. Then we identifythe predominant components that need to be considered when modeling input-dependent error propagation. We find that it is possible to extend the analyticalmodel discussed in Chapter 5 to support multiple inputs while achieving a rea-sonably high accuracy and speed. Finally, we show the evaluation results of theextended model in bounding the SDC probabilities of a given program with multi-ple inputs.6.1 IntroductionFault injections (FIs) are commonly used for evaluating and characterizing pro-grams’ resilience, and to obtain the overall SDC probability of a program. In eachFI campaign, a single fault is injected into a randomly sampled instruction, and theprogram is executed till it crashes or finishes. FI therefore requires that the programis executed with a specific input. In practice, a large number of FI campaigns areusually required to achieve statistical significance, which can be extremely time-78consuming. As a result, most prior work limits itself to a single program input orat most a small number of inputs. Unfortunately, the number of possible inputscan be large, and there is often significant variance in SDC probabilities acrossprogram inputs. For example, in our experiments, we find that the overall SDCprobabilities of the same program (Lulesh) can vary by more than 42 times underdifferent inputs. This seriously compromises the correctness of the results from FI.Therefore, there is a need to characterize the variation in SDC probabilities acrossmultiple inputs, without expensive FIs.We find that there are two factors determining the variation of the SDC prob-abilities of the program across its inputs (we call this the SDC volatility): (1) Dy-namic execution footprint of each instruction, and (2) SDC probability of eachinstruction (i.e., error propagation behaviour of instructions). Almost all existingtechniques [43, 46, 54] on quantifying programs’ failure variability across inputsconsider only the execution footprint of instructions. 
However, we find that the er-ror propagation behavior of individual instructions often plays as important a rolein influencing the SDC volatility (Section 6.2). Therefore, all existing techniquesexperience significant inaccuracy in determining a program’s SDC volatility.In this chapter, we propose an automated technique to determine the SDCvolatility of a program across different inputs, that takes into account both theexecution footprint of individual instructions, and their error propagation probabil-ities. Our approach consists of three steps. First, we perform experimental stud-ies using FI to analyze the properties of SDC volatility, and identify the sourcesof the volatility. We then build a model, VTRIDENT, which predicts the SDCvolatility of programs automatically without any FIs. VTRIDENT is built on ourprior model, TRIDENT (Chapter 5) for predicting error propagation, but sacrificessome accuracy for speed of execution. Because we need to run VTRIDENT formultiple inputs, execution speed is much more important than in the case of TRI-DENT. The intuition is that for identifying the SDC volatility, it is more importantto predict the relative SDC probabilities among inputs than the absolute probabil-ities. Finally, we use VTRIDENT to bound the SDC probabilities of a programacross multiple inputs, while performing FI on only a single input. To the best ofour knowledge, we are the first to systematically study and model the variation ofSDC probabilities in programs across inputs.79The main contributions are as follows:• We identify two sources of SDC volatility in programs, namely INSTRUCTION-EXECUTION-VOLATILITY that captures the variation of dynamic executionfootprint of instructions, and INSTRUCTION-SDC-VOLATILITY that cap-tures the variability of error propagation in instructions, and mathematicallyderive their relationship (Section 6.2).• To understand how SDC probabilities vary across inputs, we conduct a FIstudy using nine benchmarks with ten different program inputs for eachbenchmark, and quantify the relative contribution of INSTRUCTION-EXECUTION-VOLATILITY and INSTRUCTION-SDC-VOLATILITY (Section 6.3) to theoverall SDC volatility.• Based on the understanding, we build a model, VTRIDENT1, on top of ourprior framework for modeling error propagation in programs TRIDENT(Section 6.4.2). VTRIDENT predicts the SDC volatility of instructionswithout any FIs, and also bounds the SDC probabilities across a given setof inputs.• Finally, we evaluate the accuracy and scalability of VTRIDENT in identi-fying the SDC volatility of instructions (Section 6.5), and in bounding SDCprobabilities of program across inputs (Section 6.6).Our main results are as follows:• Volatility of overall SDC probabilities is due to both the INSTRUCTION-EXECUTION-VOLATILITY and INSTRUCTION-SDC-VOLATILITY. Usingonly INSTRUCTION-EXECUTION-VOLATILITY to predict the overall SDCvolatility of the program results in significant inaccuracies, i.e., an averageof 7.65x difference with FI results (up to 24x in the worst case).• We find that the accuracy of VTRIDENT is 87.81% when predicting theSDC volatility of individual instructions in the program. The average differ-ence between the variability predicted by VTRIDENT and that by FI is only1.26x (worst case is 1.29x).1VTRIDENT stands for “Volatility Prediction for TRIDENT”.80• With VTRIDENT 78.89% of the given program inputs’ overall SDC proba-bilities fall within the predicted bounds. 
With INSTRUCTION-EXECUTION-VOLATILITY alone, only 32.22% of the probabilities fall within the predicted bounds.

• Finally, the average execution time for VTRIDENT is about 15 minutes on an input of nearly 500 million dynamic instructions. This constitutes a speedup of more than 8x compared with the TRIDENT model to bound the SDC probabilities, which is itself an order of magnitude faster than FI [59].

6.2 Volatilities and SDC

In this section, we explain how we calculate the overall SDC probability of a program under multiple inputs. Statistical FI is the most common way to evaluate the overall SDC probability of a program and has been used in other related work in the area [43, 54, 65, 66, 88]. It randomly injects a large number (usually thousands) of faults under a given program input, one fault per program execution, by uniformly choosing a program instruction for injection from the set of all executed instructions.

Equation 6.1 shows the calculation of the overall SDC probability of the program, Poverall, from statistical FI. NSDC is the number of FI campaigns that result in SDCs among all the FI campaigns, and Ntotal is the total number of FI campaigns. Equation 6.1 can be expanded to the equivalent form shown in Equation 6.2, where Pi is the SDC probability of each (static) instruction that is chosen for FI, Ni is the number of times that the static instruction is chosen for injection over all FI campaigns, and i = 1 to n indexes the distinct static instructions that are chosen for injection.

\[
P_{overall} = N_{SDC} / N_{total} \tag{6.1}
\]
\[
P_{overall} = \Big(\sum_{i=1}^{n} P_i \times N_i\Big) / N_{total} = \sum_{i=1}^{n} P_i \times (N_i / N_{total}) \tag{6.2}
\]

In Equation 6.2, we can see that Ni/Ntotal and Pi are the two relevant factors in the calculation of the overall SDC probability of the program. Ni/Ntotal can be interpreted as the probability of the static instruction being sampled during the program execution. Because the faults are uniformly sampled during the program execution, Ni/Ntotal is statistically equivalent to the ratio between the number of dynamic executions of the chosen static instruction and the total number of dynamic instructions in the program execution. We call this ratio the dynamic execution footprint of the static instruction. The larger the dynamic execution footprint of a static instruction, the higher the chance that it is chosen for FI.

Therefore, from Equation 6.2, we identify two kinds of volatilities that affect the variation of Poverall when program inputs are changed: (1) INSTRUCTION-SDC-VOLATILITY, and (2) INSTRUCTION-EXECUTION-VOLATILITY. INSTRUCTION-SDC-VOLATILITY represents the variation of Pi across the program inputs, while INSTRUCTION-EXECUTION-VOLATILITY is the variation of the dynamic execution footprints, Ni/Ntotal, across the program inputs. We also define the variation of Poverall as OVERALL-SDC-VOLATILITY. As explained above, INSTRUCTION-EXECUTION-VOLATILITY can be calculated by profiling the number of dynamic instructions when inputs are changed, which is straightforward to derive. However, INSTRUCTION-SDC-VOLATILITY is difficult to identify, as Pi requires a large number of FI campaigns on every such instruction i with different inputs, which becomes impractical when the program size and the number of inputs become large. As mentioned earlier, prior work investigating OVERALL-SDC-VOLATILITY considers only INSTRUCTION-EXECUTION-VOLATILITY, and ignores INSTRUCTION-SDC-VOLATILITY [43, 54]. However, as we show in the next section, this can lead to significant inaccuracy in the estimates.
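To make Equation 6.2 concrete, the sketch below computes Poverall from per-instruction SDC probabilities and dynamic execution counts. The instruction names and numbers are purely illustrative assumptions, not measurements from our benchmarks.

```python
# Illustrative sketch of Equation 6.2 (numbers are invented, not measured).
# For each static instruction i: "P" is its SDC probability (Pi) and
# "N" is the number of times it is executed / chosen for injection (Ni).
profile = {
    "load_1":  {"P": 0.20, "N": 1_000_000},
    "cmp_2":   {"P": 0.05, "N":   500_000},
    "store_3": {"P": 0.12, "N":   250_000},
}

N_total = sum(entry["N"] for entry in profile.values())

# P_overall = sum_i Pi * (Ni / N_total): each instruction's SDC probability
# is weighted by its dynamic execution footprint.
P_overall = sum(entry["P"] * (entry["N"] / N_total) for entry in profile.values())

print(f"P_overall = {P_overall:.4f}")  # ~0.1457 for these illustrative numbers
```

Note that obtaining each Pi in this way requires a separate FI campaign per static instruction and per input, which is exactly what makes this calculation impractical at scale.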
Therefore, we focus on derivingINSTRUCTION-SDC-VOLATILITY efficiently in this paper.6.3 Initial FI StudyIn this section, we design experiments to show how INSTRUCTION-SDC-VOLATILITYand INSTRUCTION-EXECUTION-VOLATILITY contribute to OVERALL-SDC-VOLATILITY,then explain the variation of INSTRUCTION-SDC-VOLATILITY across programs.82Table 6.1: Characteristics of BenchmarksBenchmark Suite/Author Description Total Dy-namicInstructions(Millions)Libquantum SPEC Simulation of quan-tum computing6238.55Nw Rodinia A nonlinear globaloptimization methodfor DNA sequencealignments564.63Pathfinder Rodinia Use dynamic pro-gramming to find apath on a 2-D grid6.71Streamcluster Rodinia Dense Linear Algebra 3907.70Lulesh Lawrence Liv-ermore NationalLaboratoryScience and engi-neering problemsthat use modelinghydrodynamics3382.79Clomp Lawrence Liv-ermore NationalLaboratoryMeasurement of HPCperformance impacts11324.17CoMD Lawrence Liv-ermore NationalLaboratoryMolecular dynam-ics algorithms andworkloads17136.62FFT Open Source 2D fast Fourier trans-form6.37Graph Open Source Graph traversal in op-erational research0.156.3.1 Experiment SetupBenchmarksWe choose nine applications in total for our experiments. These are drawn fromstandard benchmark suites, as well as from real world applications. Note that thereare very few inputs provided with the benchmark applications, and hence we had togenerate them ourselves. We search the entire benchmark suites of Rodinia [33],SPLASH-2 [148], PARSEC [18] and SPEC [68], and choose applications basedon two criteria: (1) Compatibility with our toolset (i.e., we could compile themto LLVM IR and work with LLFI), and (2) Ability to generate diverse inputs forour experiments. For the latter criteria, we choose applications that take numericvalues as their program inputs, rather than binary files or files of unknown formats,since we cannot easily generate different inputs in these applications. As a result,83there are only three applications in Rodinia and one application in SPEC meetingthe criteria. To include more benchmarks, we pick three HPC applications (Lulesh,Clomp, and CoMD) from Lawrence Livermore National Laboratory [70], and twoopen-source projects (FFT [72] and Graph [71]) from online repositories. Thenine benchmarks span a wide range of application domains from simulation tomeasurement, and are listed in Table 7.1.Input GenerationSince all the benchmarks we choose take numerical values as their inputs, we ran-domly generate numbers for their inputs. The inputs generated are chosen based ontwo criteria: (1) The input should not lead to any reported errors or exceptions thathalt the execution of the program, as such inputs may not be representative of theapplication’s behavior in production, And (2) The number of dynamic executed in-structions for the inputs should not exceed 50 billion to keep our experimental timereasonable. We report the total number of dynamic instructions generated from the10 inputs of each benchmark in Table 7.1. The average number of dynamic instruc-tions per input is 472.95 million, which is significantly larger than what have beenused in most other prior work [52, 88, 97, 150, 151]. We consider large inputs tostress VTRIDENT and evaluate its scalability.FI methodologyAs mentioned before, we use LLFI [147] to perform the FI experiments. For eachapplication, we inject 100 random faults for each static instruction of the applica-tion – this yields error bars ranging from 0.03% to 0.55% depending on the appli-cation for the 95% confidence intervals. 
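The error bars quoted above can be approximated with the standard normal-approximation confidence interval for a binomial proportion. The sketch below only illustrates that calculation with made-up counts; it is not the exact procedure used to produce the per-application numbers.

```python
import math

def sdc_confidence_interval(num_sdc: int, num_injections: int, z: float = 1.96):
    """Normal-approximation confidence interval for an SDC probability estimate.

    num_sdc: FI campaigns that resulted in an SDC.
    num_injections: total FI campaigns performed.
    z: z-score for the desired confidence level (1.96 for 95%).
    """
    p = num_sdc / num_injections
    half_width = z * math.sqrt(p * (1.0 - p) / num_injections)
    return p, half_width

# Illustrative numbers: 13,590 SDCs observed in 100,000 injections.
p, err = sdc_confidence_interval(13_590, 100_000)
print(f"SDC probability = {p:.2%} +/- {err:.2%}")  # ~13.59% +/- 0.21%
```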
Because we need to derive SDC proba-bilities of every static instruction, we have to perform multiple FIs on every staticinstruction in each benchmark. Therefore, to balance the experimental time withaccuracy, we choose to inject 100 faults on each static instruction. This adds up toa total number of injections ranging from 26,000 to 2,251,800 in each benchmark,depending on the number of static instructions in the program.846.3.2 ResultsINSTRUCTION-EXECUTION-VOLATILITY andOVERALL-SDC-VOLATILITYWe first investigate the relationship between INSTRUCTION-EXECUTION-VOLATILITYand OVERALL-SDC-VOLATILITY. As mentioned in Section 6.2, INSTRUCTION-EXECUTION-VOLATILITY is straight-forward to derive based on the executionprofile alone, and does not require performing any FIs. If it is indeed possible to es-timate OVERALL-SDC-VOLATILITY on the basis of INSTRUCTION-EXECUTION-VOLATILITY alone, we can directly plug in INSTRUCTION-EXECUTION-VOLATILITYto Ni and Ntotal in Equation 6.2 when different inputs are used and treat Pi as a con-stant (derived based on a single input) to calculate the overall SDC probabilities ofthe program with the inputs.We profiled INSTRUCTION-EXECUTION-VOLATILITY in each benchmark anduse it to calculate the overall SDC probabilities of each benchmark across all itsinputs. To show OVERALL-SDC-VOLATILITY, we calculate the differences be-tween the highest and the lowest overall SDC probabilities of each benchmark, andplot them in Figure 6.1. In the figure, Exec. Vol. represents the calculation withthe variation of INSTRUCTION-EXECUTION-VOLATILITY alone in Equation 6.2,treating Pi as a constant, which are derived by performing FI on only one input.FI indicates the results derived from FI experiment with the set of all inputs ofeach benchmark. As can be observed, the results for individual benchmark withOVERALL-SDC-VOLATILITY estimated from Exec. Vol. alone are significantlylower than the FI results (up to 24x in Pathfinder). The average difference is 7.65x.This shows that INSTRUCTION-EXECUTION-VOLATILITY alone is not sufficientto capture OVERALL-SDC-VOLATILITY, motivating the need for accurate esti-mation of INSTRUCTION-SDC-VOLATILITY. This is the focus of our work.Code Patterns Leading to INSTRUCTION-SDC-VOLATILITYTo figure out the root causes of INSTRUCTION-SDC-VOLATILITY, we analyzethe FI results and their error propagation based on the methodology proposedin our prior work [59]. We identify three cases leading to INSTRUCTION-SDC-85Figure 6.1: OVERALL-SDC-VOLATILITY Calculated by INSTRUCTION-EXECUTION-VOLATILITY Alone (Y-axis: OVERALL-SDC-VOLATILITY, Error Bar: 0.03% to 0.55% at 95% Confidence)VOLATILITY.Case 1: Value Ranges of Operands of InstructionsDifferent program inputs change the values that individual instructions operatewith. For example, in Figure 6.2A, there are three instructions (LOAD, CMP andBR) on a straight-line code sequence. Assume that under some INPUT A, R1 is16 and R0 is 512, leading the result of the CMP (R3) to be FALSE. Since thehighest bit of 512 is the 9th bit, any bit-flip at the bit positions that are higher than9 in R1 will modify R1 to a value that is greater than R0. This may in turn causethe result of the CMP instruction (R3) to be TRUE. In this case, the probabilityfor the fault that occurred at R1 of the LOAD instruction to propagate to R3 is(32-9)/32=71.88% (assuming a 32-bit data width of R1). 
In another INPUT B,assume R1 is still 16, but R0 becomes 64 of which the highest bit is the 6th bit.In this case, the probability for the same fault to propagate to R3 becomes (32-6)/32=81.25%. In this example, the propagation probability increases by almost10% for the same fault for a different input. In other words, the SDC volatility ofthe LOAD instruction in the example is changed by about 10%. We find that inthe nine benchmarks, the proportion of instructions that fall into this pattern variesfrom 3.07% (FFT) to 15.23% (Nw) - the average is 6.98%. The instructions exhibit86different error propagation even if the control flow does not change.Figure 6.2: Patterns Leading to INSTRUCTION-SDC-VOLATILITYCase 2: Execution Paths and BranchesDifferent program inputs may exercise different execution paths of programs.For example, in Figure 6.2B, there are three branch directions labeled with T1,F1 and T2. Each direction may lead to a different execution path. Assume thatthe execution probabilities of T1, F1 and T2 are 60%, 70% and 80% for someINPUT A. If a fault occurs at the BR instruction and modifies the direction ofthe branch from F1 to T1, the probability of this event is 70% as the executionprobability of F1 is 70%. In this case, the probability for the fault to propagateto the STORE instruction under T2 is 70%*80%=56%. Assuming there is anotherINPUT B which makes the execution probabilities of T1, F1 and T2, 10%, 90% and30% respectively. The probability for the same fault to propagate to the STOREinstruction becomes 90%*30%=27%. Thus, the propagation probability of thefault decreases by 29% from INPUT A to INPUT B, and thus the SDC volatilityof the BR instruction is 29%. In the nine benchmarks, we find that 43.28% ofthe branches on average exhibit variations of branch probabilities across inputs,leading to variation of SDC probability in instructions.Case 3: Number of Iterations of LoopsThe number of loop iterations can change when program inputs are changed,causing volatility of error propagation. For example, in Figure 6.2C, there is a loopwhose termination is controlled by the value of R2. The CMP instruction comparesR1 against R0 and stores it in R2. If the F branch is taken, the loop will continue,whereas if T branch is taken, the loop will terminate. Assume that under some87INPUT A the value of R0 is 4, and that in the second iteration of the loop, a faultoccurs at the CMP instruction and modifies R2 to TRUE from FALSE, causing theloop to terminate early. In this case, the STORE instruction is only executed twicewhereas it should be executed 4 times in a correct execution. Because of the earlytermination of the loop, there are 2 STORE executions missing. Assume there isanother INPUT B that makes R0 8, indicating there are 8 iterations of the loopin a correct execution. Now for the same fault in the second iteration, the loopterminates resulting in only 2 executions of the STORE whereas it should execute8 times. 6 STORE executions are missing with INPUT B (8-2=6). If the SDCprobability of the STORE instruction stays the same with the two inputs, INPUTB triples (6/2=3) the probability for the fault to propagate through the missingSTORE instruction, causing the SDC volatility. In the nine benchmarks, we findthat 90.21% of the loops execute different numbers of iterations when the input ischanged.6.4 Modeling INSTRUCTION-SDC-VOLATILITYWe first discuss the drawback of TRIDENT which is proposed in Chapter 5. 
Wethen describe VTRIDENT, an extension of TRIDENT to predict INSTRUCTION-SDC-VOLATILITY. The main difference between the two models is that VTRI-DENT simplifies the modeling in TRIDENT to improve running time, which isessential for processing multiple inputs.6.4.1 Drawbacks of TRIDENTEven though TRIDENT is orders of magnitude faster than FI and other modelsin measuring SDC probabilities, it can sometimes take a long time to execute de-pending on the program input. Further, when we want to calculate the variationin SDC probabilities across inputs, we need to execute TRIDENT once for eachinput, which can be very time-consuming. For example, if TRIDENT takes 30minutes on average per input for a given application (which is still considerablyfaster than FI), it would take more than 2 days (50 hours) to process 100 inputs.This is often unacceptable in practice. Further, because TRIDENT tracks mem-ory error propagation in a fine-grained manner, it needs to collect detailed memory88traces. In a few cases, these traces are too big to fit into memory, and hence wecannot run TRIDENT at all. This motivates VTRIDENT, which does not needdetailed memory traces, and is hence much faster.6.4.2 VTRIDENTAs mentioned above, the majority of time spent in executing TRIDENT is in pro-filing and traversing memory dependencies of the program, which is the bottleneckin scalability. VTRIDENT extends TRIDENT by pruning any repeating memorydependencies from the profiling, and keeping only distinct memory dependenciesfor tracing error propagation. The intuition is that if we equally apply the samepruning to all inputs in each program, similar scales of losses in accuracy will beexperienced across the inputs. Therefore, the relative SDC probabilities across in-puts are preserved. Since volatility depends only on the relative SDC probabilitiesacross inputs, the volatilities will also be preserved under pruning.WorkflowFigure 6.3 shows the workflow of VTRIDENT. It is implemented as a set of LLVMcompiler passes which take the code of the program (compiled into LLVM IR) anda set of inputs of the program. The output of VTRIDENT is the INSTRUCTION-SDC-VOLATILITY and INSTRUCTION-EXECUTION-VOLATILITY of the programacross all the inputs provided, both at the aggregate level and per-instruction level.Based on Equation 6.2, OVERALL-SDC-VOLATILITY can be computed usingINSTRUCTION-SDC-VOLATILITY and INSTRUCTION-EXECUTION-VOLATILITY.VTRIDENT executes the program with each input provided, and records thedifferences of SDC probabilities predicted between inputs to generate INSTRUCTION-SDC-VOLATILITY. During each execution, the program’s dynamic footprint isalso recorded for the calculation of INSTRUCTION-EXECUTION-VOLATILITY. Theentire process is fully automated and requires no intervention of the user. Further,no FIs are needed in any part of the process.89vTrident● Program code (LLVM IR)● Program inputs● Instruction-SDC-Volatility● Instruction-Execution-VolatilityFigure 6.3: Workflow of VTRIDENTExampleWe use an example from Graph in Figure 6.4A to illustrate the idea of VTRIDENTand its differences from TRIDENT. We make minor modifications for clarity andremove some irrelevant parts in the example. Although VTRIDENT works atthe level of LLVM IR, we show the corresponding C code for clarity. We firstexplain how TRIDENT works for the example, and then explain the differenceswith VTRIDENT.In Figure 6.4A, the C code consists of three functions, each of which containsa loop. 
In each loop, the same array is manipulated symmetrically in iterations of the loops and transferred back and forth through memory. So the load and store instructions in the loops (LOOP 1, 2 and 3) are all memory data-dependent. Therefore, if a fault contaminates any of them, it may propagate through the memory dependencies of the program. init() is called once at the beginning, then Parcour() and Recher() are invoked respectively in LOOP 4 and 5. printf (INDEX 6) at the end is the program's output. In the example, we assume LOOP 4 and 5 execute two iterations each for simplicity. A fault therefore leads to an SDC if it propagates to this printf instruction.

To model error propagation via the memory dependencies of the program, a similar memory dependency graph is created in Figure 6.5. Each node represents either a dynamic load or a dynamic store instruction; the indices and loop positions of the corresponding static instructions are marked on the right. In the figure, each column of nodes indicates data-dependent executions of the instructions - there is no data flowing between columns, as the array of data is manipulated by LOOP 1, 2 and 3 symmetrically. In this case, TRIDENT finds the opportunity to prune the repeated columns of nodes to speed up its modeling time, as error propagation is similar across the columns. The pruned columns are drawn with dashed borders in the figure, and they indicate the pruning of the inner-most loops. TRIDENT applies this optimization for memory-level modeling, resulting in significant acceleration compared with previous modeling techniques [59]. However, as mentioned, the graph can still take significant time to construct and process.

Figure 6.5: Memory Dependency Pruning in TRIDENT

To address this issue, VTRIDENT further prunes the memory dependencies by tracking error propagation only in distinct dependencies to speed up the modeling. Figure 6.4B shows the idea: the graph in the figure is the one already pruned by TRIDENT (Figure 6.5). Arrows between nodes indicate propagation probabilities in the straight-line code. Because there could be instructions leading to crashes and error masking in straight-line code, the propagation probabilities are not 1. The propagation probabilities marked beside the arrows are aggregated to compute SDC probabilities for INPUT A and INPUT B respectively. For example, if a fault occurs at INDEX 1, the SDC probability for the fault to reach the program output (INDEX 6) is calculated as 1 ∗ 1 ∗ 0.5 ∗ 0.5 = 25% for INPUT A, and 1 ∗ 1 ∗ 0.8 ∗ 0.8 = 64% for INPUT B. Thus, the variation of the SDC probability between these two inputs is 39%. VTRIDENT prunes the propagation by removing repeated dependencies (their nodes are drawn with dashed borders in Figure 6.4B). The calculation of the SDC probability for a fault at INDEX 1 to reach INDEX 6 then becomes 1 ∗ 0.5 = 50% with INPUT A, and 1 ∗ 0.8 = 80% with INPUT B. The variation between the two inputs thus becomes 30%, which is 9% lower than that computed by TRIDENT (i.e., without the additional pruning).

We make two observations from the above discussion: (1) If the propagation probabilities are 1 or 0, the pruning does not result in loss of accuracy (e.g., LOOP 4 in Figure 6.4B).
(2) The difference with and without pruning will be higher ifthe numbers of iterations become very large in the loops that contain non-1 or92non-0 propagation probabilities (i.e., LOOP 5 in Figure 6.4B). This is becausemore terms will be removed from the calculation by VTRIDENT. We find thatabout half (55.39%) of all faults propagating in the straight-line code have eitherall 1s or at least one 0 as the propagation probabilities, and thus there is no loss inaccuracy for these faults. Further, the second case is rare because large iterationsof aggregation on non-1 or non-0 numbers will result in an extremely small valueof the overall SDC probability. This is not the case as the average SDC probabilityis 10.74% across benchmarks. Therefore, the pruning does not result in significantaccuracy loss in VTRIDENT.6.5 Evaluation of VTRIDENTIn this section, we evaluate the accuracy and performance of VTRIDENT in pre-dicting INSTRUCTION-SDC-VOLATILITY across multiple inputs. We use the samebenchmarks and experimental procedure as before in Section 6.3. The code ofVTRIDENT can be found in our GitHub repository.26.5.1 AccuracyTo evaluate the ability of VTRIDENT in identifying INSTRUCTION-SDC-VOLATILITY,we first classify all the instructions based on their INSTRUCTION-SDC-VOLATILITYderived by FI and show their distributions – this serves as the ground truth. We clas-sify the differences of the SDC probabilities of each measured instruction betweeninputs into three categories based on their ranges of variance (<10%, 10%−20%and >20%), and calculate their distribution based on their dynamic footprints. Theresults are shown in Figure 6.6. As can be seen in the figure, on average, only3.53% of instructions across benchmarks exhibit variance of more than 20% in theSDC probabilities. Another 3.51% exhibit a variance between 10% and 20%. Theremaining 92.93% of the instructions exhibit within 10% variance across inputs.We then use VTRIDENT to predict the INSTRUCTION-SDC-VOLATILITYfor each instruction, and then compare the predictions with ground truth. Theseresults are also shown in Figure 6.6. As can be seen, for instructions that haveINSTRUCTION-SDC-VOLATILITY less than 10%, VTRIDENT gives relatively2https://github.com/DependableSystemsLab/Trident93Figure 6.6: Distribution of INSTRUCTION-SDC-VOLATILITY predictionsby vTrident Versus Fault Injection Results (Y-axis: Percentage of in-structions, Error Bar: 0.03% to 0.55% at 95% Confidence)accurate predictions across benchmarks. On average, 97.11% of the instructionsare predicted to fall into this category by VTRIDENT, whereas FI measures it as92.93%. Since these constitute the vast majority of instructions, VTRIDENT hashigh accuracy overall.On the other hand, instructions that have INSTRUCTION-SDC-VOLATILITYof more than 20% are significantly underestimated by VTRIDENT, as VTRI-DENT predicts the proportion of such instructions as 1.84% whereas FI mea-sures it as 3.53% (which is almost 2x more). With that said, for individual bench-marks, VTRIDENT is able to distinguish the sensitivities of INSTRUCTION-SDC-VOLATILITY in most of them. For example, in Pathfinder which has the largestproportion of instructions that have INSTRUCTION-SDC-VOLATILITY greater than20%, VTRIDENT is able to accurately identify that this benchmark has the high-est proportion of such instructions relative to the other programs. However, we findVTRIDENT is not able to well identify the variations that are greater than 20%as mentioned above. This case can be found in Nw, Lulesh, Clomp and FFT. 
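For reference, the volatility categories used in this comparison can be computed as in the following sketch. The instruction names and probabilities are invented for illustration; in the thesis, the resulting distribution is additionally weighted by each instruction's dynamic footprint.

```python
# Illustrative sketch of the volatility classification (invented data).
# For each static instruction, we have its SDC probability under each input.
sdc_prob_per_input = {
    "inst_12": [0.10, 0.12, 0.11],   # spread  2%  -> "<10%"
    "inst_47": [0.05, 0.21, 0.17],   # spread 16%  -> "10%-20%"
    "inst_83": [0.40, 0.02, 0.15],   # spread 38%  -> ">20%"
}

def volatility_category(probs):
    spread = max(probs) - min(probs)   # INSTRUCTION-SDC-VOLATILITY
    if spread < 0.10:
        return "<10%"
    elif spread <= 0.20:
        return "10%-20%"
    return ">20%"

for inst, probs in sdc_prob_per_input.items():
    print(inst, volatility_category(probs))
```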
Wediscuss the sources of inaccuracy in Section 6.7.1. Since these instructions are rel-atively few in terms of dynamic instructions in the programs, this underpredictiondoes not significantly affect the accuracy of VTRIDENT.We then measure the overall accuracy of VTRIDENT in identifying INSTRUCTION-SDC-VOLATILITY. The accuracy is defined as the number of correctly predictedvariation categories of instructions over the total number of instructions being pre-dicted. We show the accuracy of VTRIDENT in Figure 6.7. As can be seen, thehighest accuracy is achieved in Streamcluster (99.17%), while the lowest accuracyis achieved in Clomp (67.55%). The average accuracy across nine benchmarks is9487.81%, indicating that VTRIDENT is able to identify most of the INSTRUCTION-SDC-VOLATILITY.Figure 6.7: Accuracy of VTRIDENT in Predicting INSTRUCTION-SDC-VOLATILITY Versus FI (Y-axis: Accuracy)Figure 6.8: OVERALL-SDC-VOLATILITY Measured by FI and Predicted byVTRIDENT, and INSTRUCTION-EXECUTION-VOLATILITY alone (Y-axis: OVERALL-SDC-VOLATILITY, Error Bar: 0.03% to 0.55% at95% Confidence)Finally, we show the accuracy of predicting OVERALL-SDC-VOLATILITY us-ing VTRIDENT, and using INSTRUCTION-EXECUTION-VOLATILITY alone (asbefore) in Figure 6.8. As can be seen, the average difference between VTRIDENTand FI is only 1.26x. Recall that the prediction using INSTRUCTION-EXECUTION-VOLATILITY alone (Exec. Vol.) gives an average difference of 7.65x (Section 6.3).95The worst case difference when considering only Exec. Vol. was 24.54x, while itis 1.29x (in Pathfinder) when INSTRUCTION-SDC-VOLATILITY is taken into ac-count. Similar trends are observed in all other benchmarks. This indicates that theaccuracy of OVERALL-SDC-VOLATILITY prediction is significantly higher whenconsidering both INSTRUCTION-SDC-VOLATILITY and INSTRUCTION-EXECUTION-VOLATILITY rather than just using INSTRUCTION-EXECUTION-VOLATILITY.6.5.2 PerformanceWe evaluate the performance of VTRIDENT based on its execution time, andcompare it with that of TRIDENT. We do not consider FI in this comparison asFI is orders of magnitude slower than TRIDENT [59]. We measure the time takenby executing VTRIDENT and TRIDENT in each benchmark, and compare thespeedup achieved by VTRIDENT over TRIDENT. The total computation is pro-portional to both the time and power required to run each approach. Parallelizationwill reduce the time spent, but not the power consumed. We assume that there is noparallelization for the purpose of comparison in the case of TRIDENT and VTRI-DENT, though both TRIDENT and VTRIDENT can be parallelized. Therefore,the speedup can be computed by measuring their wall-clock time.We also measure the time per input as both TRIDENT and VTRIDENT ex-perience similar slowdowns as the number of inputs increase (we confirmed thisexperimentally). The average execution time of VTRIDENT is 944 seconds perbenchmark per input (a little more than 15 minutes). Again, we emphasize thatthis is due to the considerably large input sizes we have considered in this study(Section 6.3).The results of the speedup by VTRIDENT over TRIDENT are shown in Fig-ure 6.9. We find that on average VTRIDENT is 8.05x faster than TRIDENT.The speedup in individual cases varies from 1.09x in Graph (85.16 seconds ver-sus. 78.38 seconds) to 33.56x in Streamcluster (3960 seconds versus. 118 sec-onds). 
The variation in speedup is because applications have different degrees ofmemory-boundedness: the more memory bounded an application is, the slower it iswith TRIDENT, and hence the larger the speedup obtained by VTRIDENT (as itdoes not need detailed memory dependency traces). For example, Streamcluster is96more memory-bound than computation-bound than Graph, and hence experiencesmuch higher speedups.Figure 6.9: Speedup Achieved by VTRIDENT over TRIDENT. Highernumbers are better.Note that we omit Clomp from the comparison since Clomp consumes morethan 32GB memory in TRIDENT, and hence crashes on our machine. This isbecause Clomp generates a huge memory-dependency trace in TRIDENT, whichexceeds the memory of our 32GB-memory machine (in reality, it experiences sig-nificant slowdown due to thrashing, and is terminated by the OS after a long time).On the other hand, VTRIDENT prunes the memory dependency and incurs only21.29MB memory overhead when processing Clomp.6.6 Bounding Overall SDC Probabilities withVTRIDENTIn this section, we describe how to use VTRIDENT to bound the overall SDCprobabilities of programs across given inputs by performing FI with only one se-lected input. We need FI because the goal of VTRIDENT is to predict the varia-tion in SDC probabilities, rather than the absolute SDC probability which is muchmore time-consuming to predict (Section 6.4.2). Therefore, FI gives us the abso-lute SDC probability for a given input. However, we only need to perform FI on97a single input to bound the SDC probabilities of any number of given inputs usingVTRIDENT, which is a significant savings as FI tends to be very time-consumingto get statistically significant results.For a given benchmark, we first use VTRIDENT to predict the OVERALL-SDC-VOLATILITY across all given inputs. Recall that OVERALL-SDC-VOLATILITYis the difference between the highest and the lowest overall SDC probabilities of theprogram across its inputs. We denote this range by R. We then use VTRIDENT tofind the input that results in the median of the overall SDC probabilities predictedamong all the given inputs. This is because we need to locate the center of therange in order to know the absolute values of the bounds. Using inputs other thanthe median will result in a shifting of the reference position, but will not changethe boundaries being identified, which are more important. Although VTRIDENTloses some accuracy in predicting SDC probabilities as we mentioned earlier, mostof the rankings of the predictions are preserved by VTRIDENT. Finally, we per-form FI on the selected input to measure the true SDC probability of the program,denoted by S. Note that it is possible to use other methods for this estimation (e.g.,TRIDENT [59]). The estimated lower and upper bounds of the overall SDC prob-ability of the program across all its given inputs is derived based on the medianSDC probability measured by FI, as shown below.[(S−R/2),(S+R/2)] (6.3)We bound the SDC probability of each program across its inputs using theabove method. We also use INSTRUCTION-EXECUTION-VOLATILITY alone forthe bounding as a point of comparison. The results are shown in Figure 6.10. Inthe figure, the triangles indicate the overall SDC probabilities with the ten inputsof each benchmark measured by FI. The overall SDC probability variations rangefrom 1.54x (Graph) to 42.01x (Lulesh) across different inputs. The solid linesin the figure bound the overall SDC probabilities predicted by VTRIDENT. 
Thedashed lines bound the overall SDC probabilities projected by considering only theINSTRUCTION-EXECUTION-VOLATILITY.On average, 78.89% of the overall SDC probabilities of the inputs measured byFI are within the bounds predicted by VTRIDENT. For the inputs that are outside98the bounds, almost all of them are very close to the bounds. The worst case isFFT, where the overall SDC probabilities of two inputs are far above the upperbounds predicted by VTRIDENT. The best cases are Streamcluster and CoMDwhere almost every input’s SDC probability falls within the bounds predicted byVTRIDENT (Section 6.7.1 explains why).Figure 6.10: Bounds of the Overall SDC Probabilities of Programs (Y-axis:SDC Probability; X-axis: Program Input; Solid Lines: Bounds derivedby VTRIDENT; Dashed Lines: Bounds derived by INSTRUCTION-EXECUTION-VOLATILITY alone, Error Bars: 0.03% to 0.55% at the95% Confidence). Triangles represent FI results.On the other hand, INSTRUCTION-EXECUTION-VOLATILITY alone boundsonly 32.22% SDC probabilities on average. This is a sharp decrease in the cov-erage of the bounds compared with VTRIDENT, indicating the importance ofconsidering INSTRUCTION-SDC-VOLATILITY when bounding overall SDC prob-abilities. The only exception is Streamcluster where considering INSTRUCTION-EXECUTION-VOLATILITY alone is sufficient in bounding SDC probabilities. Thisis because Streamcluster exhibits very little SDC volatility across inputs (Fig-ure 6.6).In addition to coverage, tight bounds are an important requirement, as a loosebounding (i.e., a large R in Equation 6.3) trivially increases the coverage of thebounding. To investigate the tightness of the bounding, we examine the resultsshown in Figure 6.8. Recall that OVERALL-SDC-VOLATILITY is represented byR, so the figure shows the accuracy of R. As we can see, VTRIDENT computesbounds that are comparable to the ones derived by FI (ground truth), indicating99that the bounds obtained are tight.6.7 DiscussionIn this section, we first summarize the sources of inaccuracy in VTRIDENT, andthen we discuss the implications of VTRIDENT for error mitigation techniques.6.7.1 Sources of InaccuracyOther than the loss of accuracy from the coarse-grain tracking in memory depen-dency (Section 6.4.2), we identify three potential sources of inaccuracy in identify-ing INSTRUCTION-SDC-VOLATILITY by VTRIDENT. They are also the sourcesof inaccuracy in TRIDENT, which VTRIDENT is based on. We explain howthey affect identifying INSTRUCTION-SDC-VOLATILITY here.Source 1: Manipulation of Corrupted BitsWe assume only instructions such as comparisons, logical operators and castshave masking effects, and that none of the other instructions mask the corruptedbits. However, this is not always the case as other instructions may also causemasking. For example, repeated division operations such as fdiv may also averageout corrupted bits in the mantissa of floating point numbers, and hence mask errors.The dynamic footprints of such instructions may be different across inputs hencecausing them to have different masking probabilities, so VTRIDENT does notcapture the volatility from such cases. For instance, in Lulesh, we observe that thenumber of fdiv may differ by as much as 9.5x between inputs.Source 2: Memory CopyVTRIDENT does not handle bulk memory operations such as memmove andmemcpy. Hence, we may lose track of error propagation in the memory dependen-cies built via such operations. 
Since different inputs may diversify memory depen-dencies, the diversified dependencies via the bulk memory operations may not beidentified either. Therefore, VTRIDENT may not be able to identify INSTRUCTION-SDC-VOLATILITY in these cases.Source 3: Conservatism in Determining Memory CorruptionWe assume all the store instructions that are dominated by the faulty branch arecorrupted when control-flow is corrupted, similar to the examples in Figure 6.2B100and Figure 6.2C. This is a conservative assumption, as some stores may end upbeing coincidentally correct. For example, if a store instruction is supposed towrite a zero to its memory location, but is not executed due to the faulty branch,the location will still be correct if there was a zero already in that location. Theseare called lucky loads in prior work [42]. When inputs change, the number of luckyloads may also change due to the changes of the distributions of such zeros inmemory, possibly causing volatility in SDC. VTRIDENT does not identify luckyloads, so it may not capture the volatility from such occasions.6.7.2 Implication for Mitigation TechniquesSelective instruction duplication is an emerging mitigation technique that providesconfigurable fault coverage based on performance overhead budget [52, 88, 89,97]. The idea is to protect only the most SDC-prone instructions in a programso as to achieve high fault coverage while bounding performance overheads. Theproblem setting is as follows: Given a certain performance overhead C, what staticinstructions should be duplicated in order to maximize the coverage for SDCs,F , while keeping the performance overhead below C. Solving the above probleminvolves finding two factors: (1) Pi: The SDC probability of each instruction in theprogram, to decide which set of instructions should be duplicated, and (2) Oi: Theperformance overhead incurred by duplicating the instructions. Then the problemcan be formulated as a classical 0-1 knapsack problem [99], where the objectsare the instructions and the knapsack capacity is represented by C, the maximumallowable performance overhead. Further, object profits are represented by theestimated SDC probability (and hence selecting the instruction means obtainingthe coverage F), and object costs are represented by the performance overhead ofduplicating the instructions.Almost all prior work investigating selective duplication confines their studyto a single input of each program in evaluating Pi and Oi [52, 88, 89, 97]. Hence,the protection is only optimal with respect to the input used in the evaluation. Be-cause of the INSTRUCTION-SDC-VOLATILITY and INSTRUCTION-EXECUTION-VOLATILITY incurred when the protected program executes with different inputs,there is no guarantee on the fault coverage F the protection aims to provide, com-101promising the effectiveness of the selective duplication. To address this issue,we argue that the selective duplication should take both INSTRUCTION-SDC-VOLATILITY and INSTRUCTION-EXECUTION-VOLATILITY into consideration. Oneway to do this is solving the knapsack problem based on the average cases of eachPi and Oi across inputs, so that the protection outcomes, C and F , are optimal withrespect to the average case of the executions with the inputs. 
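A minimal sketch of such an input-averaged selection is shown below. It discretizes the overhead budget and solves the 0-1 knapsack with dynamic programming; the instruction data, the 10% budget, and the discretization step are hypothetical choices made for illustration, not part of the thesis' implementation.

```python
# Hypothetical sketch: choose instructions to duplicate under an overhead budget,
# using Pi and Oi averaged across inputs (0-1 knapsack via dynamic programming).
instructions = [
    # (name, avg SDC contribution / profit, avg duplication overhead in %)
    ("load_5",  0.012, 1.5),
    ("cmp_9",   0.030, 0.8),
    ("fmul_17", 0.004, 2.1),
    ("store_3", 0.021, 1.2),
]
budget_pct = 10.0                      # maximum allowed performance overhead C
step = 0.1                             # discretize overheads into 0.1% units
W = int(round(budget_pct / step))

# dp[w] = best SDC coverage achievable with overhead <= w * step
dp = [0.0] * (W + 1)
choice = [[False] * len(instructions) for _ in range(W + 1)]

for idx, (_, profit, cost_pct) in enumerate(instructions):
    cost = int(round(cost_pct / step))
    for w in range(W, cost - 1, -1):   # iterate backwards for 0-1 knapsack
        if dp[w - cost] + profit > dp[w]:
            dp[w] = dp[w - cost] + profit
            choice[w] = choice[w - cost].copy()
            choice[w][idx] = True

selected = [instructions[i][0] for i, picked in enumerate(choice[W]) if picked]
print(f"coverage F = {dp[W]:.3f}, duplicate: {selected}")
```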
This is a subject of future work.

6.8 Summary

Programs can experience Silent Data Corruptions (SDCs) due to soft errors, and hence we need fault injection (FI) to evaluate the resilience of programs to SDCs. Unfortunately, most FI studies only evaluate a program's resilience under a single input or a small set of inputs, as FI is very time consuming. In practice, however, programs can exhibit significant variations in SDC probabilities under different inputs, which can make the FI results inaccurate.

In this chapter, we investigate the root causes of variations in SDCs under different inputs, and we find that they can occur due to differences in the execution of instructions as well as differences in error propagation. Most prior work has only considered the former factor, which leads to significant inaccuracies in its estimations. We propose a model, VTRIDENT, that incorporates differences in both execution and error propagation across inputs. We find that VTRIDENT achieves higher accuracy and closer bounds on the variation of SDC probabilities of programs across inputs compared to prior work that only considers the differences in the execution of instructions. We also find that VTRIDENT is significantly faster than other state-of-the-art approaches for modeling error propagation in programs, and is able to obtain relatively tight bounds on the SDC probabilities of programs across multiple inputs, while performing FI with only a single program input.

Chapter 7

Understanding Error Propagation in GPGPU Applications

Previous chapters have discussed error propagation in CPU programs. In this chapter, we aim to develop techniques for analyzing error propagation in GPU programs in order to improve their error resilience. Unlike CPU programs, for which a variety of fault injection tools are available, few options exist for studying program-level error propagation in GPU programs. To overcome this difficulty, we first design an LLVM-based fault injector, LLFI-GPU, which is the first fault injector operating on the LLVM Intermediate Representation (IR) of GPU programs. By injecting faults at the IR level, LLFI-GPU allows the user to gain program-level insights into error propagation. We demonstrate that LLFI-GPU can be used to conduct fault injection experiments and study error propagation in GPU programs. In the experiments, we observe error propagation patterns that are specific to GPU programs and discuss their implications for improving error resilience.

7.1 Introduction

Graphics Processing Units (GPUs) have found wide adoption as accelerators for scientific and high-performance computing (HPC) applications due to their mass
Recent studies of GPU reliability in theHPC context have found that GPUs can experience significantly higher fault ratescompared to CPUs [47, 144], and that GPU applications often experience higherrates of Silent Data Corruption (SDCs), i.e., incorrect outcomes, compared to CPUapplications [50, 63].HPC applications typically run for long periods of time, and hence need tobe resilient to faults [114]. Further, in supercomputers, hardware faults have be-come more and more prevalent due to the shrinking feature sizes and power con-straints [55]. One of the most common hardware fault types are transient faults [20,40], which arise due to cosmic rays or electro-magnetic radiation striking compu-tational and/or memory elements, causing the values computed or stored to beincorrect. To mitigate the effect of transient hardware faults, HPC applications usetechniques such as checkpointing and recovery. However, these techniques makean important assumption, namely that faults do not propagate for long periods oftime and corrupt the checkpointed state as this would make the checkpoint unre-coverable [58, 88, 150]. Unfortunately, this assumption does not always hold aserrors often propagate in real applications [89]. More importantly, unmitigatederror propagation can also lead to SDCs, which seriously compromise the applica-tions’ correctness.In this chapter, we investigate the error propagation characteristics of general-purpose GPU (GPGPU) applications with the goal of trying to mitigate error propa-gation. Prior work has investigated error propagation in CPU applications [11, 88],and has developed techniques to mitigate error propagation based on their re-sults [89]. However, it is not clear how applicable are these results to GPGPUapplications, which have a very different programming model. Other work has in-104vestigated the aggregate error resilience of GPGPU applications [50, 63]. Whilethese are valuable, they do not provide insights into how errors propagate withinthe GPGPU application. Such insights are necessary for driving the design of low-overhead error detection mechanisms for these applications, which is our long-termgoal. Such mechanisms have been demonstrated in the CPU space [11, 89]. To thebest of our knowledge, this is the first study of error propagation in GPGPU appli-cations.There are two main challenges with performing studies of error propagationon GPGPU applications. First, because GPGPU applications execute on both theCPU and the GPU (as current GPUs do not provide many of the capabilities neededby applications), it is important to track error propagation across the two entities.Further, GPUs are often invoked multiple times in an application (each such in-vocation is known as a kernel call), so one needs to track error propagation acrossthese invocations. Second, unlike in the CPU space where there are freely availablefault injection tools and frameworks to study error propagation [25, 69, 127, 147],there is a paucity of such tools in the GPU space.We address the first challenge by defining the kernel call as the unit of errorpropagation, and study error propagation both within kernel calls and across mul-tiple calls. We address the second challenge by building a robust LLVM-basedfault injection tool for GPGPU applications. LLVM is a widely-used optimizingcompiler [85], and our fault injection tool is written as a module in the LLVMframework. 
As a result, we are able to leverage the program analysis capabilitiesprovided by LLVM to track error propagation in programs, and correlate it withthe program’s code.We make the following contributions in this paper.• Develop LLFI-GPU, a GPGPU fault injection tool that can operate on theLLVM intermediate representation (IR) of a program and track error propa-gation in GPGPU programs.• Define the metrics for tracking and measuring error propagation in GPGPUprograms,• Conduct a comprehensive fault-injection study on how errors propagate in105twelve GPGPU applications (including both benchmarks and real-world ap-plications), and how long and how fast such errors propagate and spread inthe application,• Discuss how the results may be leveraged by dependability techniques toprovide targeted mitigation of error propagation for GPGPU applications atlow cost.Our main results from the fault-injection study are:• Only a small fraction of the crash-causing faults that occur in GPUs propa-gate to the CPU, and only a minuscule fraction of crash-causing faults prop-agate to other GPU kernels. Thus, it is sufficient to consider checkpointingand recovery techniques at the GPU-CPU boundary.• Errors do propagate to multiple memory locations, but this behavior is highlyapplication specific. For example, a single fault can contaminate anywherebetween 0.0006% locations to more than 60% of total memory locations,depending on the application.• Unlike CPU programs, most of the memory corruptions in GPU programslead to data corruptions in program output. Faults in memory that propagateto output data likely do so within the kernel where faults occurred. Thisallows error detection techniques to operate at the granularity of the GPUkernel call.• More than 50% of the faults that occur in the GPU are masked within a singlekernel execution. Thus, it may be counterproductive to deploy techniquessuch as Dual Modular Redundancy (DMR) or Error Detection by DuplicatedInstructions (EDDI) [106] within the GPU kernel program as they will endup detecting many faults that are eventually masked.7.2 GPU Fault InjectorWe build a fault injector for GPUs based on the open-source LLFI fault injec-tor [147] , which has been extensively used for error propagation studies on the106CPU [11, 88]. However, LLFI does not inherently support GPUs. Furthermore,performing error propagation analysis (EPA) for GPUs is much more intricate thanon CPUs. We therefore extended LLFI to perform both fault injection and EPA onGPUs. We refer to the extended version of LLFI as LLFI-GPU1 to distinguish itfrom the existing LLFI infrastructure.Fault injection can be done at different levels of the system such as at the gate-level, circuit-level, architecture level and application level. Prior work [38] hasfound that there may be significant differences in the raw rates of faults exposed tothe software layer when fault injections are performed in the hardware. However,we are interested in faults that are not masked by the hardware and make their wayto the application. Therefore, we inject faults directly at the application level.7.2.1 Design OverviewLLFI is a compiler-based fault injection framework, and uses the LLVM compilerto instrument the program to inject faults. The CUDA Nvidia compiler NVCC isalso based on LLVM, and compiles LLVM IR to a PTX representation, which thengets compiled to the SAS machine code by Nvidia’s backend compiler. So at firstglance, it seems trivial to integrate LLFI and NVCC to build a GPU-based faultinjector. 
However, there are two challenges that arise in practice. First, NVCCdoes not expose the LLVM IR code and directly transforms it to the PTX code.LLFI relies on the IR code to perform instrumentation for fault injection, and hencecannot inject faults into the IR used by NVCC. Second, GPU programs are multi-threaded, often consisting of hundreds of threads, and hence we need to inject faultsinto a random thread at runtime. However, LLFI does not support injecting faultsinto multi-threaded programs.We address the first challenge by attaching a dynamic library to NVCC whichcan intercept its call to the LLVM compilation module [5]. At that point, we invokethe instrumentation passes of LLFI to perform the instrumentation of the program.We then return the instrumented LLVM IR to NVCC, which proceeds with the restof the compilation process to transform it to PTX code. We address the secondchallenge by adding a threadID field to the profiling data collected by LLFI to1Available at: https://github.com/DependableSystemsLab/LLFI-GPU107identify each thread uniquely. We then choose a thread at random to inject intoat runtime from the set of all threads in the program. We also add information onthe kernel call executed and the total number of kernel calls to the profiling data.These are used to choose kernel calls to inject faults into.LLFI-GPU works as follows. First, LLFI-GPU profiles the program and ob-tains the total number of kernel calls, the number of threads per kernel call, andthe total number of instructions executed by each kernel thread. It then creates aninstrumented version of the program with the fault injection functions inserted intothe CUDA portion of the program’s code (this is similar to what LLFI does, exceptthat we restrict the instrumentation to the CUDA portion of the program). LLFI-GPU then chooses a random thread in a random kernel call, and a random dynamicinstruction executed by it, based on the profiling data gathered (the instruction ischosen uniformly from the set of all instructions executed). For the chosen instruc-tion, LLFI-GPU overwrites the result value of the instruction with a faulty versionof the result (e.g., by flipping a single bit in it), and continues the application.Thus, LLFI-GPU directly executes the program on the GPU hardware after in-strumenting it, unlike prior approaches such as GPU-Qin [50] which use debuggersfor fault injection. Debugger-based fault injection has the advantage that it offersmore control over the program, but is often significantly slower. As a point of com-parison, we ran both GPU-Qin2 and LLFI-GPU on a simple matrix multiplicationbenchmark, MAT from NVIDIA SDK Sample [4]. Similar to LLFI-GPU, GPU-Qin operates in two phases: profiling and fault injection. We measured the averagetime taken by GPU-Qin for these two phases to be 2 hours (=7200 seconds), and82 seconds per run respectively. In contrast, LLFI-GPU takes only 6 seconds forprofiling this benchmark and 2 seconds per run for the fault injection phase for thesame set of inputs. The significant speedup of over 1000x obtained by LLFI-GPUis because GPUs execute significantly slower in debug mode [2], and GPU-Qin sin-gle steps through every instruction in the program in the profiling phase. 
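Returning to the target-selection step described above, the sketch below shows one way to pick an injection site uniformly over all executed dynamic instructions, using per-thread instruction counts from the profiling phase. The data layout and helper names are hypothetical; LLFI-GPU's actual selection is implemented in its LLVM instrumentation passes.

```python
import random

# Hypothetical profiling data: dynamic instruction count per (kernel call, thread).
# Real LLFI-GPU profiles are gathered by the instrumented CUDA code; these
# numbers are made up for illustration.
profile = {
    (0, 0): 1200, (0, 1): 1200,   # kernel call 0, threads 0 and 1
    (1, 0):  800, (1, 1):  950,   # kernel call 1, threads 0 and 1
}

def pick_injection_target(profile, rng=random):
    """Pick (kernel call, thread, dynamic instruction offset) uniformly over
    all dynamic instructions executed, mirroring the selection described above."""
    total = sum(profile.values())
    target = rng.randrange(total)          # uniform over [0, total)
    for (kernel, thread), count in profile.items():
        if target < count:
            return kernel, thread, target  # offset within this thread's trace
        target -= count
    raise AssertionError("unreachable")

print(pick_injection_target(profile))
```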
We haveconfirmed that the above execution times are fairly typical depending on the num-ber of dynamic instructions of the programs executed using GPU-Qin3, and hence2The only other GPU fault injector that we know of, SASSIFI [63], was not publicly available atthe time the paper was written, and hence could not be used for comparison.3Based on personal communication with the developers of GPU-Qin.108we did not run any of the other benchmarks with GPU-Qin. Therefore, LLFI-GPUis significantly faster and more scalable than prior techniques, making it feasiblefor studying realistic HPC workloads.7.2.2 Error Propagation Analysis (EPA)After injecting a fault, LLFI-GPU tracks memory data at every kernel boundaryfor the analysis of error propagation. This is because we are interested in errorpropagation at the GPU kernel boundary, rather than within a kernel. This is dif-ferent from what LLFI does as it tracks the error propagation using memory andregisters after each LLVM instruction. Although we can leverage the existing EPAmechanism in LLFI for tracking error propagation at the kernel boundaries, wefound that this incurs very high overheads, and often results in the kernel runningsubstantially slower. Therefore, we decided to build our own error tracking mech-anism in LLFI-GPU that is optimized for our use case.Figure 7.1 illustrates how the EPA mechanism works for a simple GPU kernel.The code fragment is from bfs. It allocates memory on device through cudaMal-loc() before launching kernels, and it deallocates the memory on device throughcudaFree() at the end of the program. After each kernel invocation, LLFI-GPUsaves all memory data allocated on the GPU to disk. This step corresponds to line6-13. Later, we compare the saved data after each kernel call with that from agolden run and mark any differences as a result of the error propagation. Becausewe perform this comparison at the kernel boundaries, we do not need to worryabout non-determinism introduced by thread interleaving within the GPU.7.2.3 LimitationsOur fault injections are performed at the LLVM IR level rather than at the SASSor PTX code levels. One potential drawback of this approach is that downstreamcompiler optimizations may change both the number and order of instructions, oreven remove the fault injection code we inserted. To mitigate this effect, we madesure that our fault injection pass is applied after various optimization passes in theLLVM IR code. Further, LLFI-GPU gathers all executed instructions in the profil-ing phase as described above, and we made sure that all the target instructions as109Figure 7.1: Example of the error propagation code inserted by LLFI-GPUfault injection candidates are gathered and injects faults only into these instructions- this ensures that faults are not injected into dead instructions that are optimizedout by the backend compiler. Due to backend optimizations after the IR is gener-ated, the mapping of instructions may be changed at the machine assembly levels(e.g., SASS level). This may result in different absolute values of the SDC rate forfault injections performed at different levels. However, as we said before, we areinterested in obtaining insights into error propagation intrinsic to applications in-stead of deriving derated SDC rates. 
Finally, a previous study on CPU applicationsshowed that there is negligible difference in SDC rates between fault injectionsperformed at the LLVM IR level and the assembly code level [147].7.3 Metrics for Error PropagationIn this section, we define the metrics for measuring error propagation in our experi-ments. We measure error propagations along two axes, namely (1) execution time,which captures the temporal nature of the propagation, and (2) memory states,which captures the spatial nature of the propagation. We examine these in furtherdetail below.1107.3.1 Execution timeA fault can propagate in the program corrupting data values until it either causesprogram termination (e.g., by a crash), or it is masked. The former happens if thefault crashes the program, or the program finishes execution successfully (programhangs are handled through a watchdog timer). The latter happens if the faultydata is overwritten, or if the values to which the fault propagates are discardedby the program. The execution time metric measures the time between the fault’soccurrence and the masking or termination events.We use kernel invocations to measure the propagation time of an error. For ex-ample, if an error occurs in a certain kernel invocation K1, and the program crashesafter two more invocations of the kernel, say K2 and K3, we label the executiontime of this fault to be 2. There are three reasons for using kernel invocations asthe unit of propagation. First, we are often interested in knowing if an error prop-agates across the CPU-GPU boundary or across multiple kernel invocations on theGPU. The number of kernel invocation captures this value. The second reason isthat unlike other metrics such as wall-clock time (e.g., seconds), or the number ofexecuted instructions which are platform dependent, the number of kernel invoca-tions depends only on the GPU application. Thus, it captures the application-levelsemantics of error propagation without being affected by platform-specific details.This is important as our goal is to design application level error-resilience mech-anisms. Finally, unlike CPU applications which have a few long-living threads,GPU applications typically have a large number of short-lived threads executingon the GPU as they focus on throughput. When these threads terminate, the con-trol is passed back to the CPU, thereby resulting in frequent CPU-GPU boundarycrossings. We found that the average kernel invocation time in our benchmarks isusually less than one minute - this is in line with prior work [138].In addition to kernel invocations, GPU applications also need to transfer datafrom CPU memory to GPU memory and back. Typical GPU applications performmultiple kernel invocations in between transfers to amortize the latency of trans-fers. We define a kernel cycle as a sequence of kernel invocations by the applica-tion that is prefixed by memory transfer from the CPU to the GPU, and suffixedby memory transfer from the GPU back to the CPU. All applications in our study111Figure 7.2: Code Example of a Kernel Cycle from Benchmark bfsexcept LULESH and NMF had only a single kernel cycle. However, all of themhave multiple kernel invocations within a single kernel cycle.Figure 7.2 shows an example of a kernel cycle in the bfs application. The firstphase of the kernel cycle (lines 1-10) consists of memory allocations, and datamovement from the host (CPU) to the device (GPU) memory. 
7.3.2 Memory States

To better understand error propagation, we examine which parts of a GPU program's memory have been affected by an error after fault injection. We divide memory into three categories, namely Total Memory (TM), Result Memory (RM) and Output Memory (OM). TM is a superset of RM, which in turn is a superset of OM. This is shown in Figure 7.3. TM refers to the entire memory space allocated for the program on the device. The allocations are usually done through cudaMalloc calls, and through global variables declared by kernels. RM refers to the memory locations containing the computation results that the CPU transfers from the GPU at the end of a kernel cycle. OM refers to the memory locations containing the data that the CPU actually processes for computing the program output.

In the example in Figure 7.2, the transfer occurs at line 16, so gpu_result_cost is the pointer to the RM in this example. Further, the processing phase occurs in lines 18-19, so the OM consists of the results of the dumpCostForResult function. Note that applications may choose only certain parts of the RM to copy into their output. For example, a floating point application may use only the two most significant digits from the result in RM to compute the output, in which case the OM consists of only these two digits.

We use SDC to refer to corruption of the above three categories of memory, as all of these pertain to data corruptions. We use the memory type as a subscript to denote corruptions of different memory categories, e.g., SDCRM. Note that SDCOM is what is typically defined as an SDC in prior work [50, 63, 151] on GPUs, as they only study the effect of faults on the final output of the application. However, our aim is to study error propagation, and hence we study data corruption in the memory states of the application. We also refer to corruption of the memory that is in TM but not RM as (TM-RM), and that in RM but not OM as (RM-OM).

For example, in Figure 7.3, assume that an error occurs during a kernel invocation (K1) and affects the memory location (L1) in the TM. However, it does not propagate to the RM. In the next kernel invocation (K2), the faulty value in L1 is read and affects a value at another location L2 in RM. Hence we say the error propagates from the TM to the RM during K2. In this case, the error causes an SDCTM after kernel K1, and an SDCRM after kernel K2. However, the error does not propagate to the OM, and hence does not result in an SDCOM.

We also measure what fraction of the memory is contaminated by error propagation. We define this as the spread of an error. For example, in Figure 7.3, at K3, the faulty value in location L2 is assigned to different memory locations (L3, L4 and L5), and hence propagates to these locations. In this case, the faulty value has propagated to a total of 5 locations at the end of K3. Assuming TM consists of 100 memory locations, the error spread after K3 in this case is 5/100 = 5%.

Figure 7.3: Memory State Layout for CUDA Programming Model (K2 and K3 are kernel invocations)
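As a concrete illustration of how the spread metric can be computed from the per-kernel memory dumps, the following sketch compares a TM snapshot from a fault-injection run against the corresponding golden snapshot and reports the fraction of differing locations. The raw-dump file layout and the 4-byte location granularity are assumptions made for illustration; they are not the actual LLFI-GPU format.

    // Sketch: compute error spread by diffing a faulty TM snapshot against the
    // golden snapshot taken at the same kernel invocation. Assumes both files are
    // raw dumps of equal size, compared at 4-byte "location" granularity.
    #include <cstring>
    #include <fstream>
    #include <iostream>
    #include <iterator>
    #include <vector>

    static std::vector<char> load(const char *path) {
      std::ifstream in(path, std::ios::binary);
      return std::vector<char>((std::istreambuf_iterator<char>(in)),
                               std::istreambuf_iterator<char>());
    }

    int main(int argc, char **argv) {
      if (argc != 3) {
        std::cerr << "usage: spread golden.bin faulty.bin\n";
        return 1;
      }
      std::vector<char> golden = load(argv[1]), faulty = load(argv[2]);
      if (golden.empty() || golden.size() != faulty.size()) return 1;

      size_t words = golden.size() / 4, corrupted = 0;
      for (size_t w = 0; w < words; ++w)
        if (std::memcmp(&golden[w * 4], &faulty[w * 4], 4) != 0) ++corrupted;

      std::cout << "error spread = " << 100.0 * corrupted / words
                << "% of TM locations\n";
      return 0;
    }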
7.4 Experimental Setup

We describe the benchmarks used and the fault injection procedure. We also provide details of the hardware and software platform used for the measurements, followed by the research questions.

7.4.1 Benchmarks Used

We choose twelve GPGPU applications in total for our experiments. These are drawn from standard GPU benchmark suites such as Rodinia [33] and Parboil [133], as well as real-world applications. We choose five programs from Rodinia and two programs from Parboil. The applications from Rodinia were chosen based on two criteria: (1) compatibility with our toolset (i.e., we could compile them with NVCC), and (2) suitability for our experiments. For the latter criterion, we discard small applications that had too few kernel invocations (as it is uninteresting to measure error propagation in such applications), and applications in which the outputs were non-deterministic (as it is difficult to classify the results of an injection or error propagation in such applications). For Parboil, we randomly choose two applications from the suite to balance time with representativeness.

In addition to the standard benchmarks, we pick five real-world HPC GPGPU applications: Lulesh [80], Barnes-Hut [116], Fiber [152], Circuit [1] and NMF [16]. These applications perform hydrodynamics modeling, n-body simulation, fiber scattering simulation, circuit solving and audio source processing respectively.

Table 7.1 shows the details of the applications used in our study and the inputs used. The number of kernel invocations ranges from 4 to 8567 in these applications. The lines of C code of these applications range from 222 to 5684. We configured our benchmarks to run on a single GPU as our goal is to study error propagation between kernels. Multi-GPU programs also transfer data and synchronize at kernel boundaries [3], and we hypothesize that our results generalize to such programs - validating this hypothesis is a subject of future work.

Similar to prior work in the area [11, 50, 63, 65, 150], we run each benchmark application with a single input. However, questions remain on whether multiple inputs may affect error propagation behaviors. We hypothesize that different inputs have limited effect on error propagation as the propagation is primarily dominated by the application's algorithm, rather than the problem size. This is because different inputs likely only scale the execution times of certain code sections, rather than change the underlying program structure. We will further validate this hypothesis in future work.

Table 7.1: Characteristics of GPGPU Programs in our Study

Benchmark | Benchmark Suite/Author | Description | Kernel Invocations | LOC | Input
BFS | Rodinia (v2.1) | An algorithm for traversing or searching tree or graph data structures | 15 | 342 | 4096
LUD | Rodinia (v2.1) | An algorithm to calculate the solutions of a set of linear equations | 9 | 564 | 64
PathFinder | Rodinia (v2.1) | Use dynamic programming to find a path on a 2-D grid | 4 | 236 | 100000 100 20
Gaussian | Rodinia (v2.1) | Compute result row by row, solving for all of the variables in a linear system | 29 | 394 | 16
HotSpot | Rodinia (v2.1) | Estimate processor temperature based on an architectural floorplan and simulated power measurements | 15 | 328 | 512 2 32
cutcp | PARBOIL (v0.2) | Computes the short-range component of Coulombic potential at each grid point | 10 | 1540 | watbox.sl40.pqr
stencil | PARBOIL (v0.2) | An iterative Jacobi stencil operation on a regular 3-D grid | 99 | 1584 | 128 128 32 100
Barnes-Hut | Texas State Univ. San Marcos (v2.1) | An approximation algorithm for performing an n-body simulation developed by Texas State Univ. San Marcos | 20 | 965 | 4 4
Lulesh | Lawrence Livermore National Laboratory (v1.0) | Science and engineering problems that use modelling hydrodynamics | 8567 | 5684 | edgeNodes = 2
Fiber | Northeastern University (v1.5) | High Performance Computing of Fiber Scattering Simulation application | 2881 | 1437 | 480 4 20
Circuit | Rice University | Parallel circuit solver for solving a 2D circuit grid using the Jacobi method | 450 | 222 | 0.00001 1
NMF | UC Berkeley | Audio analysis and source separation | 409 | 2398 | default

7.4.2 Fault Injection Method

As mentioned before, we use LLFI-GPU to perform the fault injection experiments. We consider only one fault per run as hardware transient faults are rare events relative to the program execution times. For each application, we inject 10,000 faults in total - this yields error bars ranging from 0.22% to 1.11% (at the 98% confidence level), depending on the application. Further, we use the single bit-flip model for injecting faults as it is the de-facto fault model used in studies of transient faults. Although recent work [38] has found that hardware faults may manifest as both single and multiple bit flips at the software level, other studies have shown that there is very little difference in failure rates due to single and multiple bit flips [14, 37, 96]. Therefore, we stick to the single bit-flip model in this study.

To obtain a golden run for error propagation analysis, we first run the program without any fault injections. We then gather the output of the program and the memory data stored after each kernel invocation as described in Section 7.2.2. We measure error propagation and error spreading by comparing data from fault injection runs with the golden run. This comparison is done on a bit-wise basis, except for floating point numbers, which are compared using 40 digits of precision. As we omit benchmarks that have random values in the program output, the golden runs of the chosen benchmarks are deterministic. We manually verified that this was the case for our benchmarks.

There are three kinds of failures that can occur due to an injected fault: Crashes, SDCs and Hangs. Crashes are found by using the CUDA API call cudaGetLastError() after every kernel invocation. SDCs are found by comparing the program's output with the golden run for each memory type (TM, RM, and OM). Hangs are found by setting a watchdog timer for 5000 seconds when the program starts - this is much larger than the time taken by each application run.

7.4.3 Hardware and Software Platform

Fault injection experiments are performed on host PCs with an Intel Xeon CPU and 32GB of DDR3 memory. We use two GPU platforms, both from Nvidia, namely the Tesla K20 and the GTX960, for running our experiments. The results were similar on both platforms - this is not surprising as our experiments were at the application level. We therefore report only the results on the K20 platform in this paper. We further detail this comparison in RQ7. The operating system running on the host is Ubuntu Linux 12.04 64-bit, and the CUDA driver and Toolkit used is V6.0.1.
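To illustrate the failure-detection step of Section 7.4.2, the sketch below shows how a crash can be detected with cudaGetLastError() after a kernel invocation. The kernel and launch configuration are placeholders; hang detection through the external watchdog timer is not shown.

    // Sketch: classifying a kernel invocation as a Crash using the CUDA runtime.
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cstdlib>

    __global__ void someKernel(int *data, int n) {       // placeholder kernel
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) data[i] *= 2;
    }

    void launchAndCheck(int *d_data, int n) {
      someKernel<<<(n + 255) / 256, 256>>>(d_data, n);
      cudaError_t err = cudaGetLastError();               // launch-time errors
      if (err == cudaSuccess)
        err = cudaDeviceSynchronize();                    // execution-time errors
      if (err != cudaSuccess) {
        fprintf(stderr, "crash detected: %s\n", cudaGetErrorString(err));
        exit(1);                                          // recorded as a Crash outcome
      }
    }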
7.4.4 Research questions (RQs)

We answer the following questions in our study.

RQ1: What is the percentage of SDCs in different memory states?
RQ2: How long do errors take to propagate to the RM?
RQ3: Do errors spread into different memory states and why?
RQ4: How many faults are masked within the GPU kernel and not allowed to propagate?
RQ5: Do crash-causing faults propagate to the host CPU before they cause crashes?
RQ6: Do crash-causing faults propagate across kernels before they cause crashes?
RQ7: Are resilience characteristics of applications different on different GPU platforms?

7.5 Results

The results are organized by the research questions asked in Section 7.4. We first present the aggregate results of the fault injections across all the benchmarks.

7.5.1 Aggregate Fault Injections

Figure 7.4 shows the aggregate results for our fault injection experiments. SDCs and Benign outcomes are measured by comparing the programs' final outputs with the golden run. This corresponds to the data recorded in OM after the programs finish their executions. So SDCs here correspond to SDCOM, and benign outcomes to BenignOM. On average, crashes constitute 17.52%, SDCs constitute 18.98% and Benign faults constitute 63.35% of the injections. In our experiments, hangs are negligible, and are hence not reported.

Figure 7.4: Aggregate Fault Injection Results across the 12 Programs

Table 7.2: SDCs that occur in the different memory types

SDC type | bfs | lud | pathfinder | stencil | cutcp | gaussian | hotspot | barneshut | lulesh | circuit | fiber | nmf | average
SDCOM | 17.01% | 45.58% | 20.60% | 37.00% | 16.50% | 10.90% | 29.40% | 1.30% | 0.97% | 7.50% | 7.01% | 43.20% | 19.67%
SDC(RM-OM) | 0.00% | 0.00% | 1.30% | 0.00% | 28.90% | 0.00% | 6.00% | 2.30% | 0.10% | 14.80% | 12.69% | 10.10% | 6.35%
SDCRM | 17.01% | 45.58% | 21.90% | 37.00% | 45.40% | 10.90% | 31.50% | 3.60% | 1.07% | 22.30% | 19.70% | 53.30% | 25.70%
SDC(TM-RM) | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.80% | 0.00% | 0.20% | 0.00% | 0.10% | 0.00% | 0.00% | 0.09%
SDCTM | 17.01% | 45.58% | 21.90% | 37.00% | 45.40% | 11.70% | 31.50% | 3.80% | 1.07% | 22.40% | 19.70% | 53.30% | 25.79%

7.5.2 Error Propagation

RQ1: What is the percentage of SDCs in different memory states?

We analyze the memory data in each memory type after the last kernel invocation in the kernel cycle, and compare it with the golden run. Table 7.2 shows the SDCs measured at different memory locations at the end of the kernel cycle. On average, SDCOM, SDCRM and SDCTM are 19.67%, 25.70% and 25.79% respectively. As can be seen, the values of SDCRM and SDCTM are very similar across applications. In other words, most faults in the TM propagate to the RM. This is surprising as it suggests that there is little to no masking of errors in the TM. On the other hand, CPU applications are known to exhibit significant error masking in memory [11]. One possible reason is that unlike CPUs, GPUs perform highly specialized computations, and hence all the results produced are important to the application and propagate to the output.

However, there is a difference of about 6% on average between SDCOM and SDCRM (see SDC(RM-OM)). This is due to the application masking the error either through type-casting or selective truncation of floating point data. An example of this is cutcp, which exhibits a difference of nearly 30% between the SDCOM and the SDCRM.
Overall, about 76% of the faults propagate from the RM to the OM,which is again much higher than observed in CPU applications [11].We note that the lulesh application comes equipped with application-level al-gorithm correctness checks (i.e., residual check). For example, the documentationfor lulesh states that the correct output consists of the correct number of iterations,and six most-significant digits in the final origin energy variable [79]. Therefore,SDCOM here is measured based on these correctness checks. None of the other ap-plications however come with such checks, and hence we consider the entire outputin these cases for comparison with the fault-injected run.RQ2: How long do errors take to propagate to the RM?Figure 7.5: Detection Latency of faults that result in SDCRMOnce a fault is injected in a kernel, we measure the number of kernel calls afterwhich it propagates to RM (if it does). Figure 7.5 shows the results. As shown inthe figure, most faults that affect the RM do so within a single kernel invocationafter injection. This means that errors propagate relatively soon to the RM aftertheir occurrence. The exceptions are bfs, gaussian and barneshut.Figure 7.6 shows the percentage of RM updated after each kernel invocation.For the sake of space, we only show the results for two applications, namely lud andgaussian. As we can see, lud updates the RM in every kernel invocation, whereasgaussian does not. This explains why the propagation occurred within one kernelexecution for lud, but not gaussian.120Figure 7.6: Percentage of RM Updated by Each Kernel Invocation. Y-axisis the percentage of RM locations that are updated during each kernelinvocation. X-axis represents timeline in terms of kernel invocations.7.5.3 Error SpreadingWe defined error spreading as the percentage of memory locations contaminateddue to an error (Section 7.3). Unlike the previous section where we consideredany difference from the golden run as an SDC for that memory type, here we onlyconsider the amount of memory that is different.RQ3: Do errors spread into different memory states and why?Figure 7.7: Percentage of TM and RM Contaminated at Each Kernel Invoca-tion. Y-axis is the percentage of contaminated memory locations, X-axisis timeline in terms of kernel invocations. Blue lines indicate TM, andred lines represent RM.From our fault injection experiments, we examine how errors spread as a func-tion of time (i.e., kernel invocations). Because we performed 10,000 injectionsper application, we cannot show all the data. Therefore, we only show represen-121tative injections for each application. Figure 7.7 shows the percentage of memorylocations in TM and RM that are contaminated by the injected faults at the firstdynamic kernel invocation. The spread is calculated as (Contaminated TM or RMLocations / Total TM Locations) * 100.Our main findings are: (1) error-spreading is very application-specific. For ex-ample, a single fault can contaminate nearly 60% of TM memory locations in lud,whereas the number is as low as 0.0006% in cutcp. (2) only a very small amountof memory locations are affected in the same kernel where the fault was injected.Rather, most faults propagate into memory locations in later kernel invocations. (3)trends of error spreading between RM and TM of the same application are rathersimilar, though the absolute values may be different. 
In other words, applications that have extensive error spread in the TM have extensive error spread in the RM as well.

Finally, in almost all applications the error spreading either increases or remains constant as the number of kernel calls increases. The exception to this is the stencil application, in which the error spreading decreases significantly as the number of kernel calls increases. This is because the stencil algorithm takes neighbours' values and keeps averaging them, so the errors may eventually be masked as the averaging process progresses. We also observed that there is a small decrease in error spreading in the bfs application after it reaches its peak in TM. This is because some of the locations in TM are reassigned during program execution, and faulty data in these locations can be overwritten with correct data, thereby masking the errors.

Because lud exhibits the highest error spread, we study its code structure to understand the reasons. Figure 7.8 shows the code structure leading to extensive error spread in lud. The code exhibits a cyclic data flow from global memory to shared memory, and then back to global memory. Note that shared memory is used to transfer data only between threads in the same block. In the example, dia is a pointer to shared memory, and m points to global memory. At line 5, a portion of the shared memory in dia is initialized from global memory m. This portion of data in dia is shared by other threads in its block. After the shared data is consumed, the kernel writes data back to global memory m at line 9. If a fault occurs in m at line 5, dia will first be compromised, and the data processed by other threads in the same block may be affected after reading the corrupted data from dia. Then, at the end of the kernel invocation, a different part of m may be corrupted by reading data from dia (line 9). In the next invocation of this kernel, faulty values in m may be used in the initialization of dia again at line 5. But this time, they may initialize a different portion of dia that is used by threads in a different block, because different parts of m were corrupted in the previous kernel invocation. This leads to extensive error spread for this application.

Figure 7.8: Code Structure Leading to Extensive Error Spread in lud
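The following simplified kernel sketches the cyclic global-to-shared-to-global data flow described above. It is a hypothetical kernel written for illustration, not the actual lud source shown in Figure 7.8; the tile size, indexing, and update step are placeholders.

    // Simplified sketch of the data flow that amplifies error spread in lud.
    #define BLOCK 16

    __global__ void diagonal_like(float *m, int n) {
      __shared__ float dia[BLOCK * BLOCK];
      int tx = threadIdx.x;

      // Global memory m initializes the shared tile dia: a corrupted element of
      // m becomes visible to every thread in this block.
      for (int i = 0; i < BLOCK; ++i)
        dia[i * BLOCK + tx] = m[blockIdx.x * BLOCK * n + i * n + tx];
      __syncthreads();

      // Threads consume and update the shared tile, mixing a corrupted value
      // into other elements of dia.
      for (int i = 1; i < BLOCK; ++i) {
        if (tx >= i)
          dia[i * BLOCK + tx] -= dia[i * BLOCK + (i - 1)] * dia[(i - 1) * BLOCK + tx];
        __syncthreads();
      }

      // The tile is written back to global memory; in the real lud kernel the
      // write-back touches a different part of m, which is how corruption
      // migrates to other blocks in subsequent kernel invocations.
      for (int i = 0; i < BLOCK; ++i)
        m[blockIdx.x * BLOCK * n + i * n + tx] = dia[i * BLOCK + tx];
    }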
7.5.4 Fault Masking

In the previous two sections, we examined how faults propagate across different memory locations and kernel executions. We now ask the complementary question: how many faults are masked within a single kernel invocation of their occurrence?

RQ4: How many faults are masked within the GPU kernel and not allowed to propagate?

Table 7.3 shows the percentage of benign faults measured at the first kernel invocation after the fault injection (BenignTM). We measure this value by comparing all memory locations in TM with the golden run right after the first kernel invocation in which the fault is injected. If there is no difference between the two TMs, we count it as a benign fault. We consider TM here as it is a superset of both RM and OM. Therefore, if a fault is masked in the TM, it is also masked in the other two types of memory (even in future kernel executions).

Table 7.3: Percentage of Benign Faults Measured at the First Kernel Invocation after Fault Injection

bfs | lud | pathfinder | stencil | cutcp | gaussian | hotspot | barneshut | lulesh | circuit | fiber | nmf | average
63.5% | 28.5% | 69.9% | 25.0% | 42.7% | 81.1% | 57.2% | 76.4% | 92.5% | 25.6% | 62.9% | 20.0% | 53.4%

As we can see from the table, more than 50% of the injected faults are masked within the same kernel they are injected in, and do not propagate. The maximum masking is achieved in the case of lulesh, in which nearly 92.5% of the faults are masked. The minimum masking is achieved for nmf, in which only 20% of the faults are masked. Such variation across applications arises because programs contain different amounts of the code structures that lead to masking effects.

We found two prevalent patterns leading to error masking in our benchmark programs, namely (1) Comparison and (2) Truncation. An example of Comparison is shown on the left of Figure 7.9. R0 and R1 are initialized at lines 2 and 3, and R2 holds the result of comparing R0 and R1 at line 3. Consider a fault that flips the first bit of R1 - R1 erroneously becomes 1110 instead of 1111. However, the result in R2 will not be affected since R1 is still greater than R0. An example of Truncation is shown in the right part of Figure 7.9. At line 2, R0 is initialized. At line 3, the value of R1 is truncated from 0001 to 01. Consider a fault that occurs at line 2 and flips either of the left-most 2 bits of R0 - it will not affect the value of R1 at line 3 due to the truncation. Hence the fault will be masked.

Figure 7.9: Examples of Fault Masking. (a) Comparison, (b) Truncation
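The following is an illustrative C++ analogue of the two masking patterns; Figure 7.9 itself uses register-level pseudocode, so the snippet below is a sketch rather than a reproduction of the figure.

    // Illustrative C++ analogue of the two fault-masking patterns.
    #include <cstdint>
    #include <cstdio>

    int main() {
      // (a) Comparison: a bit flip in r1 that keeps r1 > r0 does not change r2.
      uint8_t r0 = 0x3;              // 0011
      uint8_t r1 = 0xF;              // 1111
      r1 ^= 0x1;                     // fault: lowest bit flips, 1111 -> 1110
      bool r2 = (r1 > r0);           // still true: the fault is masked

      // (b) Truncation: a bit flip in the discarded high-order bits does not
      // survive the narrowing operation.
      uint8_t a = 0x1;               // 0001
      a ^= 0x8;                      // fault: a high-order bit flips, 0001 -> 1001
      uint8_t b = a & 0x3;           // only the low 2 bits are kept: fault masked

      printf("r2=%d b=%u\n", (int)r2, (unsigned)b);
      return 0;
    }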
7.5.5 Crash-causing Faults

The next two research questions have to do with crash-causing faults and their propagation to the CPU (host) or to other GPU kernels. We focus on crash-causing faults as prior work has found that long-latency crashes can lead to checkpoint corruptions, and cause unrecoverable failures [89, 150].

RQ5: Do faults propagate to the host CPU and cause crashes?

From our experiments, we observe that there is a very small chance (0.02% on average) for a fault that occurs in a kernel to contaminate the CPU state and eventually lead to a crash. This is because the memory address spaces of the CPU and the GPU are separate, and hence faulty pointers produced by the GPU are unlikely to be used in the CPU to access memory. Because faulty pointers are responsible for the majority of crash-causing errors on the CPU [89], these errors do not lead to crashes.

RQ6: Do crash-causing faults propagate across kernels before they cause crashes?

In our experiments, we find that no fault propagates across kernels and causes a crash. In other words, crash-causing faults typically cause crashes within the kernel in which they occur. This is because pointers or memory address offset variables are usually passed between kernels in the constant memory space, which is read-only. Note, however, that it is possible for faults to propagate across kernels if address offsets of pointers are passed through global variables (we have empirically verified this observation through carefully constructed code samples - we do not present these due to space constraints). However, typical GPGPU programs do not exhibit this behavior, as each thread is responsible for its own memory locations, and hence multiple threads do not read the same offset to calculate the same memory address.

7.5.6 Platform Differences

RQ7: Are resilience characteristics of applications different on different GPU platforms?

We use the two platforms, K20 and GTX960, for this experiment. To answer this question, we performed 1,000 fault injections for each benchmark application on the two platforms. We focus on SDCs for this experiment as these are often the most important concern in practice. Note that we have omitted two programs, fiber and nmf, as we encountered errors when compiling them on the GTX960, probably because they use features that are not supported on that platform. To compare the distributions of the SDC values, we ran a t-test between the values obtained on the two platforms. We found that the p-value was 0.991. Thus, we fail to reject the null hypothesis, indicating that the values are statistically indistinguishable from each other. Therefore, we conclude that the resilience characteristics of the applications do not vary significantly between the two platforms.

7.6 Implications

In this section, we consider the implications of the results on error detection and recovery techniques. These are organized by the RQs.

Table 7.4: Size of OM, RM and TM

Memory | bfs | lud | pathfinder | stencil | cutcp | gaussian | hotspot | barneshut | lulesh | circuit | fiber | nmf | geo mean
OM | 14.29% | 7.50% | 0.99% | 7.50% | 7.50% | 1.67% | 5.00% | 18.75% | 12.50% | 15.00% | 49.98% | 0.03% | 5.02%
RM | 14.29% | 50.00% | 1.98% | 50.00% | 50.00% | 3.21% | 66.67% | 37.50% | 18.75% | 50.00% | 49.98% | 0.03% | 13.56%
TM | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100%

RQ1: Percentages of SDCs in different memory states. In RQ1, we found that most SDCs in the TM propagate to the RM, and more than 76% of the SDCs in the RM propagate to the OM. Table 7.4 shows the size of the data in OM, RM and TM in each application. As we can see, the RM is only a small fraction of the TM. This suggests that checking the RM for consistency (e.g., using detectors [108]) may be much more efficient than checking the entire TM. Another possibility is checking the OM, which is even smaller than the RM. However, the OM may not be updated until the end of the program, and hence checking the OM may incur high detection latency.

RQ2: Detection Latency of Errors in RM. Error detection latency is critical when designing checkpoint intervals. If the detection latency is too long, errors may propagate to checkpoints before they are detected, thereby corrupting the checkpoints [89]. Our results show that errors propagate to the RM relatively soon after their occurrence (i.e., within one kernel call in most cases). Therefore, placing detectors on the RM will ensure low-latency error detection.
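A minimal sketch of such an RM-level detector follows, assuming the RM is copied back to the host as a contiguous buffer and that a reference checksum (e.g., from a golden run, an ABFT invariant, or a redundant execution) is available for comparison; all names are illustrative.

    // Sketch: lightweight consistency check over the Result Memory (RM) at a
    // kernel-cycle boundary. How the reference checksum is obtained is an
    // assumption of this sketch, not a result of this chapter.
    #include <cstddef>
    #include <cstdint>

    uint64_t checksumRM(const uint32_t *rm, size_t words) {
      uint64_t sum = 0;
      for (size_t i = 0; i < words; ++i)
        sum = sum * 1315423911u + rm[i];            // simple order-sensitive hash
      return sum;
    }

    // Called right after the device-to-host copy of the RM.
    bool rmLooksConsistent(const uint32_t *rm, size_t words, uint64_t expected) {
      return checksumRM(rm, words) == expected;     // mismatch => raise an alarm
    }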
RQ3: Error spreading to different memory states. Error spreading is highly application specific in GPGPU applications. Further, only certain code structures in GPGPU programs may lead to extensive error spread. Therefore, one can statically analyze the program to identify such structures to protect. Further, the code can be restructured to avoid error spreading in some circumstances. In some applications (e.g., stencil), there may be natural mechanisms in the code to dilute the effect of error spreading over time.

RQ4: Effect of error masking. We found that there is substantial error masking within GPU kernels, and that many errors do not even affect the TM after they occur. This means that there may not be a need to deploy expensive error detection mechanisms such as Dual Modular Redundancy (DMR) or Error Detection by Duplicated Instructions (EDDI) [106] within the kernel, unless it is a safety-critical application. Instead, one can check the results after the kernel's invocation. For example, ABFT-based detection algorithms [45, 73] can be used at the kernel boundary to detect errors, and to determine whether a GPU kernel re-execution should be initiated.

RQ5 & RQ6: Crash-causing Faults and Checkpoint Scheme. Studies of error propagation on CPUs find that crash-causing faults can propagate for a long time before they cause crashes. Hence, checkpoints may be corrupted by these faults if the crash latency in the program is not bounded [88, 89]. However, in GPGPU programs, we find that crash-causing faults do not propagate outside the kernel in which they occur. In other words, the crash latency of GPGPU programs is naturally bounded within one kernel invocation. Therefore, one can place checkpoints at kernel boundaries for crash recovery. As we find that many kernels do not propagate errors to other kernels, individual kernels could also recover from failures through re-execution at the kernel boundaries. The application can be restarted locally on the same GPU or on a spare GPU. Further, as most kernels have short execution times in the range of milliseconds, the cost of re-executing a kernel would be insignificant.

RQ7: Differences across platforms. From our findings, it appears that the error resilience of GPGPU applications does not depend on the specific hardware platform (we have only validated this on platforms from the same manufacturer, which was Nvidia in our case). This suggests that one can perform resilience characterization on one platform and generalize the results to a different platform.

7.7 Summary

In this chapter, we study error propagation in GPGPU applications with the goal of building targeted error detection and recovery mechanisms for them. We built a fault injection tool, LLFI-GPU, and defined metrics for quantifying propagation in GPGPU applications. We empirically studied error propagation across twelve GPGPU applications using LLFI-GPU. The main findings are: (1) crash-causing faults in GPGPU applications are naturally kernel-bounded, (2) error spreading in memory is highly application dependent, (3) most memory data corruptions lead to output corruption, unlike what is observed in CPU programs, and (4) the majority of faults are masked within a single kernel execution, and do not propagate across kernels.

Chapter 8

Understanding Error Propagation in Deep Learning Neural Network (DNN) Accelerators and Applications

This chapter investigates error propagation in DNN accelerators and applications, which have recently been deployed in safety-critical environments such as self-driving cars. Many studies have focused on the performance aspects of these accelerators, but their reliability is not well understood. In this chapter, we first build a fault injection infrastructure for DNN accelerators and applications, and then characterize error propagation through an empirical study. We find that DNN accelerators and applications have very distinctive error propagation characteristics compared with general-purpose applications. Based on our investigation, we propose cost-effective error mitigation techniques for DNN accelerators and applications.

8.1 Introduction

Deep learning neural network (DNN) applications are widely used in high-performance computing systems and datacenters [19, 44, 118].
Researchers have proposed theuse of specialized hardware accelerators to accelerate the inferencing process of129DNNs, consisting of thousands of parallel processing engines [34, 36, 62]. Forexample, Google recently announced its DNN accelerator, the Tensor ProcessingUnit (TPU), which they deploy in their datacenters for DNN applications [145].While the performance of DNN accelerators and applications have been ex-tensively studied, the reliability implications of using them is not well understood.One of the major sources of unreliability in modern systems are soft errors, typi-cally caused by high-energy particles striking electronic devices and causing themto malfunction (e.g., flip a single bit) [20, 40]. Such soft errors can cause appli-cation failures, and they can result in violations of safety and reliability specifica-tions. For example, the IEC 61508 standard [74] provides reliability specificationsfor a wide range of industrial applications ranging from the oil and gas industry tonuclear power plants. Deep neural networks have promising uses for data analyt-ics in industrial applications [149], but they must respect the safety and reliabilitystandards of the industries where they are employed.A specific and emerging example of HPC DNN systems is for self-driving cars(i.e., autonomous vehicles), which deploy high performance DNNs for real-timeimage processing and object identification. The high performance, low power, highreliability, and real-time requirements of these applications require hardware thatrivals that of the fastest and most reliable supercomputer (albeit with lower preci-sion and memory requirements). For instance, NVIDIA Xavier—a next-generationSoC with a focus on self-driving applications—is expected to deliver 20 Tops/s at20W in 16nm technology [126]. We focus our investigation into DNN systems1in this emerging HPC market due to its importance, stringent reliability require-ments, and its heavy use of DNNs for image analysis. The ISO 26262 standardfor functional safety of road vehicles mandates the overall FIT rate2 of the Systemon Chip (SoC) carrying the DNN inferencing hardware under soft errors to be lessthan 10 FIT [119]. This requires us to measure and understand the error resiliencecharacteristics of these high-performance DNN systems.This chapter takes a first step towards this goal by (1) characterizing the prop-agation of soft errors from the hardware to the application software of DNN sys-1We use the term DNN systems to refer to both the software and the hardware accelerator thatimplements the DNN.2Failure-in-Time rate: 1 FIT = 1 failure per 1 billion hours130tems, and (2) devising cost-effective mitigation mechanisms in both software andhardware, based on the characterization results .Traditional methods to protect computer systems from soft errors typicallyreplicate the hardware components (e.g., Triple Modular Redundancy or TMR).While these methods are useful, they often incur large overheads in energy, per-formance and hardware cost. This makes them very challenging to deploy in self-driving cars, which need to detect objects such as pedestrians in real time [83] andhave strict cost constraints. To reduce overheads, researchers have investigatedsoftware techniques to protect programs from soft errors, e.g., identifying vulnera-ble static instructions and selectively duplicating the instructions [52, 64, 88]. Themain advantage of these techniques is that they can be tuned based on the applica-tion being protected. 
However, DNN software typically has a very different struc-ture compared to general-purpose software. For example, the total number of staticinstruction types running on the DNN hardware is usually very limited (less thanfive) as they are repeatedly executing the multiply-accumulate (MAC) operations.This makes these proposed techniques very difficult to deploy as duplicating evena single static instruction will result in huge overheads. Thus, current protectiontechniques are DNN-agnostic in that they consider neither the characteristics ofDNN algorithms, nor the architecture of hardware accelerators. To the best of ourknowledge, we are the first to study the propagation of soft errors in DNN systemsand devise cost-effective solutions to mitigate their impact.We make the following major contributions in this paper:• We modify a DNN simulator to inject faults in four widely used neural net-works (AlexNet, CaffeNet, NiN and ConvNet) for image recognition, usinga canonical model of the DNN accelerator hardware.• We perform a large-scale fault injection study using the simulator for faultsthat occur in the data-path of accelerators. We classify the error propagationbehaviors based on the structure of the neural networks, data types, positionsof layers, and the types of layers.• We use a recently proposed DNN accelerator, Eyeriss [35], to study the effectof soft errors in different buffers and calculate its projected FIT rates.131• Based on our observations, we discuss the reliability implications of design-ing DNN accelerators and applications and also propose two cost-effectiveerror protection techniques to mitigate Silent Data Corruptions (SDCs) i.e.,incorrect outcomes. The first technique, symptom-based detectors, is imple-mented in software and the second technique, selective latch hardening, inhardware. We evaluate these techniques with respect to their fault coverageand overhead.Our main results and implications are as follows:• We find that different DNNs have different sensitivities to SDCs dependingon the topology of the network, the data type used, and the bit position ofthe fault. In particular, we find that only high-order bits are vulnerable toSDCs, and the vulnerability is proportional to the dynamic value range thedata type can represent. The implications of this result are twofold: (1)When designing DNN systems, one should choose a data type providingjust-enough dynamic value range and precision. The overall FIT rate ofthe DNN system can be reduced by more than an order of magnitude if wedo so - this is in line with recent work that has proposed the use of suchdata types for energy efficiency. (2) Leveraging the asymmetry of the SDCsensitivity of bits, we can selectively protect the vulnerable bits using ourproposed selective latch hardening technique. Our evaluation shows that thecorresponding FIT rate of the datapath can be reduced by 100x with about20% area overhead.• We observe that faults causing a large deviation in magnitude of values likelylead to SDCs. Normalization layers can reduce the impact of such faults byaveraging the faulty values with adjacent correct values, thereby mitigatingSDCs. While the normalization layers are typically used to improve perfor-mance of a DNN, they also boost its resilience. 
Based on the characteristics,we propose a symptom-based detector that provides 97.84% precision and92.08% recall in error detection, for selected DNNs and data types.• In our case study of Eyeriss, the sensitivity study of each hardware com-ponent shows that some buffers implemented to leverage data locality for132performance may dramatically increase the overall FIT rate of a DNN ac-celerator by more than 100x. This indicates that novel dataflows proposedin various studies should also add protection to these buffers as they maysignificantly degrade the reliability otherwise.• Finally, we find that for the Eyeriss accelerator platform, the FIT rates canexceed the safety standards (i.e., ISO 26262) by orders of magnitude withoutany protection. However, applying the proposed protection techniques canreduce the FIT rate considerably and restore it within the safety standards.8.2 Exploration of Design SpaceWe seek to understand how soft errors that occur in DNN accelerators propagatein DNN applications and cause SDCs (we define SDCs later in Section 8.3.6). Wefocus on SDCs as these are the most insidious of failures and cannot be detectedeasily. There are four parameters that impact of soft errors on DNNs:(1) Topology and Data Type: Each DNN has its own distinct topology whichaffects error propagation. Further, DNNs can also use different data types in theirimplementation. We want to explore the effect of the topology and data type on theoverall SDC probability.(2) Bit Position and Value: To further investigate the impact of data type onerror propagation, we examine the sensitivity of each bit position in the networksusing different data types. This is because the values represented by a data typedepend on the bit positions affected, as different data types interpret each bit dif-ferently (explained in Section 8.3.5). Hence, we want to understand how SDCprobabilities vary based on the bit corrupted in each data type and how the errorsresult in SDCs affect program values.(3) Layers: Different DNNs have different layers - this includes the differ-ences in type, position, and the total number of layers. We investigate how errorspropagate in different layers and whether the propagation is influenced by the char-acteristics of each layer.(4) Data Reuse: We want to understand how different data reuses implementedin the dataflows of DNN accelerators affects the SDC probability. Note that unlike133other parameters, data reuse is not a property of the DNN itself but of its hardwareimplementation.8.3 Experimental Setup8.3.1 NetworksTable 8.1: Networks UsedNetwork Dataset No. of Output Candidates TopologyConvNet [41] CIFAR-10 10 3 CONV + 2 FCAlexNet [81] ImageNet 1,000 5 CONV(with LRN) + 3FCCaffeNet [24] ImageNet 1,000 5 CONV(with LRN) + 3FCNiN [94] ImageNet 1,000 12 CONVWe focus on convolutional neural networks in DNNs, as they have shown greatpotential in solving many complex and emerging problems and are often executedin self-driving cars. There are four neural networks that we consider in Table 8.1.They range from the relatively simple 5-layer ConvNet to the 12-layer NiN. Thereasons we chose these networks are: (1) They have different topologies and meth-ods implemented to cover a variety of common features used in today’s DNNs, andthe details are publicly accessible, (2) they are often used as benchmarks in devel-oping and testing DNN accelerators [35, 77, 128], and (3) they are well knownto solve challenging problems, (4) and the official pre-trained models are freelyavailable. 
This allows us to fully reproduce the networks for benchmarking pu-poses. All of the networks perform the same task, namely image classification. Weuse the ImageNet dataset [75] for AlexNet, CaffeNet and NiN, and the CIFAR-10dataset [39] for ConvNet, as they were trained and tested with these datasets. Weuse these reference datasets and the pre-trained weights together with the corre-sponding networks from the Berkeley Vision and Learning Center (BVLC) [22].We list the details of each network in Table 8.1. As shown, all networks exceptNiN have fully-connected layers behind the convolutional layers. All four networksimplement ReLU as the activation function and use the max-pooling method intheir sub-sampling layers. Both AlexNet and CaffeNet use a Local Response Nor-134malization (LRN) layer following each of the first two convolutional layers - theonly difference is the order of the ReLU and the sub-sampling layer in each con-volution layer. In AlexNet, CaffeNet and ConvNet, there is a soft-max layer at thevery end of each network to derive the confidence score of each ranking, whichis also part of the network’s output. However, in NiN, there is no such soft-maxlayer. Hence, the output of the NiN network has only the ranking of each candidatewithout their confidence scores.Table 8.2: Data Reuses in DNN AcceleratorsWeightReuseImageReuseOutputReuseZhang et al. [153], Diannao [34],Dadiannao [36]N N NChakradhar et al. [28], Sri-ram et al. [131], Sankaradas etal. [122], nn-X [57], K-Brain [107],Origami [27]Y N NGupta et al. [60], Shidiannao [49],Peemen et al. [109]N N YEyeriss [35] Y Y YFigure 8.1: Architecture of general DNN accelerator1358.3.2 DNN AcceleratorsWe consider nine of DNN accelerators mentioned in Table 8.2. We separate thefaults in the datapaths of the networks from those in the buffers. We study datapathfaults based on the common abstraction of their execution units in Figure 8.1B.Thus, the results for datapath faults apply to all nine accelerators.For buffer faults, since the dataflow (and buffer structure) is different in eachaccelerator, we have to choose a specific design. We chose the Eyeriss acceler-ator for studying buffer faults because: (1) The dataflow of Eyeriss includes allthree data localities in DNNs listed in Table 8.2, which allows us to study the datareuse seen in other DNN accelerators, and (2) the design parameters of Eyerissare publicly available, which allows us to conduct a comprehensive analysis on itsdataflow and overall resilience.8.3.3 Fault ModelWe consider transient, single-event upsets that occur in the data path and buffers,both inside and outside the processing engines of DNN accelerators. We do notconsider faults that occur in combinational logic elements as they are much lesssensitive to soft errors than storage elements shown in recent studies [56, 125]. Wealso do not consider errors in control logic units. This is due to the nature of DNNaccelerators which are designed for offloaded data acceleration - the scheduling ismainly done by the host (i.e., CPU). Finally, because our focus is on DNN accel-erators, we do not consider faults in the CPU, main memory, or the memory/databuses.8.3.4 Fault Injection SimulationSince we do not have access to the RTL implementations of the accelerators, we usea DNN simulator for fault injection. We modified an open-source DNN simulatorframework, Tiny-CNN [142], which accepts Caffe pre-trained weights [23] of anetwork for inferencing and is written in C++. 
We map each line of code in thesimulator to the corresponding hardware component, so that we can pinpoint theimpact of the fault injection location in terms of the underlying microarchitecturalcomponents. We randomly inject faults in the hardware components we consider136by corrupting the values in the corresponding executions in the simulator. Thisfault injection method is in line with other related work [64, 82, 88, 93, 147].8.3.5 Data TypesDifferent data types offer different tradeoffs between energy consumption and per-formance in DNNs. Our goal is to investigate the sensitivity of different design pa-rameters in data types to error propagation. Therefore, we selected a wide range ofdata types that have different design parameters as listed in Table 8.3. We classifythem into two types: floating-point data type (FP) and fixed-point data type (FxP).For FP, we choose 64-bit double, 32-bit float, and 16-bit half-float, all of whichfollow the IEEE 745 floating-point arithmetic standard. We use the terms DOU-BLE, FLOAT, and FLOAT16 respectively for these FP data types in this study. ForFxPs, unfortunately there is no public information about how binary points (radixpoints) are chosen for specific implementations. Therefore, we choose differentbinary points for each FxP. We use the following notations to represent FxPs inthis work: 16b rb10 means a 16-bit integer with 1 bit for the sign, 5 bits for theinteger part, and 10 bits for the mantissa, from the leftmost bit to the rightmost bit.We consider three FxP types, namely 16b rb10, 32b rb10 and 32b rb26. They allimplement 2’s complement for their negative arithmetic. Any value that exceedsthe maximum or minimum dynamic value range will be saturated to the maximumor minimum value respectively.Table 8.3: Data types usedData Type FP or FxP Data Width Bits (From left to right)DOUBLE FP 64-bit 1 sign bit, 11 bits for exponent, 52 bits formantissaFLOAT FP 32-bit 1 sign bit, 8 bits for exponent, 23 bits for man-tissaFLOAT16 FP 16-bit 1 sign bit, 5 bits for exponent, 10 bits for man-tissa32b rb26 FxP 32-bit 1 sign bit, 5 bits for integer, 26 bits for man-tissa32b rb10 FxP 32-bit 1 sign bit, 21 bits for integer, 10 bits for man-tissa16b rb10 FxP 16-bit 1 sign bit, 5 bits for integer, 10 bits for man-tissa1378.3.6 Silent Data Corruption (SDC)We define the SDC probability as the probability of an SDC given that the faultaffects an architecturally visible state of the program (i.e., the fault was activated).This is in line with the definition used in other work [52, 64, 93, 147].In a typical program, an SDC would be a failure outcome in which the ap-plication’s output deviates from the correct (golden) output. This comparison istypically made on a bit-by-bit basis. However, for DNNs, there is often not a sin-gle correct output, but a list of ranked outputs each with a confidence score asdescribed in Section 3.5.1, and hence a bit-by-bit comparison would be mislead-ing. Consequently, we need to define new criteria to determine what constitutes anSDC for a DNN application. We define four kinds of SDCs as follows:• SDC-1: The top ranked element predicted by the DNN is different from thatpredicted by its fault-free execution. 
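To make the fixed-point notation concrete, the following sketch shows one plausible software encoding of the 16b rb10 format (1 sign bit, 5 integer bits, 10 mantissa bits, two's complement with saturation) and the effect of flipping a high-order versus a low-order bit. The exact hardware encoding is not specified by the accelerators we study, so this is an illustrative assumption.

    // Illustrative 16b_rb10 fixed-point encode/decode with saturation.
    #include <algorithm>
    #include <cstdint>
    #include <cstdio>

    int16_t encode_16b_rb10(double x) {
      double scaled = x * 1024.0;                                 // 10 fraction bits
      scaled = std::max(-32768.0, std::min(32767.0, scaled));     // saturate to range
      return static_cast<int16_t>(scaled);
    }

    double decode_16b_rb10(int16_t v) { return v / 1024.0; }

    int main() {
      int16_t v = encode_16b_rb10(1.5);                 // 1.5 encodes as 0x0600
      int16_t low  = v ^ (1 << 0);                      // flip a low-order mantissa bit
      int16_t high = v ^ (1 << 14);                     // flip a high-order integer bit
      printf("original=%f  low-bit flip=%f  high-bit flip=%f\n",
             decode_16b_rb10(v), decode_16b_rb10(low), decode_16b_rb10(high));
      // The high-order flip changes the value by 16.0, while the low-order flip
      // changes it by about 0.001 - consistent with the observation that only
      // high-order bits are likely to cause SDCs.
      return 0;
    }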
This is the most critical SDC becausethe top-ranked element is what is typically used for downstream processing.• SDC-5: The top ranked element is not one of the top five predicted elementsof the fault-free execution of the DNN.• SDC-10%: The confidence score of the top ranked element varies by morethan +/-10% of its fault-free execution.• SDC-20%: The confidence score of the top ranked element varies by morethan +/-20% of its fault-free execution.8.3.7 FIT Rate CalculationThe formula of calculating the FIT rate of a hardware structure is shown in Equa-tion 8.1, where Rraw is the raw FIT rate (estimated as 20.49 FIT/Mb by extrapolat-ing the results of Neale et al. [103]. The original measurement for a 28nm processis 157.62 FIT/MB in the paper. We project this for a 16nm process by applyingthe trend shown in Figure 1 of the Neale paper3). Scomponent is the size of the com-ponent, and SDCcomponent is the SDC probability of each component. We use this3We also adjusted the original measurement by a factor of 0.65 as there is a mistake we found inthe paper. The authors of the paper have acknowledged the mistake in private email communicationswith us.138formula to calculate the FIT rate of datapath components and buffer structures ofDNN accelerators, as well as the overall FIT rate of Eyeriss in Section 8.4.1 andSection 8.4.2.FIT = ∑componentRraw ∗Scomponent ∗SDCcomponent (8.1)Figure 8.2: SDC probability for different data types in different networks (forfaults in PE latches).8.4 Characterization ResultsWe organize the results based on the origins of faults (i.e., datapath faults and bufferfaults) for each parameter. We randomly injected 3,000 faults per latch, one faultfor each execution of the DNN application. The error bars for all the experimentalresults are calculated based on 95% confidence intervals.8.4.1 Datapath FaultsData Types and NetworksFigure 8.2 shows the results of the fault injection experiments on different networksand different data types. We make three observations based on the results.First, SDC probabilities vary across the networks for the same data type. For139example, using the FLOAT data type, the SDC probabilities for NiN are higher thanfor other networks using the same data type (except ConvNet - see reason below).This is because of the different structures of networks in terms of sub-sampling andnormalization layers which provide different levels of error masking - we furtherinvestigate this in Section 8.4.1. Further, ConvNet has the highest SDC propagationprobabilities among all the networks considered (we show the graph for ConvNetseparately as its SDC probabilities are significantly higher than the other networks).This is because the structure of ConvNet is much less deep than for other networks,and consequently there is higher error propagation in ConvNet. For example, thereare only 3 convolutional layers in ConvNet, whereas there are 12 in NiN. Further,ConvNet does not have normalization layers to provide additional error masking,unlike AlexNet and CaffeNet.Second, SDC probabilities vary considerably across data types. For instance,in AlexNet, SDC-1 can be as high as 7.19% using 32b rb10, and as low as 0.38%using FLOAT - the maximum difference is 240x (between SDC-1 in 32b rb10 and32b rb26). The reasons are further explored in Section 8.4.1.Finally, for networks using the ImageNet dataset (all except ConvNet), thereis little difference in the SDC probability for the four different kinds of SDCs fora particular network and data type. 
Recall that there are 1,000 output dimensionsin the ImageNet DNNs. If the top ranked output is changed by the error, the newranking is likely to be outside of the top five elements, and its confidence score willlikely change by more than 20%. However, in ConvNet, which uses the CIFAR-10dataset, the four SDC probabilities are quite different. This is because there areonly 10 output dimensions in ConvNet, and if the top ranked output is affected byerrors, it is still highly likely to stay within the top five elements. As a result, theSDC-5 probability is quite low for this network. On the other hand, the SDC-10% and SDC-20% probabilities are quite high in ConvNet compared to the othernetworks, as the confidence scores are more sensitive due to the small output di-mensions compared to the other networks. Note that since NiN does not provideconfidence scores in its output, we do not show SDC-10% and SDC-20% forNiN.Because there is little difference between the SDC types for three of the fournetworks, we focus on SDC-1 in the rest of the chapter and refer to them as SDCs140unless otherwise specified.Bit PositionWe show the results by plotting the SDC probabilities for each bit in each datatype. Due to space constraints, we only show the results for NiN using FLOAT andFLOAT16 data types for FP, and for CaffeNet using 32b rb26 and 32b rb10 forFxP. We however confirmed that similar observations apply to the rest of networksand data types.The results of NiN using FLOAT and FLOAT16 are shown in Figure 8.3A andFigure 8.3B respectively. For the FP data types, only the high-order exponent bitsare likely to cause SDCs (if corrupted), and not the mantissa and sign bits. Wealso observed that bit-flips that go from 0 to 1 in the high-order exponent bits aremore likely to cause SDCs than those that go from 1 to 0. This is because thecorrect values in each network are typically clustered around 0 (see Section 8.4.1)and hence, small deviations in the magnitude or sign bits do not matter as much.This is also why the per-bit SDC probability for FLOAT16 is lower than that forFLOAT. A corrupted bit in the exponent of the latter is likely to cause a largerdeviation from 0, which in turn is likely to result in an SDC.For FxP data types, we plot the results for CaffeNet using 32b rb26 and 32b rb10in Figure 8.3C and Figure 8.3D respectively. As can be seen, only bits in the in-teger parts of the fixed point data types are vulnerable. Both FxP data types have32-bit data widths but different binary point positions. We observed that the per-bitSDC probability for the data type 32b rb10 is much higher than that of 32b rb26.For example, the 30th bit in 32b rb10 has an SDC probability of 26.65% whereasthe same bit position only has an SDC probability of 0.22% in 32b rb26. Thisis because 32b rb26 has a smaller dynamic range of values compared to 32b rb10,and hence the corrupted value in the former is likely to be closer to 0 than the latter.This is similar to the FP case above.ValueWe chose AlexNet using FLOAT16 to explain how errors that result in SDCs affectprogram values. We randomly sampled the set of ACTS in the network that were141Figure 8.3: SDC probability variation based on bit position corrupted, bit po-sitions not shown have zero SDC probability (Y-axis is SDC probabilityand X-axis is bit position)affected by errors, and compared their values before (in green) and after (in red)error occurrence. We classified the results based on whether the errors led to SDCs(Figure 8.4A) or were benign (Figure 8.4B). 
There are two key observations: (1)If an error causes a large deviation in numeric values, it likely causes an SDC. Forexample, in Figure 8.4A, more than 80% of errors that lead to a large deviation leadto an SDC. (2) In Figure 8.4B, on the other hand, only 2% of errors that cause largedeviations result in benign faults. This is likely because large deviations make itharder for values in the network to converge back to their correct values which aretypically clustered around 0.We now ask the following questions. How close together are the correct (error-free) values in each network, and how much do the erroneous values deviate fromthe correct ones? Answering these questions will enable us to formulate efficienterror detectors. We list boundary values of ACTS profiled in each layer and net-work in Table 8.4. As can be seen, in each network and layer, the values arebounded within a relatively small range in each layer. Further, in the example of142Figure 8.4: Values before and after error occurrence in AlexNet usingFLOAT16AlexNet using FLOAT16, 80% of the erroneous values that lead to SDCs lie outsidethis range, while only 9.67% of the erroneous values that lead to benign outcomesdo so. Similar trends are found in AlexNet, CaffeNet, and NiN using DOUBLE,FLOAT, FLOAT16, and 32b rb10. This is because the data types provide moredynamic value range than the networks need. The redundant value ranges lead tolarger value deviation under faults and are more vulnerable to SDCs. This indi-cates that we can leverage symptom-based detectors to detect SDCs when thesedata types are used (Section 8.5.2). On the other hand, 16b rb10 and 32b rb26suppress the maximum dynamic value ranges. ConvNet is an exception: ConvNethas a limited number of outputs and a small stack of layers, as even a small pertur-bation in values may significantly affect the output rankings.143Table 8.4: Value range for each layer in different networks in the error-freeexecutionNetwork Layer1Layer2Layer3Layer4Layer5Layer6Layer7Layer8Layer9Layer10Layer11Layer12AlexNet -691.8136˜62.505-228.2962˜24.248-89.0519˜8.62-69.2451˜45.674-36.47471˜33.413-78.9784˜3.471-15.0431˜1.881-5.5421˜5.775N/A N/A N/A N/ACaffeNet -869.3496˜08.659-406.8591˜56.569-73.46528˜8.5085-46.32158˜5.3181-43.98781˜55.383-81.11673˜8.9238-14.65361˜0.4386-5.811581˜5.0622N/A N/A N/A N/ANiN -738.1997˜14.962-401.861˜267.8-397.6511˜388.88-1041.768˜75.372-684.9571˜082.81-249.481˜244.37-737.8459˜40.277-459.2925˜84.412-162.3144˜37.883-258.2732˜83.789-124.0011˜40.006-26.48358˜8.1108ConvNet -1.452161˜.38183-2.160611˜.71745-1.618431˜.37389-3.089034˜.94451-9.247911˜1.8078N/A N/A N/A N/A N/A N/A N/ALayer Position and TypeTo investigate how errors in different layers propagate through the network, westudy the error sensitivity of both the positions and types of layers. We show theSDC probability per layer, ordered by the position for AlexNet, CaffeNet, and NiNin Figure 8.5A and for ConvNet in Figure 8.5B.Figure 8.5: SDC probability per layer using FLOAT16, Y-axis is SDC prob-ability and X-axis is layer positionIn Figure 8.5A, in AlexNet and CaffeNet, we observed very low SDC probabil-ities in the first and second layers, compared to the other layers. The reason is theLocal Response Normalization (LRN) layers implemented at the end of the first144and second layers in these networks normalize the faulty values, thus mitigatingthe effect of large deviations in the values. However, there is no LRN or similarnormalization layer in the other layers. 
Layer Position and Type

To investigate how errors in different layers propagate through the network, we study the error sensitivity of both the positions and the types of layers. We show the SDC probability per layer, ordered by position, for AlexNet, CaffeNet, and NiN in Figure 8.5A and for ConvNet in Figure 8.5B.

Figure 8.5: SDC probability per layer using FLOAT16 (Y-axis is SDC probability and X-axis is layer position)

In Figure 8.5A, in AlexNet and CaffeNet, we observed very low SDC probabilities in the first and second layers, compared to the other layers. The reason is that the Local Response Normalization (LRN) layers implemented at the end of the first and second layers in these networks normalize the faulty values, thus mitigating the effect of large deviations in the values. However, there is no LRN or similar normalization layer in the other layers. NiN and ConvNet do not have such a normalization layer, and hence NiN (in Figure 8.5A) and ConvNet (in Figure 8.5B) have a relatively flat SDC probability across all convolutional layers (layers 1 to 12 in NiN and layers 1 to 3 in ConvNet). We also observe an increase in the SDC probabilities in the layers of AlexNet and CaffeNet after the LRNs. This is because the later layers require narrower value ranges (Table 8.4), and hence corruptions that push values into the wider, unused part of the data type's range (and the bits that produce them) are more likely to result in SDCs. Note that the fully-connected layers in AlexNet (layers 6 to 8), CaffeNet (layers 6 to 8), and ConvNet (layers 4 and 5) have higher SDC probabilities. This is because (1) they are able to directly manipulate the ranking of the output candidates, and (2) the ACTS are fully connected, and hence faults spread to all ACTS right away and have a much higher impact on the final outputs. Recall that there is no fully-connected layer in NiN, however, and hence this effect does not occur.

To further illustrate the effect of LRN on mitigating error propagation, we measured the average Euclidean distance between the ACT values in the fault injection runs and the golden runs at the end of each layer, after faults are injected at the first layer in different networks using the DOUBLE data type. The results are shown in Figure 8.6. We choose the DOUBLE data type for this experiment as it accentuates any differences due to its wide value range. As we can see, the Euclidean distance decreases sharply from the first layer to the second layer after LRN in AlexNet and CaffeNet. However, neither NiN nor ConvNet implements LRN or similar layers, and hence the Euclidean distances at each layer are relatively flat.

Figure 8.6: Euclidean distance between the erroneous values and correct values of all ACTS at each layer of networks using DOUBLE (Y-axis is Euclidean distance and X-axis is layer position; faults are injected at layer 1)
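The layer-wise comparison behind Figure 8.6 reduces to computing, for one input, the Euclidean distance between the golden and fault-injected ACT tensors at the end of each layer. A minimal sketch (again assuming per-layer ACT dumps as lists of NumPy arrays) is:

```python
import numpy as np

def layerwise_distance(golden_acts, faulty_acts):
    """Euclidean distance between golden and faulty ACTS at the end of each layer."""
    return [float(np.linalg.norm(f.astype(np.float64) - g.astype(np.float64)))
            for g, f in zip(golden_acts, faulty_acts)]
```

A sharp drop in this distance from one layer to the next, as after the LRN layers in AlexNet and CaffeNet, indicates that the layer absorbs much of the corruption.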
Recall that the POOL and ReLU layers are implemented in all four networks we consider, and are placed at the end of each convolutional and fully-connected layer after the MACs. Recall also that POOL picks the local maximum ACT value and discards the rest of the ACTS before forwarding them to the next layer, while ReLU resets values to zero if the value is negative. Since POOL and ReLU can either discard or completely overwrite values, they can mask errors in different bits. Therefore, we study the bit-wise SDC probability (or error propagation rate) per layer after each POOL or ReLU structure in the convolutional layers in Table 8.5. We measured this rate by comparing the ACT values bit by bit at the end of the last layer. Due to space constraints, we only show the result of AlexNet using FLOAT16, though similar results were observed in the other cases.

Table 8.5: Percentage of bit-wise SDC across layers in AlexNet using FLOAT16 (Error bar is from 0.2% to 0.63%)

Layer 1: 19.38%   Layer 2: 6.20%   Layer 3: 8.28%   Layer 4: 6.08%   Layer 5: 1.63%

There are three main observations in Table 8.5: (1) In general, there is a decreasing propagation probability across layers, as faults that occur in earlier layers have a higher probability of propagating to other layers and spreading. (2) Even though many faults spread into multiple locations and reach the last layer, only a small fraction of them (5.5% on average, compared with AlexNet in Figure 8.5A) will affect the final ranking of the output candidates. The reason, as explained in Section 8.4.1, is that the numerical value of the ACTS affected by faults is a more influential factor affecting the SDC probability than the number of erroneous ACTS. (3) A majority of the faults (84.36% on average) are masked by either POOL or ReLU during the propagation and cannot even reach the last layer. Therefore, error detection techniques that are designed to detect bit-wise mismatches (i.e., DMR) may detect many errors that ultimately get masked.

Datapath FIT rate

We calculate the datapath FIT rates for different networks using each data type based on the canonical model of the datapath in Figure 3.1B, and the formula in Eq. 8.1. Note that the latches assumed in between execution units are the minimum sets of latches needed to implement the units, so our calculations of the datapath FIT rate are conservative. The results are listed in Table 8.6. As seen, the FIT rate varies a lot depending on the network and data type used. For example, it ranges from 0.004 to 0.84 in NiN and ConvNet using 16b rb10, and from 0.002 to 0.42 in AlexNet using 16b rb10 and 32b rb10. Depending on the design of the DNN accelerator and the application, the datapath's FIT rate may exceed the FIT budget allowed for the DNN accelerator and will hence need to be protected. We will further discuss this in Section 8.5.1.

Table 8.6: Datapath FIT rate in each data type and network

            ConvNet   AlexNet   CaffeNet   NiN
FLOAT       1.76      0.02      0.03       0.10
FLOAT16     0.91      0.009     0.009      0.008
32b rb26    1.73      0.002     0.005      0.002
32b rb10    2.45      0.42      0.41       0.54
16b rb10    0.84      0.002     0.007      0.004
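For intuition, the per-component estimate takes the usual raw-rate-times-vulnerability shape that Eq. 8.1 formalizes: the component's raw fault rate (its raw FIT per bit times its number of bits) weighted by the measured probability that a fault in it leads to an SDC. The sketch below uses a made-up raw FIT per bit and a hypothetical component; the real constants, component sizes, and SDC probabilities come from Eq. 8.1, Table 8.7, and our fault injection results.

```python
RAW_FIT_PER_BIT = 1.0e-5   # assumed raw fault rate per storage bit, in FIT (illustrative)

def sdc_fit(num_bits, sdc_probability):
    """SDC FIT contribution of one hardware component: raw FIT weighted by P(SDC)."""
    return RAW_FIT_PER_BIT * num_bits * sdc_probability

# e.g., a hypothetical 1-Mbit structure whose measured per-fault SDC probability is 1%
print(sdc_fit(1_000_000, 0.01))
```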
8.4.2 Buffer Faults: A Case Study on Eyeriss

Eyeriss [35] is a recently proposed DNN accelerator whose component parameters are publicly available. We use Eyeriss in this case study; the reason is articulated in Section 8.3.2. Other than the components shown in Figure 3.1, Eyeriss implements a Filter SRAM, an Img REG and a PSum REG on each PE for the data reuses listed in Table 3.1. The details of the dataflow are described in Chen et al. [35]. We adopted the original parameters of the Eyeriss microarchitecture at 65nm and projected it to 16nm technology. For the purpose of this study, we simply scale the components of Eyeriss in proportion to process technology generation improvements, ignoring other architectural issues. In Table 8.7, we list the microarchitectural parameters of Eyeriss at the original 65nm and the corresponding projections at 16nm. We assume a scaling factor of 2 for each technology generation, and as there are 4 generations between 65nm and 16nm technology (based on published values by the TSMC foundry), we scaled up the number of PEs and the sizes of buffers by a factor of 8. In the rest of this work, we use the parameters of Eyeriss at 16nm.

Table 8.7: Parameters of microarchitectures in Eyeriss (assuming 16-bit data width, and a scaling factor of 2 for each technology generation)

Feature Size   No. of PEs   Size of Global Buffer   Size of One Filter SRAM   Size of One Img REG   Size of One PSum REG
65nm           168          98KB                    0.344KB                   0.02KB                0.05KB
16nm           1,344        784KB                   3.52KB                    0.19KB                0.38KB

Data Reuse and Buffer

Here we measure and compare the SDC probabilities of different buffers in Eyeriss by randomly injecting 3,000 faults in each buffer component for different network parameters. By analyzing the results, we found that faults in buffers exhibit similar trends as datapath faults for each parameter of data type, network, value, and layer, though the absolute SDC probabilities and FIT rates are different (usually higher than for datapath faults due to the amount of reuse and the size of the components). Hence, we do not repeat these sensitivity results in this section. Rather, we present the SDC probabilities and FIT rates for each buffer of the different networks in Table 8.8. The calculation of the FIT rate is based on Eq. 8.1. We show the results using the 16b rb10 data type as a 16-bit FxP data type is implemented in Eyeriss.

Table 8.8: SDC probability and FIT rate for each buffer component in Eyeriss (SDC probability / FIT rate)

Network    Global Buffer     Filter SRAM      Img REG         PSum REG
ConvNet    69.70% / 87.47    66.37% / 62.74   70.90% / 3.57   27.98% / 2.82
AlexNet    0.16% / 0.20      3.17% / 3.00     0.00% / 0.00    0.06% / 0.006
CaffeNet   0.07% / 0.09      2.87% / 2.71     0.00% / 0.00    0.17% / 0.02
NiN        0.03% / 0.04      4.13% / 3.90     0.00% / 0.00    0.00% / 0.00

As seen, as the network becomes deeper, its buffers become much more immune to faults. For example, ConvNet is less deep than the other three networks, and the FIT rates of its Global Buffer and Filter SRAM are respectively 87.47 and 62.74, compared to 0.2 and 3.0 for AlexNet. This sensitivity is consistent with datapath faults. Another observation is that the Img REG and PSum REG have relatively low FIT rates, as they both have smaller component sizes and a short time window for reuse: a faulty value in the Img REG will only affect a single row of the fmap, and a faulty value in the PSum REG will only affect the next accumulation operation. Finally, we find that buffer FIT rates are usually a few orders of magnitude higher than datapath FIT rates, and adding these buffers for reuse dramatically increases the overall FIT rate of DNN accelerators. The reasons are twofold: (1) buffers by nature have larger sizes than the total number of latches in the datapath, and (2) due to reuse, the same fault can be read multiple times and lead to the spreading of errors to multiple locations in a short time, resulting in more SDCs. Both lead to a higher FIT rate (see Eq. 8.1). We will further discuss the implications in Section 8.5.1.
8.5 Mitigation of Error Propagation

We explore three directions to mitigate error propagation in DNN systems based on the results in the previous section. First, we discuss the reliability implications of designing DNN systems. Second, we adapt a previously proposed software-based technique, Symptom-based Error Detectors (SED), for detecting errors in DNN-based systems. Finally, we use a recently proposed hardware technique, Selective Latch Hardening (SLH), to detect and correct datapath faults. Both techniques leverage the observations made in the previous section and are optimized for DNNs.

8.5.1 Implications to Resilient DNN Systems

(1) Data Type: Based on our analysis in Section 8.4.1, we conclude that DNNs should use data types that provide just enough numeric value range and precision to operate the target DNNs. For example, in Table 8.6, we found that the FIT rate of the datapath can be reduced by more than two orders of magnitude if we replace type 32b rb10 with type 32b rb26 (from 0.42 to 0.002). However, existing DNN accelerators tend to follow a one-size-fits-all approach by deploying a data type representation that is long enough to work for all computations in different layers and DNNs [34, 35, 49]. Therefore, it is not always possible to eliminate the redundant value ranges that the data type provides across layers and DNNs. The redundant value ranges are particularly vulnerable to SDCs as they tend to cause much larger value deviations (Section 8.4.1). We propose a low-cost symptom-based error detector that detects errors caused by the redundant value ranges regardless of whether conservative data types are used. A recent study has proposed a reduced-precision protocol which stores data in shorter representations in memory and unfolds them in the datapath to save energy [77]. The approach requires hardware modifications and may not always be supported in accelerators. We defer the reliability evaluation of the proposed protocol to our future work.

(2) Sensitive Bits: Once a restricted data type is used for a network, the dynamic value range is suppressed, mitigating SDCs caused by out-of-range values. However, the remaining SDCs can be harder to detect as erroneous values hide in the normal value ranges of the network. Fortunately, we observe that the remaining SDCs are also caused by bit-flips at certain bit positions (i.e., high bit positions in FxP) (Section 8.4.1). Hence, we can selectively harden these bits to mitigate the SDCs (Section 8.5.3).

(3) Normalization Layers: The purpose of normalization layers such as the LRN layer is to increase the accuracy of the network [81]. In Section 8.4.1, we found that LRN also increases the resilience of the network as it normalizes a faulty value with its adjacent fault-free values across different fmaps, mitigating SDCs. Therefore, one should use such layers if possible. Further, one should place error detectors after such layers to leverage the masking opportunities, thus avoiding detecting benign faults.

(4) Data Reuse: Recently, multiple dataflows and architectures have been proposed and demonstrated to provide both energy and performance improvements [34, 35]. However, adding the local buffers that implement these more sophisticated dataflows dramatically increases the FIT rate of a DNN accelerator. For example, the FIT rate of the Filter SRAMs (3.9 in NiN, Table 8.8) can be nearly 1000x higher than the FIT rate of the entire datapath (0.004 in NiN, Table 8.6). Unfortunately, protecting small buffers through ECC may incur very high overheads due to the smaller read granularities. We propose an alternative approach, SED, to protect these buffers at low cost (Section 8.5.2).

(5) Datapath Protection: From Table 8.6, we found that the datapath FIT rate alone can go up to 2.45 without careful design, or 0.84 even with a resilient data type (16b rb10, in ConvNet). Safety standards such as ISO 26262 mandate that the overall FIT rate of the SoC carrying the DNN accelerator be less than 10 FIT. Since a DNN accelerator usually occupies only a small fraction of the total SoC area, the FIT budget allocated to the DNN accelerator should be only a tiny fraction of 10. Hence, datapath faults cannot be ignored as they stretch the FIT budget allocated to the DNN accelerator.
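As a back-of-the-envelope illustration of why this matters: the 5% area share below is a hypothetical figure, and ISO 26262 does not itself prescribe how the SoC budget is split across components; allocating it in proportion to area is just one common heuristic.

```python
SOC_FIT_BUDGET = 10.0            # overall budget mandated for the SoC
accel_area_fraction = 0.05       # assumed: the DNN accelerator occupies 5% of the SoC area
accel_fit_budget = SOC_FIT_BUDGET * accel_area_fraction   # 0.5 FIT

datapath_fit = 0.84              # ConvNet with 16b rb10, from Table 8.6
print(datapath_fit > accel_fit_budget)   # True: the unprotected datapath alone exceeds the budget
```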
8.5.2 Symptom-based Error Detectors (SED)

A symptom-based detector is an error detector that leverages application-specific symptoms under faults to detect anomalies. Examples of symptoms are unusual values of variables [64, 108], numbers of loop iterations [64, 88], or address spaces accessed by the program [88, 108]. For the proposed detector, we use the value ranges of the ACTS as the symptom to detect SDC-causing faults. This is based on the observation from Section 8.4.1: if an error makes the magnitude of the ACTS very large, it likely leads to an SDC, and if it does not, it is likely to be benign. In the design of the error detector, there are two questions that need to be answered: where (which program locations) and what (which reference value ranges) to check? The proposed detector consists of two phases; we describe each phase along with the answers to the two questions below.

Learning: Before deploying the detector, a DNN application needs to be instrumented and executed with its representative test inputs to derive the value ranges in each layer during the fault-free execution. We can use these value ranges, say -X to Y, as the bounds for the detector. However, to be safe, we apply an additional 10% cushion on top of the value ranges of each layer, that is, (-1.1*X) to (1.1*Y), as the reference values for each detector to reduce false alarms. Note that the learning phase only needs to be performed once, before deployment.

Deployment: Once the detectors are derived, they are checked by the host which off-loads tasks to the DNN accelerator. At the end of each layer, the data of the fmaps of the current layer are calculated and transferred from the PE array to the global buffer as the input data of the ifmaps for the next layer. These data stay in the global buffer during the entire execution of the next layer for reuse. This gives us an opportunity to execute the detector asynchronously from the host, and check the values in the global buffer to detect errors. We perform the detection asynchronously to keep the runtime overheads as low as possible.
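A minimal sketch of the resulting check is shown below. The bounds dictionary would be filled in by the learning phase (the values here are placeholders), and in the deployed system the equivalent check runs on the host over the fmap data resident in the global buffer.

```python
import numpy as np

LAYER_BOUNDS = {1: (-700.0, 700.0), 2: (-230.0, 230.0)}   # placeholder values from learning

def check_layer(layer_id, fmap, cushion=0.10):
    """Flag a symptom if any ACT of this layer falls outside its learned range (+10% cushion)."""
    lo, hi = LAYER_BOUNDS[layer_id]
    lo, hi = (1.0 + cushion) * lo, (1.0 + cushion) * hi
    return bool(np.any(fmap < lo) or np.any(fmap > hi))   # True = likely SDC-causing fault
```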
Figure 8.7: Precision and recall of the symptom-based detectors across networks (Error bar is from 0.03% to 0.04%)

Evaluation: After deploying the detector, we measure its coverage by randomly injecting 3,000 single bit-flip faults in each hardware component, using all 3 FP data types and 32b rb10 in AlexNet, CaffeNet and NiN, with one fault per fault injection run. As we explained earlier, we do not include the 16b rb10 and 32b rb26 data types or the ConvNet network as they do not exhibit strong symptoms, and hence symptom-based detectors are unlikely to provide high coverage in these cases. So there are a total of 3,000*5*4*3 = 180,000 faults injected for this experiment.

We define two metrics in our evaluation: (1) Precision: 1 - (the number of benign faults that are detected by the detector as SDCs) / (the number of faults injected), and (2) Recall: (the number of SDC-causing faults that are detected by the detector) / (the total number of SDC-causing faults). The results are shown in Figure 8.7A and Figure 8.7B for the precision and the recall respectively, averaged across the data types and components (due to space constraints, we only show the average values). As can be seen, the average precision is 90.21% and the average recall is 92.5%. Thus, we can reduce the FIT rates of Eyeriss using FLOAT and FLOAT16 by 96% (from 8.55 to 0.35) and 70% (from 2.63 to 0.79) respectively using the symptom-based detector technique (based on Eq. 8.1).

8.5.3 Selective Latch Hardening (SLH)

Latch hardening is a hardware error mitigation technique that adds redundant circuitry to sequential storage elements (i.e., latches) to make them less sensitive to errors. Protecting the latches in the datapath can be vital for highly dependable systems, as the latches become the reliability bottleneck once all buffers are protected (e.g., by ECC). Given that the technology keeps scaling to smaller feature sizes [20, 40], it will become even more important to mitigate datapath faults in the future. There have been a number of different hardened latch designs that differ in their overheads and levels of protection, and latch hardening need not be applied in an all-or-nothing manner. For example, Sullivan et al. [135] developed an analytical model for hardened latch design space exploration and demonstrated cost-effective protection by hardening only the most sensitive latches and by combining hardening techniques offering differing strengths in an error-sensitivity proportional manner. Since we observed and characterized asymmetric SDC sensitivity in different bits in Section 8.4.1, we can leverage this model to selectively harden each latch using the most efficient hardening technique to achieve sufficient error coverage at a low cost.

Design Space Exploration: There is a wide variety of latch hardening techniques that vary in their level of protection and overheads. Table 8.9 shows the three hardening techniques used by [135]; these same techniques are considered in this chapter, though the methodology should apply with any available hardened latches. The baseline design in the table refers to an unprotected latch.

Table 8.9: Hardened latches used in design space exploration

Latch Type                 Area Overhead   FIT Rate Reduction
Baseline                   1x              1x
Strike Suppression (RCC)   1.15x           6.3x
Redundant Node (SEUT)      2x              37x
Triplicated (TMR)          3.5x            1,000,000x

Evaluation: Figure 8.8A shows the AlexNet FIT rate reduction versus the fraction of protected latches, assuming a perfect hardening technique that completely eliminates errors. We plot this curve to illustrate the maximum benefit one can achieve by protecting the bits that are most sensitive to SDCs first. β characterizes the asymmetry of the SDC FIT rate across different bits (latches): a high β indicates that a small number of latches dictate the overall SDC probability. As can be seen, the curves in the figure are negative exponentials, so we can selectively protect only the most sensitive latches for area-efficient SDC reduction.

Figure 8.8: Selective Latch Hardening for the Eyeriss accelerator running AlexNet

Figure 8.8B and Figure 8.8C show the AlexNet FIT rate reduction versus the latch area overhead when using each hardened latch, and the optimal combination of the hardened designs (Multi) generated by the model. Due to space constraints, we only show the AlexNet results for the FLOAT16 and 16b rb10 data types (the other networks and data types exhibit similar trends). By exploiting the asymmetry of error sensitivity in data type bits and combining complementary hardening techniques, one can achieve significant protection while paying only modest area costs. For example, applying the three hardening techniques together can reduce the latch FIT rate by 100x, while incurring only about 20% and 25% latch area overheads for FLOAT16 and 16b rb10, respectively. This translates to a chip-level area overhead roughly akin to that required for ECC protection of the larger (and more easily protected) SRAM structures.
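The following greedy sketch conveys the flavor of this trade-off, though it is a simplification and not the analytical model of Sullivan et al. [135]: given an estimate of each latch's contribution to the SDC FIT rate and one hardening option (extra area per latch and FIT reduction factor, as in Table 8.9), it hardens the most sensitive latches first until an area budget is exhausted.

```python
def selective_hardening(latch_fit, extra_area, fit_reduction, area_budget):
    """Greedy selective hardening: protect the most SDC-sensitive latches first.

    latch_fit: {latch_id: SDC FIT contribution}; extra_area: added area per hardened latch
    (e.g., 0.15 for RCC); fit_reduction: FIT reduction factor (e.g., 6.3 for RCC).
    Returns (set of hardened latches, residual FIT, area spent).
    """
    hardened, residual, spent = set(), sum(latch_fit.values()), 0.0
    for latch, fit in sorted(latch_fit.items(), key=lambda kv: kv[1], reverse=True):
        if spent + extra_area > area_budget:
            break
        hardened.add(latch)
        residual -= fit * (1.0 - 1.0 / fit_reduction)
        spent += extra_area
    return hardened, residual, spent
```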
8.6 Summary

DNNs (Deep Neural Networks) have gained prominence in recent years, and they are being deployed on hardware accelerators in self-driving cars for real-time image classification. In this chapter, we characterize the impact of soft errors on DNN systems through a large-scale fault injection experiment with 4 popular DNNs running on a recently proposed DNN hardware accelerator. We find that the resilience of a DNN system depends on the data types, values, data reuse, and the types of layers in the design. Based on these insights, we formulate guidelines for designing resilient DNN systems and propose two efficient DNN protection techniques to mitigate soft errors. We find that the techniques significantly reduce the rate of Silent Data Corruption (SDC) in DNN systems with acceptable performance and area overheads.

Chapter 9

Conclusion

In this chapter, we first summarize the dissertation and describe its expected impact. We then briefly delineate possible directions for future work.

9.1 Summary

We had two main goals in this dissertation. First, we wanted to understand how errors propagate in programs and lead to different types of failures. This understanding helped us come up with techniques that can evaluate programs' resilience and guide the protection of programs, which was our second goal. We considered both general programs (i.e., CPU programs) and those executing on hardware accelerators (i.e., GPUs and DNN accelerators). We applied both empirical and analytical approaches to achieve our goals. The dissertation had three parts, as follows.

We first targeted an important but often neglected type of failure in CPU programs: LLCs. We conducted an empirical study in Chapter 4 to investigate the error propagation that leads to LLCs, and then characterized the code patterns that propagate the faults. We found that it was possible to identify these code patterns through program analyses, and to protect the code and eliminate LLCs in programs. Based on the above observation, we proposed a heuristic-based technique that was able to identify the program locations responsible for more than 90% of LLCs in programs. The proposed technique pruned the fault injection space by more than 9 orders of magnitude compared with an exhaustive fault injection approach.

Secondly, we targeted SDCs, which are the most insidious type of failure and are challenging to identify. We started our investigation with CPU programs as they are the most common applications. In Chapter 5, we explored an analytical approach to model the error propagation that leads to SDCs in programs. Our proposed model is able to accurately estimate the SDC probabilities of both programs and individual instructions without any fault injections. The model can also be used to guide selective protection in a given program in a fast and accurate manner. In Chapter 6, we discussed how error propagation can be affected by multiple program inputs, and extended the analytical model to support multiple inputs. We showed that the extended analytical model can be used to bound the SDC probability of a given program with multiple inputs without performing extensive fault injections.

Finally, we investigated error propagation in the applications that run on hardware accelerators such as GPUs and DNN accelerators (Chapters 7 and 8). Because these accelerators and applications have different architectures and programming models, we observed different error propagation patterns compared with CPU programs. We first built the tools that can be used to inject faults into the applications, and then characterized their unique error propagation patterns.
Based on these observations, we proposed error mitigation techniques that were specifically targeted to the accelerators and the applications running on them, and hence more cost-effective than generic techniques.

9.2 Expected Impact

The first impact of this dissertation is to provide insights into identifying the vulnerable program locations that lead to different types of failures, so that application developers can evaluate programs' resilience and mitigate errors. Traditionally, vulnerable program locations were identified through extensive fault injection simulations, which are extremely time-consuming. Because of this, developers were loath to integrate resilience techniques into the software development process. We demonstrated (Chapter 4) that the code that leads to certain types of faults, such as LLCs, mostly falls into certain code patterns, which can be identified by program analysis techniques. This insight became the driving force for the rest of the thesis. It inspired us to investigate program-level characterization of error propagation that leads to different types of faults on different platforms, which in turn allowed us to build automated techniques to identify the vulnerable parts of the program for protection based on different reliability targets.

Furthermore, our research in Chapters 5 and 6 demonstrated that a systematic characterization of error propagation also enables us to build an analytical model to track error propagation in programs. The analytical model not only identifies the vulnerable parts that propagate errors, but also quantitatively analyzes their propagation probabilities. Using TRIDENT, we showed that it is even possible to completely get rid of fault injection and hence significantly shorten the time taken by the whole evaluation process from a few days to a few minutes. The high speed of the model and its ability to quantify propagation imply that the technique may be integrated into compiler toolchains for developers to fine-tune programs' resilience online.

Beyond their practicality, the analytical models, TRIDENT and VTRIDENT, also revealed the detailed steps of error propagation. Traditionally, researchers rely on FI approaches to study error propagation. In FI, faults are repeatedly injected during the executions of programs, and the users wait until the end of the executions in order to observe failures, if any. This is a black-box technique, because it does not provide much insight into what happens during the propagation of the faults that are injected. Hence, for decades, researchers have not really had a solid understanding of error propagation, not to mention how to design highly cost-effective error detectors. In contrast, the analytical aspects of the models allow researchers to understand the details of error propagation and obtain detailed insights. Therefore, we believe that in the future, more efficient and cost-effective error detection and mitigation techniques can be developed based on the knowledge of error propagation characteristics.

Last but not least, this dissertation places the reliability problem of accelerator applications in the limelight of systems research. In the past, researchers focused on the performance aspects of the accelerators because they were initially designed for performance. However, with the rising deployment of accelerators in safety-critical applications, the reliability of accelerators has started playing a much more important role.
As discussed in Chapter 7 and Chapter 8, hardware accelerators and their applications can be vulnerable to hardware errors without protection. Our work has focused on building handy tools for studying error propagation in the accelerators and applications, and demonstrated that error propagation in these systems follows very different patterns. We show it is possible to mitigate error propagation in accelerators based on their unique characteristics in a cost-effective manner. We expect other researchers will leverage our tools to conduct reliability studies, and design more reliable accelerators and applications in the future.

9.3 Future Work

We propose four potential future work directions as follows.

Direction 1: Modeling Error Propagation in GPU Programs

In this thesis, CRASHFINDER, TRIDENT and VTRIDENT in Chapters 4, 5 and 6 focused on error propagation that leads to LLCs and SDCs in CPU programs. As mentioned in Chapter 7, the increasing error rate and rising popularity of GPUs accelerate the demand for developing fault-tolerant GPU programs. Since GPU programs typically contain hundreds of thousands of threads, they have a much larger space of fault injection sites compared with CPU programs with similar lines of code. Hence, the resilience evaluation of GPU programs can be much more time-consuming in the development of fault-tolerant GPU applications [50, 104]. One way to speed up the evaluation process is to extend the models in this thesis to support GPU programs. LLFI-GPU, introduced in Chapter 7, can be used to obtain further program-level insights into error propagation in GPU programs in order to achieve this goal. Such future techniques can be used to design cost-effective error detectors for GPU programs and be integrated into GPU compiler toolchains to enable online tuning of GPU programs' resilience in the software development process.

Direction 2: Pruning Profiling Space

Another direction to extend our analytical models, TRIDENT and VTRIDENT, is through pruning the profiling space in programs. As discussed in Chapter 5, the performance bottleneck in TRIDENT is the profiling phase, which records a large amount of information about different program states. Studies [65, 104, 124] have shown that it is possible to select only a small subset of representative states to project the overall error resilience of a program. Hence, one direction to pursue is to find the subset of representative states to profile when constructing the model, in order to speed up the model.

Direction 3: Other Fault Models

This dissertation mainly focused on faults that occur in the datapath of processors, which are among the most challenging types of faults to detect and mitigate. Faults originating in memory are another major source. Current protection techniques leverage Error Correction Code (ECC) to mitigate memory faults. Since ECC memory incurs non-negligible overheads in area, performance and energy consumption, it is mainly deployed in high-end systems such as supercomputers and aerospace applications. However, with increasing error rates, it becomes necessary to protect systems from memory faults in a tuneable and cost-effective way in future commodity systems.
Therefore, selective protection at the levels of instructions, processes, and applications needs to be developed, which requires one to understand the error propagation caused by memory faults; this is an interesting future research direction.

Direction 4: Secure-Enough Software Systems

One of the main advantages of software error mitigation techniques is to provide flexible and selective protection in programs. Through selective protection, one can provide "reliable-enough" computation for commodity systems with high error coverage and low overheads, as demonstrated in this thesis. In contrast, today's security techniques are mostly designed to be either on or off, which may incur huge runtime overheads. We believe the idea of reliable-enough computation can also be extended to the security domain. For example, in one of the prevalent security attacks, the row-hammer attack, malicious users can leverage hardware deficiencies of memory modules to flip bits in the logic values that are stored in memory, and manipulate the normal computation of programs [101]. These bit-flip errors, while similar to the soft errors that we discussed in this dissertation, may have a different impact on program executions and their consequences. Characterizing and understanding the propagation of the bit-flip errors introduced by row-hammer attacks in programs can be an interesting direction.

Long-Term Direction: Formal Methods

In this thesis, our proposed techniques guiding selective protection are based on empirical observations. As a result, they do not provide any guarantees on the error coverage and performance overhead of the protections. One way to provide such guarantees is to build a mathematically rigorous model of fault-tolerant applications, so that developers can not only verify the property of error resilience in a more formal way, but also use mathematical proof as a complement to test the efficiency of the protections and ensure their correct behavior.

Bibliography

[1] Parallel circuit solver. [Online; accessed Apr. 2016]. → page 115
[2] NVIDIA CUDA-GDB Documentation. [Online; accessed Apr. 2016]. → pages 18, 108
[3] NVIDIA Multi-GPU Programming. [Online; accessed Apr. 2016]. → page 115
[4] NVIDIA SDK Samples. [Online; accessed Apr. 2016]. → page 108
[5] Enabling on-the-fly manipulations with LLVM IR code of CUDA sources. https://github.com/apc-llc/nvcc-llvm-ir. [Online; accessed Apr. 2016]. → page 107
[6] IEEE standard for floating-point arithmetic. https://standards.ieee.org/findstds/standard/754-2008.html, 2008. IEEE Std 754-2008. → page 63
[7] J. Aidemark, J. Vinter, P. Folkesson, and J. Karlsson. Goofi: Generic object-oriented fault injection tool. In Dependable Systems and Networks, 2001. DSN 2001. International Conference on, pages 83–88. IEEE, 2001. → page 17
[8] H. M. Aktulga, J. C. Fogarty, S. A. Pandit, and A. Y. Grama. Parallel reactive molecular dynamics: Numerical methods and algorithmic techniques. Parallel Computing, 38(4):245–259, 2012. → page 64
[9] Alippi, Cesare, Vincenzo Piuri, and Mariagiovanna Sami. Sensitivity to errors in artificial neural networks: A behavioral approach. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 1995. → page 13
[10] Amazon. Amazon s3 availability event: July 20, 2008. 2008. URL https://status.aws.amazon.com/s3-20080720.html. → page 1
[11] R. Ashraf, R. Gioiosa, G. Kestor, R. DeMara, C.-Y. Cher, and P. Bose. Understanding the propagation of transient errors in HPC applications.
InInternational Conference for High Performance Computing, Networking,Storage and Analysis(SC). IEEE, 2015. → pages104, 105, 107, 115, 119, 120[12] R. A. Ashraf, R. Gioiosa, G. Kestor, R. F. DeMara, C.-Y. Cher, and P. Bose.Understanding the propagation of transient errors in hpc applications. InProceedings of the International Conference for High PerformanceComputing, Networking, Storage and Analysis, page 72. ACM, 2015. →pages 17, 65[13] Autonomous car facts. Keynote: Autonomous Car A New Driver forResilient Computing and Design-for-Test, 2016. URL https://nepp.nasa.gov/workshops/etw2016/talks/15WED/20160615-0930-AutonomousSaxena-Nirmal-Saxena-Rec2016Jun16-nasaNEPP.pdf. → page 22[14] F. Ayatolahi, B. Sangchoolie, R. Johansson, and J. Karlsson. A study of theimpact of single bit-flip and double bit-flip errors on program execution. InComputer Safety, Reliability, and Security, pages 265–276. Springer, 2013.→ page 117[15] C. Basile, L. Wang, Z. Kalbarczyk, and R. Iyer. Group communicationprotocols under errors. In 22nd International Symposium on ReliableDistributed Systems., pages 35–44, Oct 2003. → page 24[16] E. Battenberg and D. Wessel. Accelerating nonnegative matrixfactorization for audio source separation on multi-core and many-corearchitectures. In 10th International Society for Music Information RetrievalConference (ISMIR 2009), 2009. → page 115[17] Bettola, Simone, and Vincenzo Piuri. High performance fault-tolerantdigital neural networks. IEEE transactions on computers, 1998. → page 13[18] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The parsec benchmark suite:Characterization and architectural implications. In Proceedings of the 17thinternational conference on Parallel architectures and compilationtechniques, pages 72–81. ACM, 2008. → pages 25, 27, 38, 64, 83[19] Bojarski, Mariusz, Davide Del Testa, Daniel Dworakowski, BernhardFirner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel and others. End toend learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.→ page 129163[20] S. Borkar. Designing reliable systems from unreliable components: thechallenges of transistor variability and degradation. Micro, IEEE, 25(6):10–16, 2005. → pages 1, 104, 130, 153[21] S. Borkar. Electronics beyond nano-scale cmos. In Proceedings of the 43rdannual Design Automation Conference, pages 807–808. ACM, 2006. →page 1[22] BVCL. BERKELEY VISION AND LEARNING CENTER, 2014. URLhttp://bvlc.eecs.berkeley.edu. → page 134[23] Caffe Model. Caffe Model Zoo, 2014. URLhttp://caffe.berkeleyvision.org/model zoo.html. → page 136[24] CaffeNet. CaffeNet, 2014. URLhttp://caffe.berkeleyvision.org/model zoo.html. → page 134[25] J. Calhoun, L. Olson, and M. Snir. Flipit: An LLVM based fault injectorfor HPC. In Euro-Par: Parallel Processing Workshops, pages 547–558.Springer, 2014. → page 105[26] J. Carreira, H. Madeira, and J. G. Silva. Xception: A technique for theexperimental evaluation of dependability in modern computers. IEEETransactions on Software Engineering, 24(2):125–136, 1998. → page 17[27] Cavigelli, Lukas, David Gschwend, Christoph Mayer, Samuel Willi, BeatMuheim, and Luca Benini. Origami: A convolutional network accelerator.In In Proceedings of the 25th edition on Great Lakes Symposium on VLSI,2015. → pages 22, 135[28] Chakradhar, Srimat, Murugan Sankaradas, Venkata Jakkula, and SrihariCadambi. A dynamically configurable coprocessor for convolutional neuralnetworks. In ACM SIGARCH Computer Architecture News, 2010. → pages22, 135[29] S. Chandra and P. M. Chen. How fail-stop are faulty programs? 
InFault-Tolerant Computing. Digest of Papers. Twenty-Eighth AnnualInternational Symposium on, pages 240–249. IEEE, 1998. → page 9[30] S. Chandra and P. M. Chen. The impact of recovery mechanisms on thelikelihood of saving corrupted state. In Proceedings of the 13thInternational Symposium on Software Reliability Engineering, ISSRE.,pages 91–101. IEEE, 2002. → page 46164[31] C.-K. Chang, S. Lym, N. Kelly, M. B. Sullivan, and M. Erez. Evaluatingand accelerating high-fidelity error injection for hpc. In Evaluating andAccelerating High-Fidelity Error Injection for HPC, page 0. IEEE, 2018.→ page 27[32] Chatterjee, Avhishek, and Lav R. Varshney. Energy-reliability limits innanoscale neural networks. In The 51st Annual Conference on InformationSciences and Systems (CISS), pages 1–6, 2017. → page 14[33] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, andK. Skadron. Rodinia: A benchmark suite for heterogeneous computing. InInternational Symposium on Workload Characterization (IISWC 2009),pages 44–54. IEEE, 2009. → pages 52, 57, 64, 83, 114[34] Chen, Tianshi, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, YunjiChen, and Olivier Temam. Diannao: A small-footprint high-throughputaccelerator for ubiquitous machine-learning. In ACM Sigplan Notices,2014. → pages 14, 20, 22, 130, 135, 149, 150[35] Chen, Yu-Hsin, Joel Emer, and Vivienne Sze. Eyeriss: A spatialarchitecture for energy-efficient dataflow for convolutional neuralnetworks. In Proceedings of the International Symposium on ComputerArchitecture (ISCA), pages 367–379, 2016. → pages14, 20, 21, 22, 131, 134, 135, 147, 149, 150[36] Chen, Yunji, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang,Ling Li and others. Dadiannao: A machine-learning supercomputer. InProceedings of the International Symposium on Microarchitecture(MICRO), 2014. → pages 22, 130, 135[37] C.-Y. Cher, M. S. Gupta, P. Bose, and K. P. Muller. Understanding softerror resiliency of Blue Gene/Q compute chip through hardware protonirradiation and software fault injection. In Proceedings of the 2014International Conference for High Performance Computing, Networking,Storage and Analysis (SC). IEEE, November 2014.doi:10.1109/SC.2014.53. → page 117[38] H. Cho, S. Mirkhani, C.-Y. Cher, J. A. Abraham, and S. Mitra. Quantitativeevaluation of soft error injection techniques for robust system design. InACM/EDAC/IEEE Design Automation Conference (DAC), pages 1–10.IEEE, 2013. → pages 107, 117165[39] CIFAR dataset. CIFAR-10, 2014. URLhttps://www.cs.toronto.edu/∼kriz/cifar.html. → page 134[40] C. Constantinescu. Intermittent faults and effects on reliability ofintegrated circuits. In Reliability and Maintainability Symposium, pages370–374. IEEE, 2008. → pages 1, 104, 130, 153[41] ConvNet. High-performance C++/CUDA implementation of convolutionalneural networks, 2014. URL https://code.google.com/p/cuda-convnet. →page 134[42] J. J. Cook and C. Zilles. A characterization of instruction-level errorderating and its implications for error detection. In InternationalConference on Dependable Systems and Networks(DSN), pages 482–491.IEEE, 2008. → pages 9, 15, 75, 101[43] E. W. Czeck and D. P. Siewiorek. Observations on the effects of faultmanifestation as a function of workload. IEEE Transactions on Computers,41(5):559–566, 1992. → pages 10, 79, 81, 82[44] Dahl, George E., Dong Yu, Li Deng, and Alex Acero. Context-dependentpre-trained deep neural networks for large-vocabulary speech recognition.IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30–42, 2012. 
→ page 129[45] S. Di and F. Cappello. Adaptive impact-driven detection of silent datacorruption for hpc applications. IEEE Transactions on Parallel andDistributed Systems, 27(10):2809–2823, 2016. → page 127[46] D. Di Leo, F. Ayatolahi, B. Sangchoolie, J. Karlsson, and R. Johansson. Onthe impact of hardware faults–an investigation of the relationship betweenworkload inputs and failure mode distributions. Computer Safety,Reliability, and Security, pages 198–209, 2012. → pages 11, 76, 79[47] C. Di Martino, W. Kramer, Z. Kalbarczyk, and R. Iyer. Measuring andunderstanding extreme-scale application resilience: A field study of5,000,000 HPC application runs. In IEEE/IFIP International Conferenceon Dependable Systems and Networks (DSN), pages 25–36. IEEE, 2015.→ page 104[48] M. Dimitrov, M. Mantor, and H. Zhou. Understanding software approachesfor GPGPU reliability. In Workshop on General Purpose Processing onGraphics Processing Units, pages 94–104. ACM, 2009. → page 13166[49] Du, Zidong, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, TaoLuo, Xiaobing Feng, Yunji Chen, and Olivier Temam. Shidiannao: shiftingvision processing closer to the sensor. In ACM SIGARCH ComputerArchitecture News, 2015. → pages 22, 135, 149[50] B. Fang, K. Pattabiraman, M. Ripeanu, and S. Gurumurthi. Gpu-qin: Amethodology for evaluating the error resilience of GPGPU applications. InInternational Symposium on Performance Analysis of Systems andSoftware (ISPASS), pages 221–230. IEEE, 2014. → pages12, 104, 105, 108, 113, 115, 159[51] B. Fang, Q. Lu, K. Pattabiraman, M. Ripeanu, and S. Gurumurthi. ePVF:An enhanced program vulnerability factor methodology for cross-layerresilience analysis. In Proceedings of the International Conference onDependable Systems and Networks (DSN), pages 168–179. IEEE, 2016. →pages 4, 9, 10, 17, 55, 58, 65, 69, 72, 73, 74, 75, 76, 77[52] S. Feng, S. Gupta, A. Ansari, and S. Mahlke. Shoestring: probabilistic softerror reliability on the cheap. In ACM SIGARCH Computer ArchitectureNews, volume 38, pages 385–396. ACM, 2010. → pages2, 9, 10, 12, 14, 15, 17, 24, 49, 50, 69, 72, 74, 76, 84, 101, 131, 138[53] Fernandes, Fernando and Weigel, Lucas and Jung, Claudio and Navaux,Philippe and Carro, Luigi and Rech, Paolo. Evaluation of histogram oforiented gradients soft errors criticality for automotive applications. ACMTransactions on Architecture and Code Optimization (TACO), 13(4):38,2016. → page 14[54] P. Folkesson and J. Karlsson. The effects of workload input domain onfault injection results. In European Dependable Computing Conference,pages 171–190, 1999. → pages 11, 79, 81, 82[55] A. Geist. How to kill a supercomputer: Dirty power, cosmic rays, and badsolder. ACM, 2016. → page 104[56] Gill, B., N. Seifert, and V. Zia. Comparison of alpha-particle andneutron-induced combinational and sequential logic error rates at the 32nmtechnology node. In Proceedings of the International Reliability PhysicsSymposium (IRPS), 2009. → page 136[57] Gokhale, Vinayak, Jonghoon Jin, Aysegul Dundar, Berin Martini, andEugenio Culurciello. A 240 g-ops/s mobile coprocessor for deep neural167networks. In In Proceedings of the IEEE Conference on Computer Visionand Pattern Recognition Workshops, 2014. → pages 22, 135[58] W. Gu, Z. Kalbarczyk, R. K. Iyer, and Z. Yang. Characterization of linuxkernel behavior under errors. In International Conference on DependableSystems and Networks. IEEE Computer Society, 2003. 
→ pages9, 24, 26, 104[59] Guanpeng Li, Karthik Pattabiraman, Siva Kumar Sastry Hari, MichaelSullivan and Timothy Tsai. Modeling soft-error propagation in programs.In IEEE/IFIP International Conference on Dependable Systems andNetworks (DSN). IEEE, 2018. → pages vi, 12, 81, 85, 91, 96, 98[60] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deeplearning with limited numerical precision. In International Conference onMachine Learning, pages 1737–1746, 2015. → pages 22, 135[61] S. Gupta, T. Patel, C. Engelmann, and D. Tiwari. Failures in large scalesystems: long-term measurement, analysis, and implications. InProceedings of the International Conference for High PerformanceComputing, Networking, Storage and Analysis, page 44. ACM, 2017. →pages 10, 11[62] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally.Eie: efficient inference engine on compressed deep neural network. InComputer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual InternationalSymposium on, pages 243–254. IEEE, 2016. → pages 14, 20, 130[63] S. Hari, T. Tsai, M. Stephenson, S. Keckler, and J. Emer. Sassifi:Evaluating resilience of GPU applications. In SELSE: IEEE Workshop ofSilicon Errors in Logic. IEEE, 2015. → pages 12, 104, 105, 108, 113, 115[64] S. K. S. Hari, S. V. Adve, and H. Naeimi. Low-cost program-leveldetectors for reducing silent data corruptions. In Dependable Systems andNetworks (DSN), 42nd Annual IEEE/IFIP International Conference on,pages 1–12. IEEE, 2012. → pages 2, 24, 46, 65, 131, 137, 138, 151[65] S. K. S. Hari, S. V. Adve, H. Naeimi, and P. Ramachandran. Relyzer:exploiting application-level fault equivalence to analyze applicationresiliency to transient faults. In ACM SIGARCH Computer ArchitectureNews, volume 40, pages 123–134. ACM, 2012. → pages2, 9, 12, 15, 50, 81, 115, 159168[66] S. K. S. Hari, R. Venkatagiri, S. V. Adve, and H. Naeimi. Ganges: Gangerror simulation for hardware resiliency evaluation. In ACM/IEEE 41stInternational Symposium on Computer Architecture (ISCA), pages 61–72.IEEE, 2014. → pages 9, 12, 17, 50, 81[67] He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residuallearning for image recognition. Proceedings of the IEEE conference oncomputer vision and pattern recognition, 2015. → page 19[68] J. L. Henning. Spec cpu2000: Measuring cpu performance in the newmillennium. Computer, 33(7):28–35, 2000. → pages 25, 27, 38, 64, 83[69] M. Hiller, A. Jhumka, and N. Suri. Propane: an environment for examiningthe propagation of errors in software. In ACM SIGSOFT SoftwareEngineering Notes, volume 27, pages 81–85. ACM, 2002. → page 105[70] https://asc.llnl.gov/CORAL-benchmarks/. Coral benchmarks. → page 84[71] https://github.com/coExp/Graph. Github. → page 84[72] https://github.com/karimnaaji/fft. Github. → page 84[73] K.-H. Huang and J. A. Abraham. Algorithm-based fault tolerance formatrix operations. volume 100, pages 518–528. IEEE, 1984. → page 127[74] IEC 61508. Functional Safety and IEC 61508, 2016. URLhttp://www.iec.ch/functionalsafety/. → page 130[75] ImageNet. ImageNet, 2014. URL http://image-net.org. → page 134[76] H. Jeon and M. Annavaram. Warped-DMR: Light-weight error detectionfor GPUGPU. In IEEE/ACM International Symposium onMicroarchitecture, pages 37–47. IEEE Computer Society, 2012. → page 13[77] Judd, Patrick, Jorge Albericio, Tayler Hetherington, Tor M. Aamodt,Natalie Enright Jerger, and Andreas Moshovos. Proteus: Exploitingnumerical precision variability in deep neural networks. 2016. → pages134, 150[78] G. A. Kanawati, N. A. Kanawati, and J. 
A. Abraham. Ferrari: A flexiblesoftware-based fault and error injection system. IEEE Transactions oncomputers, (2):248–260, 1995. → page 17169[79] I. Karlin. Lulesh programming model and performance ports overview.https://codesign.llnl.gov/pdfs/lulesh Ports.pdf. [Accessed Apr. 2016]. →pages 64, 120[80] I. Karlin, A. Bhatele, B. L. Chamberlain, J. Cohen, Z. Devito, M. Gokhale,R. Haque, R. Hornung, J. Keasler, D. Laney, et al. Lulesh programmingmodel and performance ports overview. 2012. → page 115[81] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. Imagenetclassification with deep convolutional neural networks. In Advances inneural information processing systems, 2012. → pages 19, 134, 150[82] I. Laguna, M. Schulz, D. F. Richards, J. Calhoun, and L. Olson. Ipas:Intelligent protection against silent output corruption in scientificapplications. In Proceedings of the 2016 International Symposium on CodeGeneration and Optimization, pages 227–238. ACM, 2016. → pages65, 137[83] Lane, Nicholas D., and Petko Georgiev. Can deep learning revolutionizemobile sensing? In In Proceedings of the 16th International Workshop onMobile Computing Systems and Applications, 2015. → page 131[84] A. Lanzaro, R. Natella, S. Winter, D. Cotroneo, and N. Suri. An empiricalstudy of injected versus actual interface errors. In Proceedings of theInternational Symposium on Software Testing and Analysis, pages397–408. ACM, 2014. → page 8[85] C. Lattner and V. Adve. Llvm: A compilation framework for lifelongprogram analysis & transformation. In Code Generation and Optimization.International Symposium on, pages 75–86. IEEE, 2004. → pages17, 25, 37, 50, 105[86] LeCun, Yann, Koray Kavukcuoglu, and Clment Farabet. Convolutionalnetworks and applications in vision. In Proceedings of IEEE InternationalSymposium on Circuits and Systems, 2010. → page 18[87] G. Li and K. Pattabiraman. Modeling input-dependent error propagation inprograms. In Proceedings of the International Conference on DependableSystems and Networks (DSN), 2018. → page vi[88] G. Li, Q. Lu, and K. Pattabiraman. Fine-grained characterization of faultscausing long latency crashes in programs. In IEEE/IFIP InternationalConference on Dependable Systems and Networks (DSN), pages 450–461.170IEEE, 2015. → pagesv, 10, 12, 17, 81, 84, 101, 104, 107, 127, 131, 137, 151[89] G. Li, K. Pattabiraman, C.-Y. Cher, and P. Bose. Experience report: Anapplication-specific checkpointing technique for minimizing checkpointcorruption. In 26th International Symposium on Software ReliabilityEngineering (ISSRE), pages 141–152. IEEE, 2015. → pages101, 104, 105, 125, 126, 127[90] G. Li, K. Pattabiraman, C.-Y. Cher, and P. Bose. Understanding errorpropagation in GPGPU applications. In International Conference for HighPerformance Computing, Networking, Storage and Analysis, pages240–251. IEEE, 2016. → pages v, 9, 17, 69[91] G. Li, S. K. S. Hari, M. Sullivan, T. Tsai, K. Pattabiraman, J. Emer, andS. W. Keckler. Understanding error propagation in deep learning neuralnetwork (dnn) accelerators and applications. In Proceedings of theInternational Conference for High Performance Computing, Networking,Storage and Analysis, page 8. ACM, 2017. → page vi[92] G. Li, K. Pattabiraman, and N. DeBardeleben. Tensorfi: A configurablefault injector for tensorflow applications. In IEEE International Workshopon Software Certification (WoSoCer), 2018.[93] Li, Guanpeng and Pattabiraman, Karthik and Cher, Chen-Yong and Bose,Pradip. Understanding error propagation in gpgpu applications. 
In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), 2016. → pages 137, 138
[94] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013. → page 134
[95] S. Liu, K. Pattabiraman, T. Moscibroda, and B. G. Zorn. Flikker: Saving DRAM refresh-power through critical data partitioning. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 213–224. ACM, 2011. → pages 1, 2
[96] Q. Lu, M. Farahani, J. Wei, A. Thomas, and K. Pattabiraman. LLFI: An intermediate code level fault injector for hardware faults. In International Conference on Quality, Reliability and Security (QRS). IEEE, 2015. → page 117
[97] Q. Lu, G. Li, K. Pattabiraman, M. S. Gupta, and J. A. Rivers. Configurable detection of SDC-causing errors in programs. ACM Transactions on Embedded Computing Systems (TECS), 16(3):88, 2017. → pages 2, 4, 10, 12, 14, 15, 24, 50, 72, 73, 75, 84, 101
[98] A. Mahmoud, R. Venkatagiri, K. Ahmed, S. V. Adve, D. Marinov, and S. Misailovic. Leveraging software testing to explore input dependence for approximate computing. In Workshop on Approximate Computing Across the Stack (WAX), 2017. → page 11
[99] G. B. Mathews. On the partition of numbers. Proceedings of the London Mathematical Society, 1(1):486–490, 1896. → pages 72, 101
[100] S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt, and T. Austin. A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor. In Proceedings of the International Symposium on Microarchitecture (MICRO), pages 29–40. IEEE, 2003. → pages 4, 10
[101] O. Mutlu. The RowHammer problem and other issues we may face as memory becomes denser. In Proceedings of the Conference on Design, Automation & Test in Europe, pages 1116–1121. European Design and Automation Association, 2017. → page 160
[102] S. Narayanan, J. Sartori, R. Kumar, and D. L. Jones. Scalable stochastic processors. In Proceedings of the Conference on Design, Automation and Test in Europe, pages 335–338. European Design and Automation Association, 2010. → page 2
[103] A. Neale and M. Sachdev. Neutron radiation induced soft error rates for an adjacent-ECC protected SRAM in 28 nm CMOS. 2016. → page 138
[104] B. Nie, L. Yang, A. Jog, and E. Smirni. Fault site pruning for practical reliability analysis of GPGPU applications. 2018. → page 159
[105] N. Oh, P. P. Shirvani, and E. J. McCluskey. Control-flow checking by software signatures. IEEE Transactions on Reliability, 51(1):111–122, 2002. → page 15
[106] N. Oh, P. P. Shirvani, and E. J. McCluskey. Error detection by duplicated instructions in super-scalar processors. IEEE Transactions on Reliability, 51(1):63–75, 2002. → pages 13, 106, 127
[107] S. Park, K. Bong, D. Shin, J. Lee, S. Choi, and H.-J. Yoo. A 1.93 TOPS/W scalable deep learning/inference processor with tetra-parallel MIMD architecture for big-data applications. In International Solid-State Circuits Conference (ISSCC). → pages 22, 135
[108] K. Pattabiraman, G. P. Saggese, D. Chen, Z. Kalbarczyk, and R. K. Iyer. Automated derivation of application-specific error detectors using dynamic analysis. Volume 8, pages 640–655. IEEE, 2011. → pages 126, 151
[109] M. Peemen, A. A. A. Setio, B. Mesman, and H. Corporaal. Memory-centric accelerator design for convolutional neural networks. In IEEE 31st International Conference on Computer Design (ICCD), 2013. → pages 22, 135
[110] A. J. Peña, W. Bland, and P. Balaji. VOCL-FT: Introducing techniques for efficient soft error coprocessor recovery. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), page 71. ACM, 2015. → page 13
[111] V. Piuri. Analysis of fault tolerance in artificial neural networks. Journal of Parallel and Distributed Computing, 2001. → page 13
[112] L. Rashid, K. Pattabiraman, and S. Gopalakrishnan. Modeling the propagation of intermittent hardware faults in programs. In IEEE 16th Pacific Rim International Symposium on Dependable Computing (PRDC), pages 19–26. IEEE, 2010. → page 16
[113] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernández-Lobato, G.-Y. Wei, and D. Brooks. Minerva: Enabling low-power, highly-accurate deep neural network accelerators. In Proceedings of the International Symposium on Computer Architecture (ISCA), pages 267–278, 2016. → page 14
[114] D. A. Reed and J. Dongarra. Exascale computing and big data. Volume 58, pages 56–68. ACM, 2015. → page 104
[115] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D. I. August. SWIFT: Software implemented fault tolerance. In Proceedings of the International Symposium on Code Generation and Optimization, pages 243–254. IEEE Computer Society, 2005. → page 13
[116] V. Reva. An efficient CUDA implementation of the tree-based Barnes-Hut n-body algorithm. Elsevier, 2014. → page 115
[117] V. Robert and X. Leroy. A formally-verified alias analysis. In Certified Programs and Proofs, pages 11–26. Springer, 2012. → page 47
[118] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015. → page 129
[119] Safety Standard. ISO-26262: Road Vehicles Functional Safety, 2016. URL https://en.wikipedia.org/wiki/ISO_26262. → pages 22, 130
[120] A. Sampson, W. Dietl, E. Fortuna, D. Gnanapragasam, L. Ceze, and D. Grossman. EnerJ: Approximate data types for safe and general low-power computation. In ACM SIGPLAN Notices, volume 46, pages 164–174. ACM, 2011. → pages 1, 2
[121] B. Sangchoolie, K. Pattabiraman, and J. Karlsson. One bit is (not) enough: An empirical study of the impact of single and multiple bit-flip errors. In International Conference on Dependable Systems and Networks (DSN), pages 97–108. IEEE, 2017. → pages 17, 64, 76
[122] M. Sankaradas, V. Jakkula, S. Cadambi, S. Chakradhar, I. Durdanovic, E. Cosatto, and H. P. Graf. A massively parallel coprocessor for convolutional neural networks. In IEEE International Conference on Application-specific Systems, Architectures and Processors, 2009. → pages 22, 135
[123] S. K. Sastry Hari, M.-L. Li, P. Ramachandran, B. Choi, and S. V. Adve. mSWAT: Low-cost hardware fault detection and diagnosis for multicore systems. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 122–132. ACM, 2009. → page 27
[124] S. K. Sastry Hari, R. Venkatagiri, S. V. Adve, and H. Naeimi. GangES: Gang error simulation for hardware resiliency evaluation. ACM SIGARCH Computer Architecture News, 42(3):61–72, 2014. → pages 2, 160
[125] N. Seifert, B. Gill, S. Jahinuzzaman, J. Basile, V. Ambrose, Q. Shi, R. Allmon, and A. Bramnik. Soft error susceptibilities of 22 nm tri-gate devices. 2012. → page 136
[126] D. Shapiro. Introducing Xavier, the NVIDIA AI supercomputer for the future of autonomous transportation, 2016. URL https://blogs.nvidia.com/blog/2016/09/28/xavier/. → page 130
[127] V. C. Sharma, A. Haran, Z. Rakamaric, and G. Gopalakrishnan. Towards formal approaches to system resilience. In Pacific Rim International Symposium on Dependable Computing (PRDC), pages 41–50. IEEE, 2013. → page 105
[128] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. → page 134
[129] M. Snir, R. W. Wisniewski, J. A. Abraham, S. V. Adve, S. Bagchi, P. Balaji, J. Belak, P. Bose, F. Cappello, B. Carlson, et al. Addressing failures in exascale computing. Institute for Computing in Science (ICiS), 4:11, 2012. → page 49
[130] V. Sridharan and D. R. Kaeli. Eliminating microarchitectural dependency from architectural vulnerability. In 15th International Symposium on High Performance Computer Architecture. → pages 9, 50, 55, 74, 76
[131] V. Sriram, D. Cox, K. H. Tsoi, and W. Luk. Towards an embedded biologically-inspired machine vision processor. In Field-Programmable Technology, 2010. → pages 22, 135
[132] D. T. Stott, B. Floering, D. Burke, Z. Kalbarczyk, and R. K. Iyer. NFTAPE: A framework for assessing dependability in distributed systems with lightweight fault injectors. In Proceedings of the IEEE International Computer Performance and Dependability Symposium (IPDS 2000), pages 91–100. IEEE, 2000. → page 17
[133] J. A. Stratton, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, N. Anssari, G. D. Liu, and W.-m. W. Hwu. Parboil: A revised benchmark suite for scientific and commercial throughput computing. Center for Reliable and High-Performance Computing, 2012. → pages 25, 27, 38, 114
[134] Student. The probable error of a mean. Biometrika, pages 1–25, 1908. → pages 67, 68
[135] M. Sullivan, B. Zimmer, S. Hari, T. Tsai, and S. W. Keckler. An analytical model for hardened latch selection and exploration. 2016. → page 153
[136] R. Taborda and J. Bielak. Large-scale earthquake simulation: Computational seismology and complex engineering systems. Computing in Science & Engineering, 13(4):14–27, 2011. → page 64
[137] J. Tan, N. Goswami, T. Li, and X. Fu. Analyzing soft-error vulnerability on GPGPU microarchitecture. In IEEE International Symposium on Workload Characterization (IISWC), pages 226–235. IEEE, 2011. → page 13
[138] I. Tanasic, I. Gelado, J. Cabezas, A. Ramirez, N. Navarro, and M. Valero. Enabling preemptive multiprogramming on GPUs. In ACM SIGARCH Computer Architecture News, volume 42, pages 193–204. IEEE Press, 2014. → page 111
[139] D. Tao, S. L. Song, S. Krishnamoorthy, P. Wu, X. Liang, E. Z. Zhang, D. Kerbyson, and Z. Chen. New-Sum: A novel online ABFT scheme for general iterative methods. In Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, pages 43–55. ACM, 2016. → page 11
[140] O. Temam. A defect-tolerant accelerator for emerging high-performance applications. In Proceedings of the International Symposium on Computer Architecture (ISCA), pages 356–367, 2012. → page 14
[141] A. Thomas and K. Pattabiraman. Error detector placement for soft computation. In 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 1–12. IEEE, 2013. → page 24
[142] Tiny-CNN. Tiny-CNN Framework. URL https://github.com/nyanp/tiny-cnn. → page 136
[143] J. J. Tithi, N. C. Crago, and J. S. Emer. Exploiting spatial architectures for edit distance algorithms. In Proceedings of the International Symposium on Performance Analysis of Systems and Software (ISPASS), 2014. → page 20
[144] D. Tiwari, S. Gupta, J. Rogers, D. Maxwell, P. Rech, S. Vazhkudai, D. Oliveira, D. Londo, N. DeBardeleben, P. Navaux, et al. Understanding GPU errors on large-scale HPC systems and the implications for system design and operation. In 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pages 331–342. IEEE, 2015. → page 104
[145] TPU. Google supercharges machine learning tasks with TPU custom chip. URL https://cloudplatform.googleblog.com/2016/05/Google-supercharges-machine-learning-tasks-with-custom-chip.html. → page 130
[146] N. J. Wang and S. J. Patel. ReStore: Symptom-based soft error detection in microprocessors. IEEE Transactions on Dependable and Secure Computing, 3(3):188–201, 2006. → page 24
[147] J. Wei, A. Thomas, G. Li, and K. Pattabiraman. Quantifying the accuracy of high-level fault injection techniques for hardware faults. In 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2014. → pages 2, 4, 9, 17, 27, 37, 39, 64, 69, 76, 84, 105, 106, 110, 137, 138
[148] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In ACM SIGARCH Computer Architecture News, volume 23, pages 24–36. ACM, 1995. → pages 25, 27, 38, 83
[149] Y. LeCun. Deep learning and the future of AI, 2000. URL https://indico.cern.ch/event/510372/. → page 130
[150] K. S. Yim, Z. T. Kalbarczyk, and R. K. Iyer. Quantitative analysis of long-latency failures in system software. In 15th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC'09), pages 23–30. IEEE, 2009. → pages 3, 4, 8, 16, 24, 26, 84, 104, 115, 125
[151] K. S. Yim, C. Pham, M. Saleheen, Z. Kalbarczyk, and R. Iyer. Hauberk: Lightweight silent data corruption error detector for GPGPU. In International Parallel & Distributed Processing Symposium (IPDPS), page 287. IEEE, 2011. → pages 12, 13, 84, 113
[152] L. Yu, Y. Zhang, X. Gong, N. Roy, L. Makowski, and D. Kaeli. High performance computing of fiber scattering simulation. In Proceedings of the 8th Workshop on General Purpose Processing using GPUs, pages 90–98. ACM, 2015. → page 115
[153] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2015. → pages 22, 135
