Understanding and improving the error resilience of machine learning systems

UBC Theses and Dissertations

Understanding and improving the error resilience of machine learning systems Chen, Zitao

Abstract

With the widespread adoption of machine learning (ML) applications in high-performance computing (HPC) domains, the reliability of ML is growing in importance. Specifically, ML systems have been found to be vulnerable to hardware transient faults, which are growing in frequency and can result in critical failures (e.g., causing an autonomous vehicle to miss an obstacle in its path). Therefore, there is a compelling need to understand the error resilience of ML systems and to protect them from transient faults.

In this thesis, we first aim to understand the error resilience of ML systems in the presence of transient faults. Traditional solutions use random fault injection (FI), which is ill-suited to pinpointing the vulnerable regions of a system. We therefore propose BinFI, an efficient fault injector for finding the critical bits (those where a fault would corrupt the output) in ML systems. We find that widely used ML computations are often monotonic with respect to different faults, so the error propagation behavior of an ML application can be approximated as a monotonic function. BinFI exploits this with a binary-search-like FI strategy to pinpoint the critical bits. Our results show that BinFI significantly outperforms random FI in identifying the critical bits of an ML application, at much lower cost.

With BinFI able to characterize the critical faults in ML systems, we then study how to improve their error resilience. While the inherent resilience of ML can tolerate some transient faults (those that do not affect the system's output), critical faults still cause output corruption. In this work, we exploit the inherent resilience of ML to protect ML systems from critical faults. In particular, we propose Ranger, a technique that selectively restricts the ranges of values in particular network layers, dampening the large deviations typically caused by critical faults into smaller ones. Such reduced deviations can usually be tolerated by the inherent resilience of ML systems. Our evaluation demonstrates that Ranger significantly boosts resilience without degrading the accuracy of the model, and incurs negligible overhead.
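
The two ideas in the abstract can be illustrated with a minimal Python sketch. This is not the thesis implementation: the function names (`flip_bit`, `find_critical_boundary`, `make_toy_oracle`, `restrict_range`), the integer value encoding, and the tolerance-based oracle are all hypothetical simplifications chosen for illustration. The sketch assumes the monotonicity property described above, i.e. that if flipping one bit corrupts the output, flipping any higher-order bit does too, so a binary search over bit positions can locate the boundary between benign and critical bits. The range restriction is shown as a plain clamp of activation values to assumed bounds.

```python
from typing import Callable, List


def flip_bit(value: int, bit: int) -> int:
    """Return `value` with one bit flipped (models a single bit-flip fault)."""
    return value ^ (1 << bit)


def find_critical_boundary(n_bits: int,
                           corrupts_output: Callable[[int], bool]) -> int:
    """Binary search for the lowest bit position whose flip corrupts the output.

    Assumption (illustrative): criticality is monotonic in the bit position,
    so every bit at or above the returned index is critical and every bit
    below it is benign. Returns n_bits if no bit is critical.
    """
    lo, hi = 0, n_bits  # invariant: bits < lo are benign, bits >= hi are critical
    while lo < hi:
        mid = (lo + hi) // 2
        if corrupts_output(mid):   # one fault-injection experiment
            hi = mid
        else:
            lo = mid + 1
    return lo


def make_toy_oracle(value: int, tolerance: int) -> Callable[[int], bool]:
    """Hypothetical stand-in for running the model with an injected fault:
    the output counts as corrupted when the deviation exceeds `tolerance`."""
    def corrupts_output(bit: int) -> bool:
        return abs(flip_bit(value, bit) - value) > tolerance
    return corrupts_output


def restrict_range(activations: List[float], low: float, high: float) -> List[float]:
    """Clamp each activation into [low, high], so that a fault-induced
    outlier is reduced to a bounded deviation."""
    return [min(max(a, low), high) for a in activations]


if __name__ == "__main__":
    oracle = make_toy_oracle(value=0b1010, tolerance=100)
    print(find_critical_boundary(16, oracle))         # 7
    print(restrict_range([0.5, 1e9, -3.0], 0.0, 6.0))  # [0.5, 6.0, 0.0]
```

In this toy setting, flipping bit k always changes the value by 2^k, so bits 7 and above exceed the tolerance of 100, and the search finds that boundary with four injections instead of sixteen. The clamp bounds (0.0 and 6.0) are arbitrary here; in practice they would have to be derived from values observed during fault-free execution.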

Item Metadata

Title	Understanding and improving the error resilience of machine learning systems
Creator	Chen, Zitao
Publisher	University of British Columbia
Date Issued	2020
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2020-04-17
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NonCommercial-NoDerivatives 4.0 International
DOI	10.14288/1.0389875
URI	http://hdl.handle.net/2429/74063
Degree (Theses)	Master of Applied Science - MASc
Program (Theses)	Electrical and Computer Engineering
Affiliation	Applied Science, Faculty of; Electrical and Computer Engineering, Department of
Degree Grantor	University of British Columbia
Graduation Date	2020-05
Campus	UBCV
Scholarly Level	Graduate
Rights URI	http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository	DSpace
