UBC Theses and Dissertations


Understanding and improving the error resilience of machine learning systems

Chen, Zitao


With the massive adoption of machine learning (ML) applications in HPC domains, the reliability of ML is also growing in importance. Specifically, ML systems are found to be vulnerable to hardware transient faults, which are growing in frequency and can result in critical failures (e.g., causing an autonomous vehicle to miss an obstacle in its path). Therefore, there is a compelling need to understand the error resilience of ML systems and protect them from transient faults.

In this thesis, we first aim to understand the error resilience of ML systems in the presence of transient faults. Traditional solutions use random fault injection (FI), which is poorly suited to pinpointing the vulnerable regions of a system. We therefore propose BinFI, an efficient fault injector for finding the critical bits (those where a fault would corrupt the output) in ML systems. We find that widely used ML computations are often monotonic with respect to different faults, so we can approximate the error propagation behavior of an ML application as a monotonic function. BinFI uses a binary-search-like FI strategy to pinpoint the critical bits. Our results show that BinFI significantly outperforms random FI in identifying the critical bits of an ML application, at much lower cost.

With BinFI able to characterize the critical faults in ML systems, we then study how to improve their error resilience. While the inherent resilience of ML can tolerate some transient faults (those that do not affect the system's output), critical faults still cause output corruption. In this work, we exploit the inherent resilience of ML to protect ML systems from critical faults. In particular, we propose Ranger, a technique that selectively restricts the ranges of values in particular network layers, which dampens the large deviations typically caused by critical faults into smaller ones. Such reduced deviations can usually be tolerated by the inherent resilience of ML systems. Our evaluation demonstrates that Ranger significantly boosts resilience without degrading the accuracy of the model, and incurs negligible overhead.
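The two ideas in the abstract can be illustrated with a small toy sketch. It is not the thesis implementation: the helper names (`flip_bit`, `lowest_critical_bit`, `restrict_range`), the one-line `model`, the 0.1 deviation threshold, and the `[0.0, 1.0]` bounds are all hypothetical choices made for illustration. The sketch assumes that, among the zero-valued non-sign bits of a float32 value, a flip at a higher bit position causes a deviation at least as large as a flip at a lower one, so a binary search can locate the boundary between benign and critical bits.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) in the float32 encoding of `value`."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (out,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return out


def lowest_critical_bit(bits, is_critical):
    """Binary search for the first critical bit in `bits` (sorted low to high).

    Assumes monotonicity: if a bit is critical, every later bit in `bits`
    is critical too. Returns None when no bit is critical.
    """
    lo, hi = 0, len(bits)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_critical(bits[mid]):
            hi = mid          # boundary is at mid or earlier
        else:
            lo = mid + 1      # boundary is after mid
    return bits[lo] if lo < len(bits) else None


def restrict_range(x: float, low: float, high: float) -> float:
    """Clamp x into [low, high]; a stand-in for a range-restriction step."""
    return max(low, min(high, x))


# Toy setup (all values are illustrative assumptions).
x = 0.5


def model(v: float) -> float:
    return 2.0 * v  # hypothetical monotonic computation


def is_critical(bit: int) -> bool:
    # A fault counts as critical here if the output deviates by more than 0.1.
    return abs(model(flip_bit(x, bit)) - model(x)) > 0.1


# Zero-valued bits of x, excluding the sign bit (bit 31).
(raw_x,) = struct.unpack("<I", struct.pack("<f", x))
zero_bits = [b for b in range(31) if not (raw_x >> b) & 1]

boundary = lowest_critical_bit(zero_bits, is_critical)
print("lowest critical bit:", boundary)

# A flip in a high exponent bit produces a huge value; clamping dampens it.
faulty = flip_bit(x, 30)
print("faulty:", faulty, "restricted:", restrict_range(faulty, 0.0, 1.0))
```

The binary search needs only about log2(n) injections to classify n candidate bits, whereas an exhaustive sweep needs n. Whether the clamped deviation is actually tolerated depends on the model, so the bounds would have to be chosen for each layer in practice.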



Attribution-NonCommercial-NoDerivatives 4.0 International