Open Collections
UBC Theses and Dissertations
Failure analysis and prediction in compute clouds
Chen, Xin
Abstract
Unlike supercomputer clusters, most cloud computing clusters are built from unreliable, commercial off-the-shelf components. The high failure rates of their hardware and software components result in frequent node and application failures. Understanding these failures is therefore important for designing a reliable cloud system. This thesis presents a characterization study of cloud application failures and proposes a method to predict application failures in order to save resources.
We first analyze a workload trace from a production cloud cluster and characterize the observed failures. The goal of our work is to improve the understanding of failures in compute clouds. We present the statistical properties of job and task failures, and attempt to correlate them with key scheduling constraints, node operations, and attributes of users in the cloud. We observe that there are many opportunities to enhance the reliability of the applications running in the cloud, and further find that resource usage patterns of the jobs can be leveraged by failure prediction techniques.
Next, we propose a prediction method based on recurrent neural networks to identify the failures. It takes resource usage measurements or performance data and generates features that categorize the applications into different classes. We then evaluate the method on the cloud workload trace. Our results show that the model is able to predict application failures. Moreover, we explore early classification to identify failures, and find that the prediction algorithm gives the cloud system enough time to take proactive action well before applications terminate, avoiding resource wastage.
Item Metadata
Title |
Failure analysis and prediction in compute clouds
|
Creator | |
Publisher |
University of British Columbia
|
Date Issued |
2014
|
Genre | |
Type | |
Language |
eng
|
Date Available |
2014-10-23
|
Provider |
Vancouver : University of British Columbia Library
|
Rights |
Attribution-NonCommercial-NoDerivs 2.5 Canada
|
DOI |
10.14288/1.0167620
|
URI | |
Degree | |
Program | |
Affiliation | |
Degree Grantor |
University of British Columbia
|
Graduation Date |
2014-11
|
Campus | |
Scholarly Level |
Graduate
|
Rights URI | |
Aggregated Source Repository |
DSpace
|