UBC Theses and Dissertations


Failure analysis and prediction in compute clouds Chen, Xin 2014


Full Text


Failure Analysis and Prediction in Compute Clouds

by

Xin Chen

B.E., University of Science and Technology of China, 2011

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in The Faculty of Graduate and Postdoctoral Studies (Electrical and Computer Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

October 2014

© Xin Chen 2014

Abstract

Unlike supercomputer clusters, most cloud computing clusters are built from unreliable, commercial off-the-shelf components. The high failure rates of their hardware and software components result in node and application failures. It is therefore important to understand these failures in order to design a reliable cloud system. This thesis presents a characterization study of cloud application failures, and proposes a method to predict application failures in order to save resources.

We first analyze a workload trace from a production cloud cluster and characterize the observed failures. The goal of our work is to improve the understanding of failures in compute clouds. We present the statistical properties of job and task failures, and attempt to correlate them with key scheduling constraints, node operations, and attributes of users in the cloud. We observe that there are many opportunities to enhance the reliability of the applications running in the cloud, and further find that the resource usage patterns of the jobs can be leveraged by failure prediction techniques.

Next, we propose a prediction method based on recurrent neural networks to identify the failures. It takes the resource usage measurements or performance data, and generates features to categorize the applications into different classes. We then evaluate the method on the cloud workload trace. Our results show that the model is able to predict application failures.
Moreover, we explore early classification to identify failures, and find that the prediction algorithm gives the cloud system enough time to take proactive actions well before the applications terminate, thereby avoiding resource wastage.

Preface

This thesis is based on projects carried out in collaboration with Dr. Karthik Pattabiraman and Dr. Charng-Da Lu. The work will be published as papers at the 25th IEEE International Symposium on Software Reliability Engineering (ISSRE):

Xin Chen, Charng-Da Lu and Karthik Pattabiraman. Failure Analysis of Jobs in Compute Clouds: A Google Cluster Case Study. In the 25th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 2014 [9]. (Acceptance Rate: 25%)

Xin Chen, Charng-Da Lu and Karthik Pattabiraman. Failure Prediction of Jobs in Compute Clouds: A Google Cluster Case Study. In the International Workshop on Reliability and Security Data Analysis (RSDA), co-located with ISSRE. IEEE, 2014 [10].

I was the main author, and was responsible for developing the research methodology, evaluating the solution, and writing the papers. Karthik and Charng-Da guided me in the experiment design and result analysis, as well as in editing the papers.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
List of Acronyms
Acknowledgements
Dedication
1 Introduction
1.1 Proposed Approach
1.2 Contributions
1.3 Thesis Structure
2 Background
2.1 Virtualization in the Cloud
2.2 Google Dataset
2.3 Failure Prediction Methods
2.3.1 Recurrent Neural Networks (RNNs)
2.3.2 Ensemble Methods
3 Failure Characterization Study
3.1 Basic Failure Distributions
3.2 Task Resubmissions
3.3 Scheduling Constraints
3.4 Node Failures
3.5 Resource Usage
3.5.1 Usage at the Job Level
3.5.2 Usage at the Task Level
3.6 User-centric Analysis
3.7 Summary
4 Failure Prediction
4.1 Failure Prediction Framework
4.2 Design of the Predictor
4.3 Experimental Setup
4.4 Experiment Results
4.4.1 Prediction Overheads
4.4.2 User Based Optimization
4.5 Summary
5 Discussion
5.1 Implication of the Characterization Results
5.2 Threats to Validity of the Characterization
5.3 Discussions on the Predictor
5.3.1 Advantages of the Predictor
5.3.2 Limitations of the Predictor
5.3.3 Threats to Validity of the Prediction
6 Related Work
7 Conclusion and Future Work
Bibliography

List of Tables

2.1 Collected resources in the Google dataset.
2.2 Numbers of job and task events.
2.3 Definitions of job, task and node failures.
3.1 Fitting of log-normal distribution on job duration. µ and σ are the parameters of the log-normal distribution, and KS is the maximal distance between distributions in the test.
3.2 Fitting of log-normal distribution on CPU/memory usage.
3.3 Task priorities.
3.4 K-means clustering on user profiles.
The features for clustering are the ratios of evict, fail, finish and kill events in both jobs and tasks. The statistics of job attributes and resource usage are averages, and # represents the number of a variable. In the user attributes, a user is called a production user if production jobs account for more than 20% of all its jobs.

List of Figures

2.1 General structures of VMs and containers.
2.2 General infrastructure.
2.3 Failed jobs in the period of one month.
2.4 Recurrent neural network.
3.1 Distribution of duration of failed, finished and killed jobs.
3.2 Distribution of normalized CPU/memory usage of failed, finished and killed jobs. The original units of CPU and memory are core-seconds/second and bytes. Both measurements are normalized by the respective maximal measured value.
3.3 Numbers of failed, finished and killed jobs with task re-executions. The x-axis is the maximum resubmission count of all tasks in a job. A value of 1 means that all tasks in the job are executed once. The y-axis is the cumulative distribution function (CDF).
3.4 Failed, finished and killed jobs in different scheduling classes.
3.5 Task submissions by priority and termination status.
3.6 Task submissions by priority and termination status (excluding resubmissions).
3.7 Distribution of machine cycles.
3.8 Average ratio of failed tasks vs. machine cycles. The x-axis is the machine cycle number. In cycle number k, data points in the kth cycle are chosen from the machines that have no less than k cycles.
The y-axis is the ratio of failed tasks.
3.9 The failure rate in a machine life cycle. The y-axis is the mean failure rate in a machine life cycle, and the x-axis is the number of machine removals. Every machine life cycle is represented by a blue dot, and the machines having more than 5 updates are plotted in red.
3.10 CPU and memory usage of jobs under different combinations of job and scheduling parameters.
3.11 Ratios of resources (CPU and memory) consumed by failed jobs to those consumed by finished jobs, among resubmitted jobs. The resource is calculated by dividing the total job resource by the number of tasks. The x-axis shows the ratio of the average resource in failed jobs to that in finished jobs, and the y-axis represents the cumulative distribution function (CDF). The vertical red line marks a ratio of 1.
3.12 p-values of normalized resource usage in the rank-sum tests of each pair of categories.
3.13 p-values of normalized resource usage in the rank-sum tests of failed and finished executions. The resources from the beginning to 50%, 80% and 90% of the running time of jobs longer than 10 minutes are collected.
4.1 General framework of prediction.
4.2 Prediction modules.
4.3 Approximate relative savings in CPU usage in the predictor designs.
4.4 Task level results of metrics.
4.5 Job level results of metrics.
4.6 Relative savings of resources (CPU usage, memory usage and task hours) in the groups of high resource consumption.
4.7 Resource savings of the original predictor and the user-based optimization.

List of Acronyms

CPI  Cycles Per Instruction
FPR  False Positive Rate
HMM  Hidden Markov Model
KS   Kolmogorov-Smirnov
MAI  Memory Accesses Per Instruction
MTBF Mean Time Between Failures
MTTR Mean Time To Repair
NN   Neural Network
RNN  Recurrent Neural Network
TPR  True Positive Rate
VM   Virtual Machine

Acknowledgements

First, I would like to thank my advisor Dr. Karthik Pattabiraman for his support over the past two years. Karthik taught me how to do research, and gave me insightful advice in our weekly meetings. Karthik is also a mentor who encouraged me to communicate with researchers at seminars and conferences. Without his continuous support, I would not have improved my reasoning and critical thinking.

I also want to thank my collaborator Dr. Charng-Da Lu for his advice, effort and resources on the projects. The projects would never have been accomplished without his contribution.

I would like to thank Prof. Sathish Gopalakrishnan, Prof. Matei Ripeanu and other colleagues in the Computer Systems Reading Group (CSRG) for their meaningful discussions. Many thanks to my labmates for making the lab an enjoyable place to work. I am also grateful to my dear friends in Canada, China and the United States for their help and encouragement.

Finally, I want to express my deepest gratitude to my family. To me they are the foremost source of unconditional love and support.

Dedication

To my parents

Chapter 1

Introduction

Cloud systems experience failures due to their large scale, heterogeneity and distributed nature. Node failures in the cloud may cause the jobs running on them to abort [29]. Also, applications may experience exceptions such as out-of-memory errors [43] and software bugs.
In the cloud, different applications can share resources, and a lack of resources may lead to performance degradation and potential application failures. One of the main challenges in cloud systems is to ensure the reliability of job execution in the presence of failures [37]. Cloud applications may span thousands of nodes and run for a long time before being aborted, which leads to the wastage of energy and other resources. Under such circumstances, it is necessary to understand the reliability of cloud applications. The long-term goal of our work is to enhance the reliability of the cloud platform and the applications running in the cloud.

Many prior studies on large-scale system reliability focus on hardware/software failures and their causes [15, 16, 49]. While these are valuable, they do not provide much insight into the failures experienced by end users. Application failures have been analyzed in popular systems such as Hadoop [25, 43] and distributed scientific workflows on Amazon EC2 [47]. However, these studies are limited to MapReduce or scientific computations, and are difficult to extrapolate to generic clouds. They do not correlate the failures with properties of applications and clouds. This leads to the first research question: what are the characteristics of job failures in a production compute cloud?

From the perspective of the underlying cloud infrastructure provider, predicting whether a certain application will ultimately fail is important for resource savings. Currently, most cloud failures are detected only when they actually happen. Some prior studies on large-scale system reliability have focused on finding correlations between the resource consumption and the failure behaviour of applications to build predictors. Ganesha [35] assumes that fault-free nodes in MapReduce have similar behaviours, and that a deviation from this behaviour indicates a failure. Williams et al.
[50] find that a fault possibly manifests as unstable performance behaviours before a failure occurs, thus enabling failure prediction techniques. Besides these algorithms, Ren et al. [43] find that most task failures in clouds result from out-of-memory exceptions, and that job failures in a commercial cloud trace are mainly caused by task failures. While these techniques are useful, none of them can deal with the scale and heterogeneity of large-scale clouds. This leads to the second research question: what are good predictors for application failures in a large-scale compute cloud system?

1.1 Proposed Approach

In this thesis, we select a representative cloud system from Google as the foundation to address these problems. The Google cluster workload trace [41] contains the workload measurements of more than 12,000 nodes during a one-month period. The jobs in the trace range from single-task jobs to multi-task computations [39, 40]. The jobs consist of production jobs, e.g., web services, and batch jobs that perform computations and finish.

To answer the first research question, we conduct a failure analysis of the Google cluster workload trace. In particular, our goal is to understand the characteristics of job failures in order to improve the dependability of the underlying cloud infrastructure from the perspective of cloud providers. Further, we want to explore the potential for failure prediction and anomaly detection in cloud applications in order to avoid wastage of resources by managing the jobs that ultimately fail. Finally, we would like to understand the effect of job
We consider the following four aspects in ouranalysis of cloud failures: (1) application factors, which are the programs, number of tasksin a job, and the job owners; (2) cloud factors, which are node failures and maintenances;(3) configurations, which include scheduling constraints, and the policy on how many timesa failed task can be resubmitted; (4) real-time execution status, which means runtime CPUand memory resource usage. While the Google dataset provides comprehensive logs of eachjob’s resource consumption and failures in the month-long period, it hides specifics aboutthe nature of the job, as well as the physical nodes that the jobs are running on (due toprivacy reasons). Therefore, we cannot factor these into our analysis.To answer the second research question, we propose a prediction technique for cloudsystems that makes use of the resource usage measures of workloads, to predict job andtask failures. The main challenge of using the resource usage time series is to discoverfeatures that are indicative of job or task failures. Unfortunately, it is difficult to extract thefeatures directly from the time series data [51]. Instead, we use Recurrent Neural Networks(RNNs) [27, 45] to learn the temporal characteristics of the resource usage measures suchas CPU and memory usage. Then we combine trained RNNs with various job/node/userattributes to predict job failures.In the Google cluster trace, we predict if a certain application will ultimately fail, with-out identifying the underlying reasons, from the perspective of the underlying cloud infras-tructure provider. In particular, we do not distinguish the reasons behind the occurrencesof application failures, namely performance reasons (e.g., lack of resources) and reliability31.2. 
Contributionsrelated hardware/software/network reasons.1.2 ContributionsWe make the following two-fold contributions in the failure characterization and predictionin this thesis.In the failure characterization study, we analyze the statistics of job failures in the Googledataset, and correlate with the features of applications and the cloud, configurations andreal-time statuses. Our study finds that there is significant wastage of resources in theGoogle cluster due to failed jobs. The application failures are affected by job parameters,configurations and real-time status of resources in the cloud. We also find that there aremany opportunities to save resources in the cluster, if that is a desired goal. For example,failure prediction techniques can yield great benefit as they can lead to early terminationof jobs (excluding those for debug/test) that are likely to fail ultimately. We find that suchprediction schemes are not only feasible, but can yield high accuracy even just halfway intoa job’s execution (for long running jobs).We further propose a general framework of failure prediction to solve this problem. Inthe failure prediction, we present a machine learning approach based on recurrent neuralnetworks for predicting job and task failures. We find that the historical information ofjobs from the same user or users affiliated with the same group is essential to achievinghigh prediction accuracy. Our algorithm accurately predicts failures in the selected failedand finished jobs. For example, our prediction achieves a true positive rate of about 40%,and a false positive rate of 6%. Using the prediction results, proactive failure managementtechniques (e.g., restarting or killing jobs) provide 6% to 10% of relative resource savingson average.To the best of our knowledge, we are the first to perform failure characterization in a41.3. Thesis Structurelarge-scale, generic cloud system, from the job and user perspectives. 
Besides, we are the first to predict application (job) failures on the Google cluster dataset and to perform early predictions.

1.3 Thesis Structure

The rest of this thesis is organized as follows. Chapter 2 describes the framework of running applications in the cloud with containers, the target cloud system represented by the Google cluster dataset, and the machine learning techniques used for prediction. Chapter 3 characterizes the application failures from many perspectives. Chapter 4 presents a general failure prediction model, and evaluates the predictor using the Google dataset. Chapter 5 includes a detailed discussion of the implications and limits of this characterization study, and of the designs of the predictors. Chapter 6 surveys previous research related to this work. Chapter 7 concludes the thesis and proposes future directions.

Chapter 2

Background

This chapter first discusses the virtualization techniques deployed in cloud platforms such as the Google cluster (Chapter 2.1). Then the Google cluster trace is introduced, and the basic facts about the application failures are explained (Chapter 2.2). Finally, it discusses candidate machine learning techniques for building prediction approaches (Chapter 2.3).

2.1 Virtualization in the Cloud

A cloud platform for applications includes isolated layers from bottom to top: physical servers, middle layers (e.g., network), virtualization (e.g., hypervisor), and applications. A recent study has found that virtual machines have lower failure rates and lower recurrent failure probabilities than physical machines [6]. Normally, applications running in VMs benefit from the isolated structure and the reliability of VMs.

In a traditional VM, an entire guest OS is compulsory, which may weigh tens of GB. In comparison, the size of each virtualized application may range from only tens of MB to a few GB, which is much less than the size of the OS. To support the applications, the necessary binaries and libraries are also needed.
Containers are a form of lightweight virtualization which packages the application and its libraries without requiring a guest OS. At Google, various services from Search to Gmail are packaged and run in Linux containers, and more than 2 billion container instances are launched every week.¹ Figure 2.1 shows the general structures of VMs and containers.

Figure 2.1: General structures of VMs and containers. (a) VM (b) container

The container comprises just the application and its dependencies. For example, Docker [3] runs each container as an isolated process in userspace on the host OS, sharing the kernel with other containers. Thus, resource isolation and allocation are achieved, and this scheme is more portable and efficient. A container that adopts kernel-supported operating system virtualization is more efficient than typical hardware virtualization.

¹ http://googlecloudplatform.blogspot.ca/2014/06/an-update-on-container-support-on-google-cloud-platform.html

2.2 Google Dataset

The Google cluster workload traces [41] are one of the first and few publicly available traces from large cloud systems (about 12,500 compute nodes over 29 days). The jobs consist of production jobs (e.g., web services), and batch jobs that perform computations and finish. In the trace, every job contains the job name, its resource requirements and the number of tasks in it. A job consists of at least one task, and each task is also constrained by scheduling and resource usage limits. These constraints and limits are present in the trace. Resource isolation and usage measurement are achieved by setting up separate Linux containers for different tasks. Around 670,000 jobs and 26 million tasks are logged in the trace.
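As an illustration of working with such event logs, a terminal-event tally per job can be computed in a few lines. The CSV layout and named event types below are simplified stand-ins for illustration, not the trace's real schema (which encodes event types numerically):

```python
import csv
import io
from collections import Counter

# Simplified stand-in for a job-events file: job id, timestamp, event type.
sample = """\
job_id,time,event
1001,50,SUBMIT
1001,60,SCHEDULE
1001,900,FINISH
1002,55,SUBMIT
1002,70,SCHEDULE
1002,400,FAIL
1003,58,SUBMIT
1003,80,SCHEDULE
1003,300,KILL
"""

# The five terminal event types described in this section.
TERMINAL = {"FINISH", "FAIL", "KILL", "EVICT", "LOST"}

def termination_counts(f):
    """Count jobs by their terminal event type."""
    return Counter(row["event"] for row in csv.DictReader(f)
                   if row["event"] in TERMINAL)

counts = termination_counts(io.StringIO(sample))
print(counts)  # tally of terminal events
```

The same tally over the full trace would yield the per-status counts reported in this chapter.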
Figure 2.2 shows the infrastructure of the clusters.

Figure 2.2: General infrastructure

The dataset contains the following periodically profiled resource usage metrics: CPU usage (average and peak), memory usage (canonical, assigned, and peak), page cache (unmapped and total), disk I/O time (current and peak), disk usage, cycles per instruction, and memory accesses per instruction. All these measurements are normalized by the respective maximum values measured. Table 2.1 describes the metrics of collected resource usage in the dataset.

Table 2.1: Collected resources in the Google dataset.

mean CPU: mean CPU usage (core-seconds per second) in a 1s window
max CPU: maximum CPU usage (core-seconds per second) in a 1s window
mean memory: mean canonical memory usage measurement
assigned memory: memory usage based on the memory actually assigned to the container
unmapped page cache memory: Linux page cache (file-backed memory) not mapped into any userspace process
page cache memory: total Linux page cache
max memory: the maximum value of the canonical memory usage measurement observed
mean disk I/O time: mean of the sum across all disks (disk-time seconds per second) on the machine
max disk I/O time: maximum of the sum across all disks (disk-time seconds per second) on the machine
mean local disk space: mean runtime local disk capacity usage
CPI: cycles per instruction, collected from processor performance counters
MAI: memory accesses per instruction, collected from processor performance counters

A job or task has several possible termination statuses (called "event types" in the trace). These are: (1) evicted, (2) killed, (3) failed, (4) finished, and (5) lost. Evicted means that the system is unable to satisfy the job or task's resource requirements, and hence the job or task is not scheduled. Killed means that the job or task was killed either by the user or by the system administrator, or that a job on which it depends terminated abnormally. Failed means that the job or task did not finish execution, and was terminated by an exception or abnormal condition. Finished means that the job or task completed execution successfully. Lost means that the record indicating the job termination is missing. Table 2.2 shows the numbers of jobs and tasks with each termination status.

Table 2.2: Numbers of job and task events.

Event             Jobs    Tasks
Total terminated  667791  48737884
Failed            10118   13825994
Finished          385439  18207622
Killed            272190  10292068
Evicted           22      5752056
Lost              16      8754

Among the above five types, finished jobs are the most frequent (57.6%), followed by killed (40.7%) and failed jobs (1.7%). A job being evicted or lost is a very rare event. Therefore, we focus mainly on finished, killed and failed jobs in our study. In this study, we consider three kinds of failures, as shown in Table 2.3.

Figure 2.3 shows the number of job failures over the one-month period of the trace. An average of 14.6 jobs fail in an hour, and the minimum and maximum are 0 and 177 jobs, respectively. There is also a rough weekly pattern in the job failures, with failure counts dipping at roughly weekly intervals, probably due to weekly clean-ups.

Table 2.3: Definitions of job, task and node failures.

Job failure (trace event: job fail): A job is descheduled due to task failures.
Task failure (trace event: task fail): A task is descheduled due to a task failure (e.g., exceptions or software bugs).
Node failure (trace event: machine remove): A node failure leads to the machine being removed from the cluster.
Node failures are clubbed together with node maintenance, as they cannot be distinguished in the trace.

Figure 2.3: Failed jobs in the period of one month

2.3 Failure Prediction Methods

2.3.1 Recurrent Neural Networks (RNNs)

Traditional representative techniques, such as Hidden Markov Models (HMMs) and distribution-based methods, have been applied to time series data in other failure prediction problems [46]. Unlike that data, the Google cluster trace contains a large amount of high-dimensional, noisy data that can have dependencies on prior data segments. These properties make the above techniques a poor fit for the Google cluster data. For example, HMMs assume that no dependencies exist in the time domain. For distribution-based methods, the heterogeneity and the changing mean/variance characteristics make it difficult to model the data. In comparison, recurrent neural networks (RNNs) [45] can capture the temporal relations in the trace. Further, because RNNs are based on feedforward networks with connections between inputs and outputs, they can handle varying lengths in the time domain. Therefore, we use RNNs in our prediction algorithm.

Figure 2.4 shows a typical architecture of a recurrent neural network.

Figure 2.4: Recurrent neural network

Given an input sequence of resource usage x = (x_1, x_2, ..., x_T), the standard RNN calculates a sequence of hidden-layer states h = (h_1, h_2, ..., h_T) and a sequence of outputs y = (y_1, y_2, ..., y_T). The problem is considered as an instance of the general classification problem.
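Such a sequence classifier can be sketched as a NumPy toy. The dimensions, random weights, and the two-class (fail/finish) setup below are illustrative assumptions, not the thesis's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 2 input features (CPU, memory) per time step,
# 8 hidden units, 2 output classes (fail / finish).
n_in, n_hid, n_out = 2, 8, 2
W_xh = rng.normal(scale=0.1, size=(n_hid, n_in))
W_hh = rng.normal(scale=0.1, size=(n_hid, n_hid))
W_hy = rng.normal(scale=0.1, size=(n_out, n_hid))
b_h = np.zeros(n_hid)
b_y = np.zeros(n_out)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_classify(x_seq):
    """Run the recurrence over a (T, n_in) resource-usage sequence and
    return the class probabilities at the final time step."""
    h = np.zeros(n_hid)
    for x_t in x_seq:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)  # hidden-state update
        y_t = softmax(W_hy @ h + b_y)             # per-step class scores
    return y_t

# A toy 5-step sequence of normalized CPU/memory measurements.
probs = rnn_classify(np.array([[0.1, 0.2], [0.2, 0.2], [0.4, 0.3],
                               [0.7, 0.5], [0.9, 0.8]]))
print(probs)
```

With untrained random weights the output is of course meaningless; training adjusts the weight matrices to minimize the objective defined in this section.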
Then the computation follows these iteration equations [45]:

    h_t = H(W_{xh} x_t + W_{hh} h_{t-1} + b_h)        (2.1)

    y_t = softmax(W_{hy} h_t + b_y)                    (2.2)

where W_{xh}, W_{hh} and W_{hy} are weight matrices, b_h and b_y are bias vectors, H is the hidden-layer activation function (e.g., tanh), and softmax maps the output into a probability distribution over the classes.

The objective function of the RNN problem for a single pair (x, y) is f = L(ŷ, y), where L is a distance measure between the prediction ŷ and the target y. Examples of L include the squared error and the edit distance. The overall objective function is the average of the individual objective functions over all data points in the entire set (in practice, over the data of the same user):

    E = (1/N) \sum_{n=1}^{N} L(ŷ_n, y_n)               (2.3)

where N denotes the number of sequences, and ŷ_n and y_n are the prediction sequence and the corresponding target termination statuses.

2.3.2 Ensemble Methods

The Google trace is also extremely diverse in terms of the attributes of its programs, machines and users. Ensemble methods built on single estimators can capture such diversity with robustness. A common choice is to use a tree-based model as the single estimator, with a vector of features [14]. Each estimator can be trained with a random subset of the entire training data.

Representative ensemble methods fall into two families: averaging methods and boosting methods [22]. In the first family, the estimators are built independently and the outcomes of their predictions are averaged; examples are bagging methods and random forests. In the second family, the base estimators are built sequentially, further reducing the bias of the combined estimator; gradient boosting and stochastic gradient boosting are typical methods. Empirically, ensemble methods tend to produce better results when the data has significant diversity.
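The averaging family can be sketched with a tiny bagging ensemble of decision stumps. The data, stump estimator, and parameters below are hypothetical toys for illustration, not the thesis's predictor:

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_stump(X, y):
    """Fit a one-feature threshold classifier (decision stump) by
    exhaustive search over features, thresholds and polarities."""
    best = (0, 0.0, 1, np.inf)  # (feature, threshold, polarity, error)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, j] - t) > 0, 1, 0)
                err = np.mean(pred != y)
                if err < best[3]:
                    best = (j, t, pol, err)
    return best[:3]

def predict_stump(stump, X):
    j, t, pol = stump
    return np.where(pol * (X[:, j] - t) > 0, 1, 0)

def bagging_fit(X, y, n_estimators=25):
    """Averaging ensemble: each stump is trained on a bootstrap sample."""
    stumps = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(X), len(X))  # bootstrap resample
        stumps.append(fit_stump(X[idx], y[idx]))
    return stumps

def bagging_predict(stumps, X):
    votes = np.mean([predict_stump(s, X) for s in stumps], axis=0)
    return (votes > 0.5).astype(int)  # majority vote = averaged prediction

# Toy data: class 1 when the sum of two (hypothetical) usage features is high.
X = rng.random((200, 2))
y = (X.sum(axis=1) > 1.0).astype(int)
model = bagging_fit(X, y)
acc = np.mean(bagging_predict(model, X) == y)
```

Boosting methods differ only in that each successive estimator is fitted to the errors of the ensemble built so far, rather than to an independent bootstrap sample.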
Therefore, we use ensemble methods on top of RNNs for prediction.

Chapter 3

Failure Characterization Study

In this chapter, we present the failure characterization study on the Google cluster dataset. We first characterize the durations of failed, finished and killed jobs, followed by the distribution of job resource consumptions (Chapter 3.1). We then study the task submissions (Chapter 3.2), the effects of scheduling constraints on the jobs (Chapter 3.3), and the effect of node maintenance/removal on failures (Chapter 3.4). Further, we investigate correlations between resource usage and the termination status (success/failure) of a job (Chapter 3.5). Finally, we examine similarities among users and user-centric failure characteristics (Chapter 3.6). The findings are summarized in Chapter 3.7.

3.1 Basic Failure Distributions

We observe that the durations of failed, finished and killed jobs follow a heavy-tailed distribution, as shown in Figure 3.1. The majority of jobs terminate within 2000 seconds of starting (less than an hour), while the longest failed job lasts for 25 days (almost 29 days including the time to be scheduled). In addition to the terminated jobs, around 0.5% of the jobs do not terminate in the trace period, and they are not considered in this study.

Among the distributions we attempted to fit, we find that the log-normal distribution has the best fit for all three job termination types; the parameters are shown in Table 3.1. Our goodness-of-fit criterion is the Kolmogorov-Smirnov (KS) test. Based on this fitting, we find that finished jobs have shorter lengths than both failed jobs and killed jobs, on average. We speculate that there are three possible reasons for this phenomenon. First, many short jobs are consecutively executed by a few users, and the vast majority of these jobs finish successfully. Second, jobs may hang/freeze and thus get killed after running out of the allocated time or resources.
Third, some debug/test jobs can run for a long time before being killed during the development cycle.

Figure 3.1: Distribution of duration of failed, finished and killed jobs

Table 3.1: Fitting of log-normal distribution on job duration. µ and σ are the parameters of the log-normal distribution, and KS is the maximal distance between distributions in the test.

    Type      Mean Duration (Hour)   µ        σ       KS
    Failed    2.297                  -2.785   1.593   0.06
    Finished  0.181                  -3.454   1.555   0.11
    Killed    1.609                  -2.557   1.297   0.06

We also plot the CPU/memory usage of jobs in Figure 3.2. They also have a heavy-tailed distribution, and follow a log-normal distribution with the parameters in Table 3.2. The average CPU and memory consumptions of finished jobs (per second) are around half of those of failed and killed jobs. Overall, the amounts of CPU and memory consumed by failed jobs are 2.5 and 6.6 times those consumed by finished jobs. Therefore, we posit that effective failure prediction strategies to prevent resource wastage are needed.

Figure 3.2: Distribution of normalized CPU/memory usage of failed, finished and killed jobs. The original units of CPU and memory are core-seconds/second and bytes. Both measurements are normalized by the respective maximal measured value.

Table 3.2: Fitting of log-normal distribution on CPU/memory usage.

    Type               Mean Resources   µ        σ       KS
    Failed (CPU)       0.0080           -6.018   1.426   0.059
    Finished (CPU)     0.0047           -6.890   2.088   0.049
    Killed (CPU)       0.0089           -6.247   2.149   0.079
    Failed (memory)    0.0044           -7.203   1.751   0.118
    Finished (memory)  0.0017           -7.702   1.682   0.381*
    Killed (memory)    0.0035           -6.878   1.209   0.078

    * 0.381 is the KS value for the entire distribution; the KS value decreases to 0.082 after removing the biased beginning part.
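The fitting procedure behind Tables 3.1 and 3.2 can be sketched with SciPy; this is a plausible reconstruction of the method described above, not the thesis code:

```python
import numpy as np
from scipy import stats

def fit_lognormal(values):
    """Fit a log-normal by taking the mean/std of log(values), then measure
    goodness of fit with the Kolmogorov-Smirnov (KS) statistic, i.e. the
    maximal distance between the empirical and fitted distributions."""
    logs = np.log(values)
    mu, sigma = float(logs.mean()), float(logs.std())
    # scipy's lognorm is parameterized by shape s = sigma and scale = exp(mu)
    ks = stats.kstest(values, "lognorm", args=(sigma, 0, np.exp(mu))).statistic
    return mu, sigma, ks
```

Called on the job-duration samples of one termination type, this would yield one row of Table 3.1 (µ, σ, KS).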
3.2 Task Resubmissions

During the life cycle of a job, its tasks can be resubmitted and rescheduled multiple times after abnormal terminations, e.g., failures, evictions or being killed. A task can also be re-executed if the user so chooses. We examine the effects of task resubmission on the termination statuses of tasks and jobs.

In the entire dataset, the ratios of jobs with tasks that execute multiple times are 35.8%, 0.9% and 14.1% for failed, finished and killed jobs, respectively. As expected, this ratio is low for finished jobs, which rarely have failing tasks and hence do not need to re-execute them. Across all categories, we observe that around 76% of the jobs have tasks re-executed at most 4 times. Some systems such as Hadoop MapReduce have limits on the number of task resubmissions, or the user sets a limit on the resubmissions. However, we speculate that in the Google cluster there is no system-wide limit on resubmission, nor are users mandated to set such limits, and hence it is possible for resubmitted tasks to fail over and over again.

Figure 3.3 shows the CDF of task resubmissions on single-task and multi-task jobs for each of the job types. The percentages of failed, finished and killed jobs that consist of multiple tasks (i.e., multi-task jobs) are 17.8%, 4.24% and 53.3%, respectively. In terms of the average number of task resubmissions, the finished jobs have the smallest value, followed by killed jobs and failed jobs. In each category of jobs, we observe that multi-task jobs have more resubmissions on average than their single-task counterparts. Moreover, only 0.3% of finished jobs submit tasks more than 10 times, while around 9.5% of failed jobs do. We also observe that the maximum number of task resubmissions in a killed job can be as high as 9062 (single-task) and 1417 (multi-task). In comparison, the maximum numbers of task resubmissions in failed and finished jobs are around 400 and 150, respectively.
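The per-job resubmission counts discussed above (the x-axis quantity of Figure 3.3) can be derived with a simple aggregation; the (job_id, task_index) event format here is hypothetical:

```python
from collections import Counter

def max_resubmissions(submission_events):
    """Given one (job_id, task_index) pair per task submission, return, for
    each job, the maximum number of submissions over all of its tasks."""
    per_task = Counter(submission_events)       # submissions per (job, task)
    per_job = {}
    for (job_id, _task), n in per_task.items():
        per_job[job_id] = max(per_job.get(job_id, 0), n)
    return per_job
```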
We speculate that such excessive task re-executions are not useful (except when debugging or testing), and that lengthy failed or killed jobs can be preemptively stopped before more resources are wasted.

Figure 3.3: Numbers of failed, finished and killed jobs with task re-executions. The x-axis is the maximum number of resubmissions over all tasks in a job. A value of 1 means that all tasks in the job are executed once. The y-axis is the cumulative distribution function (CDF).

3.3 Scheduling Constraints

We also examine jobs by their scheduling criteria and constraints, as they may serve different purposes and have divergent behaviours. Jobs and tasks are assigned scheduling classes based on urgency, latency sensitivity, and resource access policies. Production jobs and latency-sensitive jobs are likely to be in a higher scheduling class, while non-production or non-latency-sensitive jobs are likely to be in the lowest class.

Figure 3.4 shows the distribution of failed, finished and killed jobs for different scheduling classes. The proportion of killed jobs varies across scheduling classes, with the non-latency-sensitive class 0 having the highest number of killed jobs (similar results have also been observed in prior work [30]). However, we find that the ratio of failed jobs to finished jobs is steadily low across most of the scheduling classes, and hence scheduling class does not correlate with job failures.
This implies the need to find factors other than scheduling class to characterize failures.

Figure 3.4: Failed, finished and killed jobs in different scheduling classes

A more fine-grained categorization of scheduling attributes is the task priority, which determines the nodes assigned to the task and the turnaround on the task. The priority is associated only with tasks, not with jobs. The priorities are grouped into five classes [40] ranging from 0 to 11, as described in Table 3.3.

Table 3.3: Task priorities.

    Priority Number   Task Purpose        Level     Note
    0-1               Free                Lowest    Resources are rarely charged.
    2-8               Batch               Middle    Mainly for batch jobs.
    9                 Normal production   High      Dominant in production priorities; usually latency-sensitive tasks.
    10                Monitoring          High      Monitor the health of other jobs.
    11                Infrastructure      Highest   Storage/disk I/O services.

Normally, all tasks of a job have the same priority. However, 14 out of 925 users have jobs with tasks of two or more priorities. We do not consider such jobs, and group the remaining tasks and their resubmissions by their priorities and termination statuses. The grouping is shown in Figure 3.5.

Figure 3.5: Task submissions by priority and termination status

As seen in Figure 3.5, a large number of low-priority tasks are evicted, possibly due to the overcommitment of resources. We also see a large number of task failures in the two lowest priorities. A noticeable percentage of tasks of the highest priority are killed, likely because either the requirements of the tasks are not easily fulfilled or because they have hard real-time constraints. In contrast, most of the middle (batch) priorities do not have many tasks that terminate abnormally, and they have the highest average ratio of finished jobs among the three priority groups.
This is because middle-priority jobs tend to be batch jobs, and we speculate that they often perform routine tasks, and are hence less likely to fail.

The above graph includes task resubmissions, and may hence be biased towards low-priority tasks that have a lot of submissions. To counteract the effect of resubmission, we plot the distribution of tasks after discarding resubmitted tasks in Figure 3.6. As before, we observe a high number of failed tasks in low and high priorities. The ratio of failed tasks in the low priorities is lower, but it is still three times the average failure ratio in the middle priorities. This shows that, even ignoring resubmissions, both low- and high-priority tasks are vulnerable to evictions, kills, and failures.

Figure 3.6: Task submissions by priority and termination status (excluding resubmissions)

3.4 Node Failures

Node dependability is an important aspect in understanding the overall dependability of the cloud. However, studying node dependability becomes quite complicated when virtualization allows multiple containers to share a common physical node. Failures may occur due to faults in the hardware, or in container software/instances. Unfortunately, the Google trace does not provide design details of the physical machines and containers, and hence it is not possible to trace the reliability issues to either physical nodes or containers. However, we can measure: (1) the availability of machines from the perspective of users, and (2) the effects of physical node/container reliability on overall task reliability.

In the Google cluster, each machine is identified by a unique ID. Machines may be added to and removed from the cluster due to either maintenance or failures. We call the period ranging from an "add machine" event to a "remove machine" event (or to the end of the trace) a machine cycle.
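The machine-cycle bookkeeping just defined, together with the availability computation of Section 3.4, might be sketched as follows; the (timestamp, machine_id, event) record format is an assumption, not the trace's actual schema:

```python
def machine_cycles(events, trace_end):
    """Build per-machine cycles from time-ordered "add"/"remove" events.
    A cycle spans an add event to the next remove event, or to trace_end."""
    open_since, cycles = {}, {}
    for ts, mid, ev in events:
        if ev == "add":
            open_since[mid] = ts
        elif ev == "remove" and mid in open_since:
            cycles.setdefault(mid, []).append((open_since.pop(mid), ts))
    for mid, ts in open_since.items():  # cycles still open at the end of the trace
        cycles.setdefault(mid, []).append((ts, trace_end))
    return cycles

def availability(cycles, trace_start, trace_end):
    """Uptime over total observed time, aggregated over all machines,
    treating the in-cycle periods as uptime and the rest as downtime."""
    uptime = sum(end - start for spans in cycles.values() for start, end in spans)
    total = (trace_end - trace_start) * len(cycles)
    return uptime / total
```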
A machine may have multiple cycles if it is repeatedly removed and added.

Figure 3.7 shows the CDF of the number of machine cycles in the trace. 59.1% of all the machines are never removed from the cluster, and 27% are removed exactly once. More than 99% of machines have fewer than 6 machine cycles. However, there are some machines that have a high number of cycles. For instance, one machine is removed and added 165 times, and hence has 165 cycles.

Figure 3.7: Distribution of machine cycles.

To calculate the availability of the cluster, the period within a cycle is regarded as uptime of the system, and the period between two cycles as downtime. We aggregate the total uptime E[total uptime] and total downtime E[total downtime] over all machines, and define the availability as the ratio of total uptime to the entire time:

    Availability = E[total uptime] / (E[total uptime] + E[total downtime])    (3.1)

We find that the Google cluster has an average availability of 99.82% across all nodes.

To better understand the influence of machine cycles on task failures, we compute the correlation between the average ratio of failed tasks and the number of machine cycles. Only 7 out of the 12,500 machines have more than 30 cycles, and we regard these as outliers. For the remaining machines, we plot the average ratio of failed tasks in each machine cycle in Figure 3.8. We calculate the Pearson correlation coefficient by comparing the average failed-task ratio with the number of machine cycles. The correlation coefficient is −0.52 (p-value = 0.003), suggesting a medium negative correlation between the ratio of failed tasks and the number of machine cycles. We speculate that machine rejuvenation (removals/additions) may be the cause of the lower ratio of failures.

Figure 3.8: Average ratio of failed tasks vs.
machine cycles. The x-axis is the machine cycle number. For cycle number k, data points in the kth cycle are chosen from the machines that have no fewer than k cycles. The y-axis is the ratio of failed tasks.

A machine can also update its available resources and configurations during its life cycle. We plot the average task failure rate in a machine cycle against the number of machine removals in Figure 3.9, with the machines with frequent updates plotted in red. As can be seen, frequent updates (more than 5 times) occur only on machines with few life cycles (fewer than 8). The task failure rate is at a relatively low level on these nodes despite the fact that removals are not frequent. This is likely because machines that were removed were updated while they were offline. We speculate that updating may be another strategy to enhance reliability, in addition to machine removals.

Figure 3.9: The failure rate in a machine life cycle. The y-axis is the mean failure rate in a machine life cycle, and the x-axis is the number of machine removals. Every machine life cycle is represented by a blue dot, and the machines having more than 5 updates are plotted in red.

3.5 Resource Usage

In this section, we explore the correlations between job resource usage and job failures. We perform the analysis at two levels, i.e., the job level and the task level.

3.5.1 Usage at the Job Level

We focus on CPU and memory usage in these experiments. Figure 3.10 shows the CPU and memory usage in failed and finished jobs under a combination of three factors, namely priority class, single/multi-task jobs, and job length. The priority classes considered are batch, free and production. The jobs are classified into single-task and multi-task jobs. For the multi-task jobs, the average resource usage is divided by the number of tasks, to ensure that the usage is not skewed by the number of tasks.
We classify jobs based on their length as follows: short jobs are shorter than 10 minutes; medium-length jobs are between 10 minutes and 1 hour; long jobs are longer than 1 hour. The combinations of the different options add up to a total of eighteen categories.

Figure 3.10: CPU and memory usage of jobs under different combinations of job and scheduling parameters ((a) batch priority jobs; (b) free priority jobs; (c) production priority jobs).

In all three priority groups, the multi-task jobs generally have higher average resource consumption per task than their single-task counterparts. Because we normalize this on a per-task basis, the difference is not due to the higher number of tasks. Further, the difference is most marked between single-task and multi-task jobs that are medium-length or long.

We also observe differences between the memory usage of failed versus finished jobs. In general, failed jobs consume slightly more memory than finished jobs in 14 of the 18 categories. Only 4 of the 18 categories are different: multi-task short batch jobs, single-task medium/long production jobs, and multi-task short production jobs. In contrast to memory usage, CPU usage does not vary as much between failed and finished jobs. Of the 18 categories, 12 have similar average memory usage, i.e., the ratios of the usage in failed jobs to that in finished jobs range from 0.5 to 2. Seventeen of the categories have similar average CPU usage.
An exception is the short batch group, which accounts for almost one-third of the total jobs; its failed jobs have lower CPU usage than its finished jobs.

To further remove the effects of the heterogeneity among jobs and priority groups, we study the variations among jobs that are resubmitted. Some resubmitted jobs fail, while others finish successfully, and we examine differences in their resource consumptions. The Google manual says that restarting a job will usually generate a new job ID but keep the same job name and user name [41]. So we select all unique jobs with the same job names, and compare the resources consumed by the failed executions and their finished counterparts. Figure 3.11 shows the differences between failed and finished jobs that are resubmitted. Failed jobs are found to generally have lower CPU and memory usage than their finished counterparts in all three priority groups. Specifically, around 60% to 83% of failed jobs have lower resource usage, i.e., fall to the left of the red line, in all priorities. For the rest of the jobs, the failed executions consume more resources, but the ratios are still close to 1. However, a tiny portion of jobs contradict this observation, contributing to the long tails in the distributions. One noticeable example is a series of jobs from one user repeating for more than 27 days, in which the resource consumption of the failed jobs is 1000 times that of the finished jobs.

3.5.2 Usage at the Task Level

At the task level, we mainly consider task executions (submissions). We separately gather the resource usage of failed task executions and finished task executions.

Figure 3.11: Ratios of resource (CPU and memory) consumed by failed jobs to that consumed by finished jobs in resubmitted jobs.
The resource is calculated by dividing the total job resource by the number of tasks. The x-axis shows the ratio of the average resource in failed jobs to that in finished jobs, and the y-axis represents the cumulative distribution function (CDF). The vertical red line shows where the ratio equals 1.

Different from comparing jobs, all task executions are compared within the same job. The resource usage data are normalized by dividing by the maximum value of each resource measurement. To determine whether resource usage samples from two kinds of task executions are significantly different, we use the Mann-Whitney U (rank-sum) test [32], and measure the p-values of the comparison. A p-value less than 0.05 indicates that the two samples differ significantly. The rank-sum test does not require samples to be normally distributed, and is hence more widely applicable than other statistical tests.

We select the jobs containing failed, finished and killed tasks, and study their CPU and memory consumption as inputs to the test. For example, all CPU usage samples in failed and finished executions are the inputs to the rank-sum test that checks whether the distributions of the executions differ significantly in CPU usage. We calculate the p-values of rank-sum tests on normalized resources between each pair of categories. Figure 3.12 shows the results.

Figure 3.12: p-values of normalized resource usage in the rank-sum tests of each pair of categories ((a) failed and finished executions; (b) failed and killed executions; (c) killed and finished executions).

Figure 3.12a shows the result of running the rank-sum test between failed and finished task executions, for each category. We find that 54.8%, 34.8%, and 93.2% of the p-values are smaller than 0.05 in the free, batch, and production priority classes, respectively.
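The per-job comparison described above can be sketched with SciPy's implementation of the Mann-Whitney U test (a sketch of the procedure, not the thesis code):

```python
from scipy.stats import mannwhitneyu

def usage_differs(failed_samples, finished_samples, alpha=0.05):
    """Rank-sum comparison of normalized resource-usage samples from failed vs.
    finished task executions of one job. Returns (p_value, significantly_different),
    using the same p < 0.05 criterion as the text."""
    _, p = mannwhitneyu(failed_samples, finished_samples, alternative="two-sided")
    return float(p), p < alpha
```

Running this once per job, for each resource measure, yields the p-value populations plotted in Figure 3.12.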
This result implies that most of the production jobs have vastly different resource usage between failed and finished task executions. The batch and free priority classes also have significant differences in the CPU/memory usage of tasks in a large portion of jobs.

Figures 3.12b and 3.12c depict how the killed task executions differ significantly from the failed and the finished tasks. The average ratios of low p-values in the failed/killed test and the finished/killed test decrease to around 34% and 28% over all priorities. Production jobs experience the largest drops, to 43% and 23%, respectively.

Early Failure Prediction: The above analysis shows that there are significant differences in resource consumption between failed and finished jobs, suggesting that failure prediction techniques can leverage these differences. We now examine how early in a job's lifetime these differences manifest, for jobs that exhibit such differences. To explore the possibility of early failure prediction, we examine the differences between failed and finished executions before termination. Only jobs longer than 10 minutes are selected, as we believe that shorter jobs are unlikely to benefit from early prediction. Figure 3.13 shows the ratios of tests that have a p-value of no more than 0.05, where samples are collected from the beginning to 50%, 80% and 90% of the total execution time of the job.

Figure 3.13: p-values of normalized resource usage in the rank-sum tests of failed and finished executions. The resources from the beginning to 50%, 80% and 90% of the running time in jobs longer than 10 minutes are collected.

We find that the ratio of jobs with small p-values (< 0.05) for all three priorities remains steady from 50% of the job's execution right until the very end.
The decrease in accuracy at half-time compared to the full execution time is negligible. Therefore, for jobs that exhibit differences in their resource consumption, the differences in consumption are significant even halfway into the job.

3.6 User-centric Analysis

The goal of this analysis is to identify user-specific features that may be correlated with job failures. In the trace, around 700 out of all 933 users have completed jobs and terminated tasks. Further, 334, 397, and 670 users execute failed jobs, finished jobs, and killed jobs, respectively.

We would like to gain more insight into how user behaviors are correlated with the failures, to understand possible reliability issues for different user classes. However, we do not consider the interactions between jobs from different users or job dependencies, as the Google dataset lacks information about the computation represented by jobs/tasks and the dependencies between jobs. Instead, to get an overall understanding, we cluster the users based on the characteristics and termination statuses of the jobs they submit.

We perform K-means clustering [22] on a user feature vector consisting of the ratios of failed, finished and killed jobs and tasks. Regarding a user as a data object, the objective is to find the division of the data such that a data object is similar to other objects in the same cluster but dissimilar to objects from other clusters. A common measure of dissimilarity s is the Euclidean distance between two data points. The Silhouette score [44], an average over all points of a per-point score based on s, evaluates how appropriate the clusters are; a Silhouette score close to 1 indicates an excellent clustering. We vary the number of clusters from 2 to 10, and find that the best Silhouette score of 0.75 is achieved with 6 clusters. The centroids of the 6 clusters and the statistics of jobs and resources in each cluster are shown in Table 3.4.

Table 3.4: K-means clustering on user profiles.
The features for clustering are the ratios of evict, fail, finish and kill events in both jobs and tasks. The statistics of job attributes and resource usage are averages, and # represents the number of a variable. In the user attributes, a user is called a production user if production jobs account for more than 20% of all its jobs.

    Cluster  Job ratios (Evict/Fail/Finish/Kill)   Task ratios (Evict/Fail/Finish/Kill)   #Job     #Tasks per Job   Length    CPU       Memory    #Production User   #User
    C1       0 / 0.0604 / 0.0633 / 0.8763          0.0705 / 0.0554 / 0.0644 / 0.8097      443      536.8            790.98    0.00076   0.00263   56                 224
    C2       0 / 0.0498 / 0.316 / 0.6341           0.0645 / 0.0398 / 0.6816 / 0.214       238.79   1525.85          1035.57   0.01376   0.0039    2                  184
    C3       0 / 0.425 / 0.0741 / 0.5009           0.0597 / 0.7307 / 0.0499 / 0.1597      281.18   227.81           4448.71   0.0034    0.00713   16                 63
    C4       0 / 0.0444 / 0.8367 / 0.1188          0.0439 / 0.0441 / 0.8012 / 0.1108      18.43    2755.82          705.96    0.00545   0.00526   2                  84
    C5       0.0079 / 0.1846 / 0.0664 / 0.741      0.7 / 0.0559 / 0.08 / 0.164            349.6    69.74            753.32    0.0019    0.00614   13                 126
    C6       0 / 0.0395 / 0.8127 / 0.1477          0.1221 / 0.2946 / 0.1818 / 0.4015      28.82    952.47           664.8     0.00176   0.00998   5                  193

In Table 3.4, the clusters have the following properties.

1. Users in Cluster 1 have more than 87% of their jobs killed, and the ratio of killed tasks is as high as 81%. Further, these users submit the largest average number of jobs among all 6 clusters.
2. The three clusters with many jobs executed are Clusters 2, 3 and 5. The corresponding ratios of killed jobs are greater than 50%, while the ratios of killed tasks are low in these clusters.
3. Users in Cluster 3 have the longest median job length of 4448 minutes, and 42.5% of these jobs fail at the end. This leads to potentially significant wastage of resources.
4. Cluster 4 has the highest ratio of finished jobs/tasks among all the clusters, at about 83%. Further, each job has a median number of 2755 tasks in this cluster.
5. Cluster 5 contains the most evicted jobs and tasks among all clusters.
6. The sixth cluster has a balanced ratio between failed jobs and finished jobs.

Table 3.4 also shows the resource usage attributes of each cluster. The resource usage statistics concern the average CPU and memory usage measurements per unit time. CPU usage is found to be positively correlated with the ratios of finished jobs/tasks, but no obvious pattern applies to memory usage. We also inspect user behaviour in executing production jobs and batch jobs, and select the users who run more than 20% production jobs over the entire period. The vast majority of these users are grouped into Clusters 1, 3 and 5; Cluster 1 in particular has about 60% of the production users. This implies strong correlations between user behaviours in submitting jobs and the outcomes of application reliability. This similarity can potentially be leveraged by anomaly detection and failure prediction systems.

3.7 Summary

Our major findings are as follows:

• In the Google cluster workload traces, there is a significant consumption of resources due to failed and killed jobs.
• Job and task failures manifest differently with respect to job and cloud attributes.
  – Task resubmissions in failed jobs are much higher than those in finished jobs on average.
  – Both low- and high-priority jobs experience on average 3 times as many failures as jobs of other priorities.
  – Node maintenance and updates are correlated with smaller ratios of task failures on nodes.
• Differences in resource consumption exist between task submissions of failed and finished jobs. For jobs with multiple task submissions, at least 34.8% of the jobs have significant differences between the resource consumptions of failed and finished tasks. In most cases, the differences exist just halfway into a job's execution (for long-running jobs).
• User profiles can be clustered into 6 dominant groups, and these groups are correlated with job failures.

Based on these findings, we propose implications for the design of reliable cloud systems (Chapter 5.1).
Along with the implications, we also discuss the threats to validity of the failure characterization study (Chapter 5.2).

Chapter 4

Failure Prediction

In this chapter, we introduce a general framework for application failure prediction in compute clouds (Chapter 4.1). We describe the design details of the predictor for the Google cluster trace (Chapter 4.2). Then we use the Google cluster trace to evaluate the proposed prediction method: we explain the experiments (Chapter 4.3) and analyze the results (Chapter 4.4). Finally, the prediction work is summarized in Chapter 4.5.

4.1 Failure Prediction Framework

In this section, we introduce the general framework for the prediction problem, as shown in Figure 4.1.

Figure 4.1: General framework of prediction

The framework consists of four stages: (1) monitoring and storing the system and application metrics, (2) processing the data into structured formats containing their spatial and temporal information, (3) predicting the failures using machine learning techniques, and (4) failure remediation management based on the prediction results. The monitoring module is not at the center of this reliability study; additionally, in existing cloud traces the system and application metrics are already provided. Therefore, we focus on the data processing (2nd) and prediction (3rd) stages. We defer failure remediation based on prediction results to future work.

Data Processing: The goal of this stage is to formulate the collected performance data into layered application-centric structures, which are required by the machine learning models. For example, in the Google cluster trace, the original data tables of system and application metrics cover task resource usage measures and various attributes of the jobs, tasks, nodes and users in separate files. To integrate the data, we join the table files of system and application metrics.
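The join just described might look as follows with pandas; the column names (job_id, task_index, and so on) are illustrative, not the trace's actual schema:

```python
import pandas as pd

def build_task_level_data(task_usage, task_attrs, job_attrs):
    """Join per-task usage rows with task attributes and job attributes, so
    that every usage row carries the job/task context needed for prediction."""
    merged = task_usage.merge(task_attrs, on=["job_id", "task_index"], how="left")
    return merged.merge(job_attrs, on="job_id", how="left")
```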
Each job is associated with the performance data of all its tasks, the job/task/node/user attributes, and the failure data (or termination status, as described in Chapter 2.2). The resulting spatial/temporal data have a two-level format: (1) job-level structured data with the job termination status as the classification target, and (2) task-level structured data with the task termination status as the classification target. At the task level, the resource usage data are organized in chronological order.

Failure Prediction: This stage predicts the termination statuses of tasks and jobs, taking the two-level temporal/spatial data as inputs. Figure 4.2 describes the modules in this stage. The job modelling module trains the predictor, which is composed of RNN-based estimators extracting temporal features at the bottom, and ensemble methods combining the single estimators at the top. In the test phase, the predictor can be trained on jobs from either all users or one user. Then, in the job-level prediction module, the termination statuses of a job and its tasks are predicted. After a certain period (e.g., 1 day), the model is retrained on all recent data in the parameter update module.

Figure 4.2: Prediction modules

Algorithm 1: Prediction Framework
    Input: Two-level data of jobs
    Output: Termination statuses of the jobs/tasks
    select the ensemble predictor;
    foreach job do
        select the predictors;
        foreach task in job do
            extract task features/usage time series;
            predict the task termination status;
        end
        generate job feature vector;
        predict the job termination status;
    end

Applying RNN: The traditional RNN has a serious drawback for data with long-term dependencies: the error signals can decay exponentially as they are back-propagated through time, which leads to long-term signals being effectively lost as they are overwhelmed by un-decayed short-term signals. To overcome this issue, we need to
We use Hessian-free optimization [31] to train the temporal connections between hidden states. In this way, we model the long-term dependencies of resource measurements on prior measurements, and better capture the temporal characteristics of resource usage within an application, particularly for long-running ones.

Prediction of Jobs  Prediction is conducted for each job, and the goal is to identify its termination status. Algorithm 1 describes the prediction algorithm.

4.2 Design of the Predictor

Requirements  To meet the goal of running the failure predictors online on production cloud clusters, predicting the failures cannot be time-consuming. More than that, a good predictor should leave enough time for the failure management module to take proactive actions that mitigate the effects of failures. The predictor should also guarantee adequate prediction accuracy, as many false positives are unacceptable in a failure predictor for a reliable cloud system. In addition, the predictor should automatically generate failure reports to improve efficiency. Moreover, the cloud providers and the users who run jobs on the cluster are expected to state their reliability expectations; for example, a user who expects conservative actions on mitigating or descheduling jobs should probably accept occasional job failures. Thus, the predictor should be customizable, and we provide the opportunities to fulfill this.

Finer-Grained Selection of Data  One of the challenges in the prediction is the heterogeneity of the workloads.
To reduce the variance between jobs in a category, we divide the entire trace into multiple categories based on the following criteria:

• priority: batch, free (i.e., separated from batch for reliability reasons), and production.
• job length: short (shorter than 10 minutes), medium (10 minutes to 1 hour), and long (longer than 1 hour).
• task number: single-task jobs, and multi-task jobs.

In terms of priority, the number of production jobs is much smaller than that of batch (priority free included) jobs. In addition, because some production jobs that consume a lot of resources do not terminate within the monitoring period, we only consider the batch jobs for the prediction. In terms of job length, the number of long/medium jobs is one quarter of the number of short jobs, but the long/medium jobs consume much more resources. In terms of task number, the number of single-task jobs is 3 times that of the corresponding multi-task jobs, while the multi-task jobs consume much more resources on average. Within the batch jobs, we select the four categories with the highest resource consumption as the candidates for evaluation: the multi-task, batch/free, medium-length/long jobs. They consume more than 95% of the resources of all batch (and free) jobs.

Neural Network Setup  In the training of the neural networks, we do not use all the resource usage measures as inputs, but limit ourselves to 5 popular measures: mean CPU usage, mean memory usage, unmapped page cache, mean disk I/O time, and mean disk usage. Each measure is represented by a class in the input sequences, so the inputs have 5 classes of measures at any single time point. The original sampling intervals range from a few seconds to a few minutes. Therefore, we choose time ranges of 15 seconds, 1 minute and 5 minutes, and average the resource usage measurements within these ranges. For the target sequences, we consider the task termination statuses in the failed and finished jobs.
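The selection criteria and the input resampling described here can be sketched as follows. The thresholds (10 minutes, 1 hour) follow the text; the function signatures are illustrative, not the trace's actual record layout:

```python
# Sketch of the finer-grained job categorization and of the fixed-interval
# averaging used to prepare the neural-network input sequences.
def categorize(priority, length_sec, num_tasks):
    """Return (priority class, length class, task-number class) for a job."""
    if length_sec < 600:            # shorter than 10 minutes
        length = "short"
    elif length_sec <= 3600:        # 10 minutes to 1 hour
        length = "medium"
    else:                           # longer than 1 hour
        length = "long"
    tasks = "single-task" if num_tasks == 1 else "multi-task"
    return priority, length, tasks

def resample(series, interval):
    """Average (timestamp, value) samples into fixed-width bins, as done for
    the 15 s / 1 min / 5 min input time ranges."""
    bins = {}
    for ts, value in series:
        bins.setdefault(ts // interval, []).append(value)
    return [sum(vs) / len(vs) for _, vs in sorted(bins.items())]
```

Each of the 5 resource measures would be resampled this way, yielding aligned per-measure sequences at the chosen granularity.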
To represent the severity of task events, we assign weights of 1, 2, 3, and 4 to the categories finished, evicted, killed, and failed, respectively. Note that task failures are labelled with 4, as they have the highest severity.

In the ensemble method, the termination statuses generated by the RNN models are taken as input features, and the outputs are the termination statuses of the jobs. In addition to randomly selecting the subset of data for training, we use the subsets of one user or of similar users.

4.3 Experimental Setup

The traces are originally stored in comma-separated-value files of approximately 200GB, and the data attributes are represented by key-value pairs. We read these traces into a MySQL database for ease of analysis. Due to the large scale involved, we deploy the databases on Amazon Web Services (AWS) [1] for queries. In the prediction experiments, we join these tables, and store the transformed two-level data of the job traces. The job failure prediction module leverages machine learning packages in Python [4, 5] for the prediction. We evaluate the performance overhead of the prediction on a 12-core machine that consists of two 6-core Intel Xeon E5-2630 processors running at 2.30 GHz each.

Evaluation Metrics  The predictors are evaluated in the following aspects:

Prediction coverage: The target jobs include the long jobs and part of the medium jobs, especially in the categories of heavy resource consumption.
Predicting failures of long jobs can yield higher benefits.

Prediction times: The prediction should be conducted early, so that proactive actions can be taken.

Prediction metrics: We define a good predictor as one generating a high true positive rate (TPR) and a low false positive rate (FPR):

TPR = (# successful failure predictions) / (# failures)    (4.1)

FPR = (# finished jobs predicted as failures) / (# finished jobs)    (4.2)

For two-class classification problems, sensitivity and specificity are the statistical measures of performance:

sensitivity = TPR    (4.3)

specificity = 1 − FPR    (4.4)

Resource savings: To estimate the potential resource savings from the prediction, we consider a simple proactive resource-saving strategy, i.e., killing the jobs that are predicted to fail (as permitted by the users). Assuming that a job can be killed at most once, we use the following metrics:

R+: resources saved by stopping failed jobs
R−: resources wasted by stopping finished jobs
Rall: resources consumed by failed and finished jobs

Rratio, which represents the relative resource savings, is therefore calculated as:

Rratio = (R+ − R−) / Rall    (4.5)

We select the multi-task/single-task batch long jobs, and simulate the outcomes of predicting at the halfway points of the job lengths. Figure 4.3 plots the potential relative resource (CPU usage) savings with respect to TPR and FPR.

Figure 4.3: Approximate relative savings in CPU usage in the predictor designs. (a) Multi-task batch long jobs; (b) Single-task batch long jobs.

Given the same values of TPR and FPR, the relative resource savings can be rather different in the two categories. In the best case, about 32% and 13% of the CPU usage could be saved, respectively; in the worst case, about 15% and 35% of the CPU usage could be wasted. To save more resources, TPR is more important for multi-task long batch jobs, while FPR is more important for single-task long batch jobs.
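Equations 4.1 to 4.5 can be implemented directly. The sketch below is a simplification of the simulation in the text: it treats each job's full consumption as the amount at stake when the job is stopped, and the input shapes (status pairs, per-job resource list) are illustrative:

```python
# Direct implementations of Equations 4.1-4.5. `outcomes` pairs each job's
# true termination status with its predicted status; `resources` gives each
# job's consumption (e.g., CPU-hours) in the same order.
def tpr_fpr(outcomes):
    tp = sum(1 for true, pred in outcomes
             if true == "failed" and pred == "failed")
    fp = sum(1 for true, pred in outcomes
             if true == "finished" and pred == "failed")
    failures = sum(1 for true, _ in outcomes if true == "failed")
    finished = sum(1 for true, _ in outcomes if true == "finished")
    # TPR is the sensitivity (Eq. 4.3); 1 - FPR is the specificity (Eq. 4.4)
    return tp / failures, fp / finished

def resource_ratio(outcomes, resources):
    """Relative savings (Eq. 4.5) from killing jobs predicted to fail."""
    r_plus = sum(r for (true, pred), r in zip(outcomes, resources)
                 if true == "failed" and pred == "failed")    # R+
    r_minus = sum(r for (true, pred), r in zip(outcomes, resources)
                  if true == "finished" and pred == "failed")  # R-
    return (r_plus - r_minus) / sum(resources)                 # / Rall
```

Because R− is subtracted, a predictor with good TPR can still produce negative savings if its FPR is high, which is exactly the trade-off Figure 4.3 visualizes.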
In the predictor design, we can trade off TPR against FPR by varying the classification thresholds, and come up with a conservative predictor (low TPR/FPR) and an aggressive predictor (high TPR/FPR). Separate predictors can be used for different categories to maximize the resource savings.

Experiments on the Workloads  Considering the large size of the original data, we conduct the following tests on the first half of the data:

1. We select the failed and finished medium/long jobs, and partition them into training and test sets in chronological order. At different time slots (the quarter, halfway and end points) within a job, we make the predictions at the task level and the job level, and calculate the prediction metrics.

2. We conduct early prediction at the quarter and halfway points, and then calculate the relative resource savings. The target jobs of high resource consumption are the multi-task, batch/free, medium-length/long jobs, as discussed in the previous section.

4.4 Experiment Results

Task Level  At the task level, we classify the termination statuses of task submissions based on the attributes and performance data. In all the target classes, the status finished is considered as one class, and the other three statuses, i.e., evicted, killed and failed, are considered as a single class due to their reliability implications and severity. We evaluate the task-level classification in Figure 4.4.

Figure 4.4: Task level results of metrics

We observe that the classification achieves around 84% accuracy, 86% sensitivity and 80% specificity. With this high true positive rate and low false positive rate, the task-level classification serves as the foundation of the job-level prediction.

Job Level  At the job level, we classify the termination statuses of jobs into two classes: failed and finished.
Figure 4.5 shows the prediction results of the conservative and aggressive predictors at different time slots of the jobs.

Figure 4.5: Job level results of metrics. (a) Conservative predictor; (b) Aggressive predictor.

We observe distinct prediction results from the two predictors at the end of the jobs. The conservative predictor has a low FPR of less than 10%, while its TPR stays above 40%. In comparison, the aggressive predictor has around 72% TPR and 56% FPR.

The Effects of Selection on Predictors and Predicting Times  In Figure 4.5, the metrics, particularly sensitivity, gradually increase as the prediction time advances, and they do not reveal significant differences across the quarter, halfway and end points. This indicates that jobs have a high probability of being correctly predicted at the halfway point if they can be predicted at the end.

We further evaluate the resource savings using the two predictors at the quarter and halfway points. Figure 4.6 shows the relative resource savings of CPU usage, memory usage and task hours in the three job categories with the heaviest resource consumption: multi-task batch long jobs, multi-task batch medium-length jobs, and multi-task free long jobs.

We find that the overall savings in CPU usage, memory usage and task hours are around 6% to 9% for the conservative predictor at the halfway point in batch jobs. In comparison, the aggressive predictor either saves or wastes more resources. For example, the aggressive predictor saves about 4.3% and 10% more resources than the conservative predictor at the halfway point in the multi-task batch long and medium-length jobs.
However, it wastes an additional 17% of resources in the multi-task batch medium-length jobs.

In all three job categories, the conservative predictor at the halfway point is the only predictor that generates positive savings, and can hence be regarded as a stable predictor. Meanwhile, conservative predictors are friendlier to users and job schedulers, as they do not kill jobs unless they are absolutely certain of the job's failure.

Figure 4.6: Relative savings of resources (CPU usage, memory usage and task hours) in the groups of high resource consumption. (a) multi-task batch long; (b) multi-task batch medium; (c) multi-task free long.

4.4.1 Prediction Overheads

In this section, we evaluate the performance overhead of running our predictor. While we perform offline prediction in this study, in reality our technique would be deployed online, and hence it is important to assess its performance overhead.

The time taken by the prediction module for the first 50% of the jobs in the trace is around 20.4 hours. Not accounting for the time spent loading data from disk, the times taken are 17.08 hours for the training phase and 9.52 minutes for the test phase. In real-time prediction, pre-trained models are normally used, and hence we do not count the training time in our analysis. On average, the method takes 1 second to process around 2268 seconds of job data after the model has been trained. Thus, it is possible to deploy the method for online prediction. Note that this result was obtained on a single machine; it could be made even faster by using more cloud resources when deploying the predictor on a production cloud.

4.4.2 User-Based Optimization

In this experiment, to reduce the heterogeneity of the training data, we use the previous jobs from the same user to build the model.
Only users with more than 1000 jobs are considered for this optimization, while the other users continue to use the model derived from the entire set of users. Figure 4.7 shows the resource savings of the user-based optimization, compared with the original conservative predictor at the halfway point.

Figure 4.7: Resource savings of the original predictor and the user-based optimization

The overall savings of the user-based optimization, i.e., in CPU usage, memory usage and task hours, are around 7% to 10.7% for this predictor at the halfway point in batch jobs. The extra resources saved are achieved through an additional 11% increase in the true positive rate at the job level. Since jobs from the same user tend to be more similar than two random jobs, finer-grained categorization of the data may yield better results.

4.5 Summary

In summary, we propose a prediction framework for cloud failures, and in particular for cloud application failures. Using the Google cluster trace, we find that the prediction method achieves an accuracy of around 84% at the task level. Furthermore, at the job level, around 6% to 10% of resources are saved when predicting failures at the halfway points. Further discussion of the prediction is presented in Chapter 5.3.

Chapter 5

Discussion

In this chapter, we discuss the implications of the failure characterization results (Chapter 5.1), followed by the threats to validity of the failure characterization study (Chapter 5.2). Then we discuss the advantages and limitations of the prediction approach and the threats to validity of the prediction (Chapter 5.3).

5.1 Implications of the Characterization Results

Our analysis results are useful for failure-aware resource provisioning [24] and failure prediction. Such policies have also been used in failure-aware scheduling and energy-aware scheduling [33] to mitigate the effects of failed and killed jobs.
In Chapter 3.1, we find that finished jobs have much shorter running times and consume fewer resources than failed and killed jobs. This implies that many resources may be wasted on jobs that do not finish, except for those run for debugging or testing purposes, and it indicates the need for early failure prediction at the infrastructure-provider level.

In Chapter 3.2, we also found that the termination statuses of jobs are influenced by the jobs' pre-launch attributes (namely, priority and resubmission rule). For example, failed and killed jobs have high numbers of resubmissions. To save resources, it may be a good idea to limit the number of job resubmissions (for some classes of jobs) if a job is predicted to fail or terminate unsuccessfully, especially for automated resubmissions.

Another issue is that low-priority jobs contend for resources with high-priority jobs, making it more likely for high-priority jobs to fail and thus waste resources. Further, both low- and high-priority jobs experience high failure ratios (Chapter 3.3), and hence there is a need for a scheduler that can adjust job priorities based on their failure histories.

Although Google does not disclose how they maintain and update the machines in the cluster, we find that machines and containers that experience removals or updates are less prone to failures (Chapter 3.4), suggesting that these operations improve reliability. This is similar to the idea of software rejuvenation [8], but at the container level.

In Chapter 3.5, we also observed correlations between the resource consumption of jobs and their propensity for failure.
While these correlations depend on the job's priority class and on whether the job is single- or multi-task, there are significant differences between the resource consumption of failed and finished tasks, in both CPU and memory consumption, so a good failure predictor could help the resource scheduler allocate resources differently between predictably faulty and successful jobs. Further, when these correlations manifest, they do so as early as 50% into the job's run time, indicating the potential for early failure prediction for long jobs. We define a threshold of 1 hour to filter long jobs and apply the failure prediction algorithm. Such jobs can last anywhere from a few hours to a few days, so one can wait until the threshold and still obtain significant resource savings.

Finally, in Chapter 3.6, we find that job failure behaviour can be clustered into six categories based on the users submitting the jobs, and that each category has distinctive patterns in terms of job attributes and resource consumption. This information can be used in anomaly detection, for example to detect jobs that deviate significantly from the characteristics of their categories and perhaps terminate them early. This would allow more efficient resource utilization in the cluster.

5.2 Threats to Validity of the Characterization

Our study focuses on the Google cluster, and hence may not generalize to other cloud infrastructures. This is an external threat to validity. One way to mitigate this threat is to study failures of other cloud infrastructures; however, there is no publicly available failure data from real-world cloud deployments on the same scale as the Google cluster.

The main internal threat to validity is that the Google dataset is both incomplete and anonymized (out of privacy concerns). In particular, there are four limitations:

1.
It is not clear who the users are, what their workflows are, and why the users were running the jobs. Therefore, it is difficult to say anything about the effect of failures on the overall user experience.

2. A job can fail because of performance reasons (e.g., lack of resources), reliability reasons (hardware/software/network failures), or simply for testing/debugging purposes. The traces do not have enough information to infer the causes of job failures.

3. The dataset does not have program or application information, such as whether the programs were MapReduce jobs. It does not have any information about the job schedulers, or about other software running on the nodes.

4. The resource consumption is normalized by the corresponding maximum values, and the raw values are not provided. Hence, we cannot understand the reasons why certain consumption patterns are correlated with failures.

To mitigate the above threat, we would need more information about the traces, but unfortunately this is not publicly available.

There is a construct threat to validity in that we have assumed that resource conservation is a desired goal for the users of the cluster. However, this need not be the case, as the cluster may be used purely for debugging or testing tasks, where job failures are the expected behaviour. This threat could be mitigated if we knew what the cluster is used for, but this is not the case.
The target cloud stack has a layered structure of physical machine, OS, container and application, from bottom to top. Given explicit failure labels, the method predicts the failures; without failure labels, the method can be supervised by detecting anomalies induced by fault injection. Moreover, the method can easily capture the status of the bottom layer when the monitoring metrics of the physical machines are provided.

Third, the outcomes of the failure predictors can be used by further strategies such as failure/energy-aware scheduling [33], mitigation, and resource provisioning [24].

5.3.2 Limitations of the Predictor

There are three kinds of limitations: the first with regard to the trace itself, the second with regard to our prediction and mitigation strategy, and the last with regard to the model selection.

The Trace  First, the resource consumption is normalized by the maximum values of resource consumption, and hence some of the original features are lost. Second, although job failures are identified, the underlying reasons, i.e., performance reasons versus hardware/software-related reasons, are not distinguished in the trace. As a result, we cannot further separate the dataset to provide finer-grained predictions.

Mitigation Strategy  First, the basic proactive fault management we propose is to simply kill the jobs that are predicted to fail. However, if the prediction is wrong, this wastes resources, as the killed jobs would probably be restarted. The strategy is therefore sensitive to the false positive rate, as killing a job is a controversial action for the scheduler. Second, the failure prediction may not work when failures happen soon after the faults manifest; in these cases, it is difficult to predict early enough to avoid the failure.

Model Selection  First, we choose the conservative predictor at the halfway point for maximum resource savings. The selected prediction time is not provably optimal, but the solution is empirically good.
Second, the TPR, for example, ranges only from 0.25 to 0.4 when the FPR has to be kept at a low rate. Although the low TPR helps reduce the variance in the learning, improving it calls for deeper structures in the RNN and a larger set of features.

5.3.3 Threats to Validity of the Prediction

Internal Threats  Since it uses the Google cluster trace, the prediction shares the same internal threats to validity as the characterization study (discussed in Chapter 5.2). In addition, three further internal threats to validity are described below.

One internal threat comes from the method itself and the features/attributes we use. We cannot prove that the RNN-based method is necessarily the best. However, we already select features that enlarge the differences between different kinds of jobs (failed, finished, etc.) to attempt to minimize this threat. There are two additional solutions: (1) comparing the results of the multiple machine learning algorithms in Chapter 2.3 instead of only one algorithm, and (2) using deep RNNs to generate more and better features for prediction.

A second internal threat is that failed and finished tasks may have similar properties and resource usage measures inside a job. There is no guarantee that these properties and measurements can identify failures and generate correct predictions.

To deploy the prediction modules in real-time clusters, it is necessary to estimate how long a job will run, because the potential resource savings depend on predicting well before the failed terminations. The failure prediction should therefore be combined with techniques such as job completion time prediction [25]. The third internal threat is thus that the benefits of prediction (e.g., resource savings) also depend on completion time prediction; the resource savings may be reduced when the time prediction is not accurate.

External Threats  There are two external threats. First, we do not use the entire dataset.
Only part of the long and medium-length batch jobs are selected for evaluating the relative resource savings, since they consume the majority of the resources. However, some production jobs (e.g., critical web services) are supposed to keep running and never finish, and thus there is no meaning in judging whether such a job will successfully finish. In this situation, the job prediction method cannot be directly applied for validation. Second, only the Google cluster traces are used. We plan to evaluate the predictor on other cloud clusters, e.g., the cloud trace from IBM Research [48].

Chapter 6

Related Work

This chapter provides an overview of related work in the areas of failure analysis and prediction methods, studies on the Google dataset, and the machine learning techniques, and describes how this work differs from them.

Failure analysis.  Prior studies characterize failures in supercomputers and clouds from the perspective of system failures [13, 16, 49] and application failures [25, 42]. El-Sayed et al. [16] perform a comprehensive statistical analysis of supercomputer logs from Los Alamos National Labs, presented in the computer failure data repository (CFDR) [2]. They also explore the impact of environmental issues on failures. Vishwanathan et al. [49] explore the hardware reliability of clouds, and find that disks are the main culprit in node failures. Unlike our work, these studies focus on hardware reliability rather than job failures, which can be caused by hardware, software and configuration failures.

Kavulya et al. [25] analyze logs from Hadoop applications and characterize their job patterns and failure causes. Ren et al. [43] study logs collected from Hadoop clusters running e-commerce applications. In contrast to these workloads, the Google dataset has a more diverse workload, and hence our findings are applicable to a broader range of cloud applications.

Using workload traces from The Grid Workload Archive project [23], Fadishei et al.
[17] find correlations between job failures and attributes including CPU intensity, memory usage, CPU utilization, queue utilization, exit hour, and the migration of jobs. Williams et al. [50] empirically analyze fault-free and faulty performance data from a replicated middleware-based system, and find that unstable performance is a precursor of failures. While these works have all investigated the relationship between resource consumption and job failures, they have been confined to particular classes of jobs. In contrast, our work is the first to explore such correlations in a diverse workload in a production cloud.

Failure Prediction.  Online failure prediction based on runtime monitoring is a popular research area. There is a variety of models and methods that use the current state of a system and, frequently, past experience as well, for example the work by Salfner et al. [46]. The methods can be broadly categorized into 4 groups: (1) rule-based approaches, including expert knowledge, (2) classifiers, such as Bayesian and fuzzy classifiers, (3) statistical tests, and (4) time series analysis. Rather than restricting itself to a single category, our prediction method builds on a combination of the last three categories.

Prior results on failure analysis and characterization have been applied to failure diagnosis and prediction in supercomputers and cloud clusters [17, 29, 35, 50]. Liang et al. [29] use tagged logs from the BlueGene machine to discover failure recurrences and correlations between fatal and non-fatal events, and thus predict failures. Pan et al. [35] use the differences in behavior between faulty and normal nodes in a MapReduce environment to identify failures; however, problems arise when the nodes are heterogeneous or when few similar nodes can be treated as references. Williams et al. [50] empirically extract features of the unstable performance that is a precursor of failures.
They build a black-box method, and predict failures within a window ahead of impending crash failures. In summary, these works predict system failures, or are confined to particular classes of jobs. In contrast, our work is the first to predict application failures in a diverse workload in the cloud.

The results of failure prediction can be applied to improve the performance of the entire system. Oliner et al. [34] demonstrate that failure-aware scheduling can be effective even with modest prediction accuracy; they show that improved scheduling of parallel jobs has a significant impact on job response time and overall system utilization. Liu et al. [28] focus on adjusting the placement of active or running jobs in response to failure prediction, and propose an application-level job migration and processor swapping approach to diminish the impact of failures. Our work is orthogonal to the above techniques, as it deals with failure prediction, but it can facilitate failure-aware scheduling and placement.

Machine Learning for Time Series Data.  Modelling time series data has been a subject of active research in the past decades [27]. Classical time series problems include video/speech recognition, stock market prediction, motion capture data recognition, and physiological data (e.g., EEG) recognition. Similar to these topics, the performance data in a cloud cluster have a large volume and high heterogeneity, and are certainly a problem suited to machine learning.

Previous studies apply recurrent neural networks to speech recognition [19, 20] and to polyphonic music prediction [36] and generation [7]. Graves et al. [19] propose a method to train RNNs to label unsegmented sequences; the results outperform the previously dominant method, HMMs [38]. Graves et al. [20] combine multiple levels of representation in the networks to build an end-to-end approach for recognition.
Different from these well-studied areas in machine learning, we adapt the methods to performance data.

Studies of the Google Cluster Dataset.  There have been a number of studies of the Google cluster dataset focusing on workload characterization and machine utilization. Liu et al. [30] perform a statistical analysis of the node-, job- and task-level workload with respect to resource utilization. Reiss et al. [39] study the heterogeneity of tasks in the Google dataset, and find that the resources and the tasks executed vary widely. Khan et al. [26] propose an accurate characterization that can faithfully reproduce the performance of historical workload traces in terms of key performance metrics, such as task wait time and machine resource utilization. Zhang et al. [52] propose a model of runtime task resource usage that is able to reproduce aggregate resource usage and scheduling delays. They find that using the mean and coefficient of variation within each task can generate synthetic workload traces that reproduce accurate resource utilizations and task waiting times. Di et al. [12] compare the Google data center with a Grid system, and find that the Google dataset exhibits finer-grained resource allocation with respect to CPU and memory than the selected Grid system. The main difference between these papers and ours is that none of them study failures or failure-related attributes.

Recently, there have been a few studies on understanding failures in the Google dataset. We explain the differences between these papers and ours below.

Di et al. [11] use job-specific information and the termination statuses of tasks, and apply K-means clustering to characterize the jobs. However, their analysis is based on logical job names, which are not guaranteed to be unique. The application properties could be dominated by a few jobs, and thus the applications exhibit fewer characteristics than if all jobs were included separately.
In contrast, we use job IDs, which are guaranteed to be unique, and provide higher coverage in characterizing jobs and tasks. Further, rather than simply clustering task events to obtain cluster centroids, we correlate the clusters of failures with user profiles, and we consider job events as well. Finally, job attributes such as priorities, resubmissions and run time are not considered in their paper.

Guan et al. [21] use principal component analysis on the task resource consumption to identify the features most likely to influence failures. They find that the average correlations of the raw resource usage with failures are around 0.07 across all tasks. In contrast, we perform a finer-grained analysis on different classes of jobs and resources, and we find much higher correlations and more significant differences between failures and successful terminations. For example, in our analysis, at least 34.8% of the jobs have significant differences between the resource consumption of failed and finished tasks. They further propose a principal-component-analysis-based algorithm to identify anomalies (failures) by monitoring performance metrics. Their algorithm is essentially built on dimension reduction, which is oriented to their self-collected data with hundreds of dimensions, but shows much less accuracy on the Google trace with only 12 dimensions of resource measures. Their goal is equivalent to the task-level classification in our algorithm, while we have higher accuracy. More importantly, we predict job failures and propose applying the early prediction results to save resources.

In very recent work, Garraghan et al. [18] study the statistical distributions of node and task failures, including the mean time between failures (MTBF) and the mean time to repair (MTTR). However, distributions are not enough to characterize machine and task failures, as the workload is highly diverse.
In contrast, we use job and cloud system attributes to understand the correlations between job failures and these attributes. They also label node maintenance as failures; however, Reiss et al. [39] have shown that node maintenance is mostly planned downtime, and hence different from failures. Finally, they do not consider the correlations between the resource consumption of the jobs and their failures.

Chapter 7
Conclusion and Future Work

This thesis presents a characterization study of cloud failures and a failure predictor for cloud applications.

We investigate the characteristics of failed and killed jobs in Google's production cloud system. We characterize job failures with respect to their attributes, and study the effects of attributes, such as priority, task submissions, and resource consumption, on job failures. Failed and finished jobs and tasks have different resource usage characteristics, and these differences have a high probability of manifesting well before the jobs' end. The study points to the importance of failure prediction for resource provisioning and scheduling in compute clouds.

We then present a prediction approach, built on recurrent neural networks and ensemble methods, for predicting failures from various attributes and performance time-series data. We successfully predict the termination statuses of tasks and jobs in the Google cluster traces. Experiments show a true positive rate of more than 84% and a false positive rate of 20% at the task level. At the job level, 6%-10% of resources are saved using early prediction at the halfway points of job executions.

The work in this thesis will be extended in the following directions:

1. To extend the characterization study to a wider range of cloud systems. A comprehensive study of typical cloud clusters would benefit the building of future reliable clouds.

2. To improve the accuracy of the failure prediction algorithms for better cloud utilization
and reliability. One solution is to fully implement the parameter update model and to expand the set of features in the learning module. A lower false positive rate can make proactive failure management based on the prediction results more effective and less controversial.

3. To extend the prediction framework to general cloud clusters beyond the Google cluster.

Bibliography

[1] Amazon Web Services (AWS) - cloud computing services. http://aws.amazon.com/. Accessed: 2014-10-21.

[2] The Computer Failure Data Repository (CFDR). http://www.usenix.org/cfdr. Accessed: 2014-10-21.

[3] Docker. https://www.docker.com/. Accessed: 2014-10-21.

[4] scikit-learn: Machine learning in Python. http://scikit-learn.org/stable/. Accessed: 2014-10-21.

[5] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010.

[6] Robert Birke, Ioana Giurgiu, Lydia Y. Chen, Dorothea Wiesmann, and Ton Engbersen. Failure analysis of virtual and physical machines: Patterns, causes and characteristics. In Dependable Systems and Networks (DSN), 44th Annual IEEE/IFIP International Conference on, pages 1–12. IEEE, 2014.

[7] Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. arXiv preprint arXiv:1206.6392, 2012.

[8] Dario Bruneo, Salvatore Distefano, Francesco Longo, Antonio Puliafito, and Marco Scarpa. Workload-based software rejuvenation in cloud systems. IEEE Transactions on Computers, 62(6):1072–1085, 2013.

[9] Xin Chen, Charng-Da Lu, and Karthik Pattabiraman. Failure analysis of jobs in compute clouds: A Google cluster case study. In the International Symposium on Software Reliability Engineering (ISSRE).
IEEE, 2014.

[10] Xin Chen, Charng-Da Lu, and Karthik Pattabiraman. Failure prediction of jobs in compute clouds: A Google cluster case study. In the International Workshop on Reliability and Security Data Analysis (RSDA). IEEE, 2014.

[11] Sheng Di, Derrick Kondo, and Franck Cappello. Characterizing cloud applications on a Google data center. In Parallel Processing (ICPP), 2013 42nd International Conference on, pages 468–473. IEEE, 2013.

[12] Sheng Di, Derrick Kondo, and Walfredo Cirne. Characterization and comparison of cloud versus grid workloads. In Proceedings of the 2012 IEEE International Conference on Cluster Computing, CLUSTER '12, pages 230–238, Washington, DC, USA, 2012. IEEE Computer Society.

[13] Catello Di Martino. Lessons learned from the analysis of system failures at petascale: The case of Blue Waters.

[14] Thomas G. Dietterich. Ensemble methods in machine learning. In Multiple Classifier Systems, pages 1–15. Springer, 2000.

[15] Florin Dinu and T. S. Ng. Understanding the effects and implications of compute node related failures in Hadoop. In Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing, pages 187–198. ACM, 2012.

[16] N. El-Sayed and B. Schroeder. Reading between the lines of failure logs: Understanding how HPC systems fail. In Dependable Systems and Networks (DSN), 2013 43rd Annual IEEE/IFIP International Conference on, pages 1–12, 2013.

[17] H. Fadishei, H. Saadatfar, and H. Deldari. Job failure prediction in grid environment based on workload characteristics. In Computer Conference, 14th International CSI, pages 329–334, 2009.

[18] Peter Garraghan, Paul Townend, and Jie Xu. An empirical failure-analysis of a large-scale cloud computing environment. In High-Assurance Systems Engineering (HASE), 2014 IEEE 15th International Symposium on, pages 113–120. IEEE, 2014.

[19] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber.
Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369–376. ACM, 2006.

[20] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), IEEE International Conference on, pages 6645–6649. IEEE, 2013.

[21] Qiang Guan and Song Fu. Adaptive anomaly identification by exploring metric subspace in cloud computing infrastructures. In Reliable Distributed Systems (SRDS), International Symposium on, pages 205–214. IEEE, 2013.

[22] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer Series in Statistics. Springer, 2009.

[23] Alexandru Iosup, Hui Li, Mathieu Jan, Shanny Anoep, Catalin Dumitrescu, Lex Wolters, and D. H. J. Epema. The Grid Workloads Archive, 2008.

[24] Bahman Javadi, Jemal Abawajy, and Rajkumar Buyya. Failure-aware resource provisioning for hybrid cloud infrastructure. Journal of Parallel and Distributed Computing, 72(10):1318–1331, 2012.

[25] S. Kavulya, J. Tan, R. Gandhi, and P. Narasimhan. An analysis of traces from a production MapReduce cluster. In Cluster, Cloud and Grid Computing (CCGrid), 2010 10th IEEE/ACM International Conference on, pages 94–103, 2010.

[26] A. Khan, X. Yan, Shu Tao, and N. Anerousis. Workload characterization and prediction in the cloud: A multiple time series approach. In Network Operations and Management Symposium (NOMS), 2012 IEEE, pages 1287–1294, 2012.

[27] Martin Längkvist, Lars Karlsson, and Amy Loutfi. A review of unsupervised feature learning and deep learning for time-series modeling. Pattern Recognition Letters, 42:11–24, 2014.

[28] Yawei Li, Prashasta Gujrati, Zhiling Lan, and Xian-he Sun. Fault-driven re-scheduling for improving system-level fault resilience. In Parallel Processing, 2007.
ICPP 2007. International Conference on, pages 39–39. IEEE, 2007.

[29] Y. Liang, Y. Zhang, M. Jette, Anand Sivasubramaniam, and R. Sahoo. BlueGene/L failure analysis and prediction models. In International Conference on Dependable Systems and Networks (DSN), pages 425–434, 2006.

[30] Zitao Liu and Sangyeun Cho. Characterizing machines and workloads on a Google cluster. In Parallel Processing Workshops (ICPPW), 2012 41st International Conference on, pages 397–403, 2012.

[31] James Martens and Ilya Sutskever. Learning recurrent neural networks with Hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning, pages 1033–1040, 2011.

[32] Joseph McKean and Thomas Hettmansperger. Robust Nonparametric Statistical Methods. CRC Press, 2011.

[33] Ramesh Mishra, Namrata Rastogi, Dakai Zhu, Daniel Mossé, and Rami Melhem. Energy aware scheduling for distributed real-time systems. In Proceedings of the 17th International Symposium on Parallel and Distributed Processing, pages 21–2. IEEE Computer Society, 2003.

[34] Adam J. Oliner, Ramendra K. Sahoo, José E. Moreira, Manish Gupta, and Anand Sivasubramaniam. Fault-aware job scheduling for BlueGene/L systems. In Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International, page 64. IEEE, 2004.

[35] Xinghao Pan, Jiaqi Tan, Soila Kavulya, Rajeev Gandhi, and Priya Narasimhan. Ganesha: Blackbox diagnosis of MapReduce systems. SIGMETRICS Perform. Eval. Rev., 37(3):8–13, January 2010.

[36] Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. How to construct deep recurrent neural networks. arXiv preprint arXiv:1312.6026, 2013.

[37] Cuong Pham, Phuong Cao, Zbigniew Kalbarczyk, and Ravishankar K. Iyer. Toward a high availability cloud: Techniques and challenges. In Dependable Systems and Networks Workshops (DSN-W), 2012 IEEE/IFIP 42nd International Conference on, pages 1–6. IEEE, 2012.

[38] Lawrence Rabiner.
A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

[39] Charles Reiss, Alexey Tumanov, Gregory R. Ganger, Randy H. Katz, and Michael Kozuch. Heterogeneity and dynamicity of clouds at scale: Google trace analysis. In Proceedings of the Third ACM Symposium on Cloud Computing, page 7. ACM, 2012.

[40] Charles Reiss, Alexey Tumanov, Gregory R. Ganger, Randy H. Katz, and Michael A. Kozuch. Towards understanding heterogeneous clouds at scale: Google trace analysis. Technical Report ISTC-CC-TR-12-101, Intel Science and Technology Center for Cloud Computing, Carnegie Mellon University, Pittsburgh, PA, USA, April 2012.

[41] Charles Reiss, John Wilkes, and Joseph L. Hellerstein. Google cluster-usage traces: format + schema. Technical report, Google Inc., Mountain View, CA, USA, November 2011. Revised 2013.05.06.

[42] Kai Ren, YongChul Kwon, Magdalena Balazinska, and Bill Howe. Hadoop's adolescence: An analysis of Hadoop usage in scientific workloads. Proc. VLDB Endow., 6(10):853–864, August 2013.

[43] Zujie Ren, Xianghua Xu, Jian Wan, Weisong Shi, and Min Zhou. Workload characterization on a production Hadoop cluster: A case study on Taobao. In Workload Characterization (IISWC), International Symposium on, pages 3–13. IEEE, 2012.

[44] Peter J. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, 1987.

[45] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.

[46] Felix Salfner, Maren Lenk, and Miroslaw Malek. A survey of online failure prediction methods. ACM Comput. Surv., 42(3):10:1–10:42, March 2010.

[47] Taghrid Samak, Dan Gunter, Monte Goode, Ewa Deelman, Gideon Juve, Fabio Silva, and Karan Vahi. Failure analysis of distributed scientific workflows executing in the cloud.
In Proceedings of the 8th International Conference on Network and Service Management, pages 46–54. International Federation for Information Processing, 2012.

[48] B. Sharma, P. Jayachandran, A. Verma, and C. R. Das. CloudPD: Problem determination and diagnosis in shared dynamic clouds. In Dependable Systems and Networks (DSN), 2013 43rd Annual IEEE/IFIP International Conference on, pages 1–12, June 2013.

[49] Kashi Venkatesh Vishwanath and Nachiappan Nagappan. Characterizing cloud computing hardware reliability. In Proceedings of the 1st ACM Symposium on Cloud Computing, pages 193–204. ACM, 2010.

[50] A. W. Williams, S. M. Pertet, and P. Narasimhan. Tiresias: Black-box failure prediction in distributed systems. In Parallel and Distributed Processing Symposium, IEEE International, pages 1–8, 2007.

[51] Zhengzheng Xing, Jian Pei, and Eamonn Keogh. A brief survey on sequence classification. ACM SIGKDD Explorations Newsletter, 12(1):40–48, 2010.

[52] Qi Zhang, J. Hellerstein, and Raouf Boutaba. Characterizing task usage shapes in Google's compute clusters. Proceedings of LADIS, pages 2–3, 2011.

