UBC Theses and Dissertations


Failure analysis and prediction in compute clouds Chen, Xin 2014


Full Text


Failure Analysis and Prediction in Compute Clouds

by

Xin Chen

B.E., University of Science and Technology of China, 2011

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in The Faculty of Graduate and Postdoctoral Studies (Electrical and Computer Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

October 2014

© Xin Chen 2014

Abstract

Unlike supercomputer clusters, most cloud computing clusters are built from unreliable, commercial off-the-shelf components. The high failure rates of their hardware and software components result in node and application failures. It is therefore important to understand these failures in order to design a reliable cloud system. This thesis presents a characterization study of cloud application failures, and proposes a method to predict application failures in order to save resources.

We first analyze a workload trace from a production cloud cluster and characterize the observed failures. The goal of our work is to improve the understanding of failures in compute clouds. We present the statistical properties of job and task failures, and attempt to correlate them with key scheduling constraints, node operations, and attributes of users in the cloud. We observe that there are many opportunities to enhance the reliability of the applications running in the cloud, and further find that the resource usage patterns of the jobs can be leveraged by failure prediction techniques.

Next, we propose a prediction method based on recurrent neural networks to identify the failures. It takes the resource usage measurements or performance data, and generates features to categorize the applications into different classes. We then evaluate the method on the cloud workload trace. Our results show that the model is able to predict application failures.
Moreover, we explore early classification to identify failures, and find that the prediction algorithm gives the cloud system enough time to take proactive actions well before the applications terminate, thereby avoiding resource wastage.

Preface

This thesis is based on projects carried out in collaboration with Dr. Karthik Pattabiraman and Dr. Charng-Da Lu. The work will be published as papers at the 25th IEEE International Symposium on Software Reliability Engineering (ISSRE):

Xin Chen, Charng-Da Lu and Karthik Pattabiraman. Failure Analysis of Jobs in Compute Clouds: A Google Cluster Case Study. In the 25th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 2014 [9]. (Acceptance Rate: 25%)

Xin Chen, Charng-Da Lu and Karthik Pattabiraman. Failure Prediction of Jobs in Compute Clouds: A Google Cluster Case Study. In the International Workshop on Reliability and Security Data Analysis (RSDA), co-located with ISSRE. IEEE, 2014 [10].

I was the main author, and was responsible for developing the research methodology, evaluating the solution, and writing the papers. Karthik and Charng-Da guided me in the experiment design and result analysis, as well as in editing the papers.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
List of Acronyms
Acknowledgements
Dedication
1 Introduction
1.1 Proposed Approach
1.2 Contributions
1.3 Thesis Structure
2 Background
2.1 Virtualization in the Cloud
2.2 Google Dataset
2.3 Failure Prediction Methods
2.3.1 Recurrent Neural Networks (RNNs)
2.3.2 Ensemble Methods
3 Failure Characterization Study
3.1 Basic Failure Distributions
3.2 Task Resubmissions
3.3 Scheduling Constraints
3.4 Node Failures
3.5 Resource Usage
3.5.1 Usage at the Job Level
3.5.2 Usage at the Task Level
3.6 User-centric Analysis
3.7 Summary
4 Failure Prediction
4.1 Failure Prediction Framework
4.2 Design of the Predictor
4.3 Experimental Setup
4.4 Experiment Results
4.4.1 Prediction Overheads
4.4.2 User Based Optimization
4.5 Summary
5 Discussion
5.1 Implication of the Characterization Results
5.2 Threats to Validity of the Characterization
5.3 Discussions on the Predictor
5.3.1 Advantages of the Predictor
5.3.2 Limitations of the Predictor
5.3.3 Threats to Validity of the Prediction
6 Related Work
7 Conclusion and Future Work
Bibliography

List of Tables

2.1 Collected resources in the Google dataset.
2.2 Numbers of job and task events.
2.3 Definitions of job, task and node failures.
3.1 Fitting of log-normal distribution on job duration. µ and σ are the parameters of the log-normal distribution, and KS is the maximal distance between distributions in the test.
3.2 Fitting of log-normal distribution on CPU/memory usage.
3.3 Task priorities.
3.4 K-means clustering on user profiles.
The features for clustering are the ratios of evict, fail, finish and kill events in both jobs and tasks. The statistics of job attributes and resource usage are averages, and # represents the number of a variable. In the user attributes, a user is called a production user if production jobs account for more than 20% of all its jobs.

List of Figures

2.1 General structures of VMs and containers.
2.2 General infrastructure.
2.3 Failed jobs in the period of one month.
2.4 Recurrent neural network.
3.1 Distribution of duration of failed, finished and killed jobs.
3.2 Distribution of normalized CPU/memory usage of failed, finished and killed jobs. The original units of CPU and memory are core-seconds/second and bytes. Both measurements are normalized by the respective maximal measured value.
3.3 Numbers of failed, finished and killed jobs with task re-executions. The x-axis is the maximum resubmission count of all tasks in a job. A value of 1 means that all tasks in the job are executed once. The y-axis is the cumulative distribution function (CDF).
3.4 Failed, finished and killed jobs in different scheduling classes.
3.5 Task submissions by priority and termination status.
3.6 Task submissions by priority and termination status (excluding resubmissions).
3.7 Distribution of machine cycles.
3.8 Average ratio of failed tasks vs. machine cycles. The x-axis is the machine cycle number. In cycle number k, data points in the kth cycle are chosen from the machines that have no less than k cycles.
The y-axis is the ratio of failed tasks.
3.9 The failure rate in a machine life cycle. The y-axis is the mean failure rate in a machine life cycle, and the x-axis is the number of machine removals. Every machine life cycle is represented by a blue dot, and the machines having more than 5 updates are plotted in red.
3.10 CPU and memory usage of jobs under different combinations of job and scheduling parameters.
3.11 Ratios of resources (CPU and memory) consumed by failed jobs to those consumed by finished jobs, among resubmitted jobs. The resource is calculated by dividing the total job resource by the number of tasks. The x-axis shows the ratio of the average resource in failed jobs to that in finished jobs, and the y-axis represents the cumulative distribution function (CDF). The vertical red line marks a ratio of 1.
3.12 p-values of normalized resource usage in the rank-sum tests of each pair of categories.
3.13 p-values of normalized resource usage in the rank-sum tests of failed and finished executions. The resources from the beginning to 50%, 80% and 90% of the running time of jobs longer than 10 minutes are collected.
4.1 General framework of prediction.
4.2 Prediction modules.
4.3 Approximate relative savings in CPU usage in the predictor designs.
4.4 Task level results of metrics.
4.5 Job level results of metrics.
4.6 Relative savings of resources (CPU usage, memory usage and task hours) in the groups of high resource consumption.
4.7 Resource savings of the original predictor and the user-based optimization.

List of Acronyms

CPI  Cycles Per Instruction
FPR  False Positive Rate
HMM  Hidden Markov Model
KS   Kolmogorov-Smirnov
MAI  Memory Accesses Per Instruction
MTBF Mean Time Between Failures
MTTR Mean Time To Repair
NN   Neural Network
RNN  Recurrent Neural Network
TPR  True Positive Rate
VM   Virtual Machine

Acknowledgements

First, I would like to thank my advisor Dr. Karthik Pattabiraman for his support over the past two years. Karthik taught me how to do research, and gave me insightful advice in our weekly meetings. Karthik is also a mentor who encouraged me to communicate with researchers at seminars and conferences. Without his continuous support, I would not have improved my reasoning and critical thinking.

I also want to thank my collaborator Dr. Charng-Da Lu for his advice, effort and resources on the projects. The projects would never have been accomplished without his contribution.

I would like to thank Prof. Sathish Gopalakrishnan, Prof. Matei Ripeanu and other colleagues in the Computer Systems Reading Group (CSRG) for their meaningful discussions. Many thanks to my labmates for making the lab an enjoyable place to work. I am also grateful to my dear friends in Canada, China and the United States for their help and encouragement.

Finally, I want to express my deepest gratitude to my family. To me they are the foremost source of unconditional love and support.

Dedication

To my parents

Chapter 1

Introduction

Cloud systems experience failures due to their large scale, heterogeneity and distributed nature. Node failures in the cloud may cause the jobs running on them to abort [29]. Also, applications may experience exceptions such as out-of-memory errors [43] and software bugs.
In the cloud, different applications can share resources, and a lack of resources may lead to performance degradation and potential application failures. One of the main challenges in cloud systems is to ensure the reliability of job execution in the presence of failures [37]. Cloud applications may span thousands of nodes and run for a long time before being aborted, which leads to the wastage of energy and other resources. Under such circumstances, it is necessary to understand the reliability of cloud applications. The long-term goal of our work is to enhance the reliability of the cloud platform and the applications running in the cloud.

Many prior studies on large-scale system reliability focus on hardware/software failures and their causes [15, 16, 49]. While these are valuable, they do not provide much insight into the failures experienced by end users. Application failures have been analyzed in popular systems such as Hadoop [25, 43] and distributed scientific workflows on Amazon EC2 [47]. However, these studies are limited to MapReduce or scientific computations, and are difficult to extrapolate to generic clouds. They do not correlate the failures with properties of applications and clouds. This leads to the first research question: what are the characteristics of job failures in a production compute cloud?

From the perspective of the underlying cloud infrastructure provider, predicting whether a certain application will ultimately fail is important for resource savings. Currently, most cloud failures are detected only when they actually happen. Some prior studies on large-scale system reliability have focused on finding correlations between the resource consumption and the failure behaviour of applications to build predictors. Ganesha [35] assumes that fault-free nodes in MapReduce have similar behaviours, and that a deviation from this behaviour indicates a failure. Williams et al.
[50] find that a fault possibly manifests as unstable performance behaviours before a failure occurs, thus enabling failure prediction techniques. Besides these algorithms, Ren et al. [43] find that most task failures in clouds result from out-of-memory exceptions, and that job failures in a commercial cloud trace are mainly caused by task failures. While these techniques are useful, none of them can deal with the scale and heterogeneity of large-scale clouds. This leads to the second research question: what are good predictors for application failures in a large-scale compute cloud system?

1.1 Proposed Approach

In this thesis, we select a representative cloud system from Google as the foundation to address these problems. The Google cluster workload trace [41] contains the workload measurements of more than 12,000 nodes during a one-month period. The jobs in the trace range from single-task jobs to multi-task computations [39, 40]. The jobs consist of production jobs, e.g., web services, and batch jobs that perform computations and finish.

To answer the first research question, we conduct a failure analysis of the Google cluster workload trace. In particular, our goal is to understand the characteristics of job failures in order to improve the dependability of the underlying cloud infrastructure from the perspective of cloud providers. Further, we want to explore the potential for failure prediction and anomaly detection in cloud applications in order to avoid wastage of resources by managing the jobs that ultimately fail. Finally, we would like to understand the effect of job
We consider the following four aspects in ouranalysis of cloud failures: (1) application factors, which are the programs, number of tasksin a job, and the job owners; (2) cloud factors, which are node failures and maintenances;(3) configurations, which include scheduling constraints, and the policy on how many timesa failed task can be resubmitted; (4) real-time execution status, which means runtime CPUand memory resource usage. While the Google dataset provides comprehensive logs of eachjob’s resource consumption and failures in the month-long period, it hides specifics aboutthe nature of the job, as well as the physical nodes that the jobs are running on (due toprivacy reasons). Therefore, we cannot factor these into our analysis.To answer the second research question, we propose a prediction technique for cloudsystems that makes use of the resource usage measures of workloads, to predict job andtask failures. The main challenge of using the resource usage time series is to discoverfeatures that are indicative of job or task failures. Unfortunately, it is difficult to extract thefeatures directly from the time series data [51]. Instead, we use Recurrent Neural Networks(RNNs) [27, 45] to learn the temporal characteristics of the resource usage measures suchas CPU and memory usage. Then we combine trained RNNs with various job/node/userattributes to predict job failures.In the Google cluster trace, we predict if a certain application will ultimately fail, with-out identifying the underlying reasons, from the perspective of the underlying cloud infras-tructure provider. In particular, we do not distinguish the reasons behind the occurrencesof application failures, namely performance reasons (e.g., lack of resources) and reliability31.2. 
Contributionsrelated hardware/software/network reasons.1.2 ContributionsWe make the following two-fold contributions in the failure characterization and predictionin this thesis.In the failure characterization study, we analyze the statistics of job failures in the Googledataset, and correlate with the features of applications and the cloud, configurations andreal-time statuses. Our study finds that there is significant wastage of resources in theGoogle cluster due to failed jobs. The application failures are affected by job parameters,configurations and real-time status of resources in the cloud. We also find that there aremany opportunities to save resources in the cluster, if that is a desired goal. For example,failure prediction techniques can yield great benefit as they can lead to early terminationof jobs (excluding those for debug/test) that are likely to fail ultimately. We find that suchprediction schemes are not only feasible, but can yield high accuracy even just halfway intoa job’s execution (for long running jobs).We further propose a general framework of failure prediction to solve this problem. Inthe failure prediction, we present a machine learning approach based on recurrent neuralnetworks for predicting job and task failures. We find that the historical information ofjobs from the same user or users affiliated with the same group is essential to achievinghigh prediction accuracy. Our algorithm accurately predicts failures in the selected failedand finished jobs. For example, our prediction achieves a true positive rate of about 40%,and a false positive rate of 6%. Using the prediction results, proactive failure managementtechniques (e.g., restarting or killing jobs) provide 6% to 10% of relative resource savingson average.To the best of our knowledge, we are the first to perform failure characterization in a41.3. Thesis Structurelarge-scale, generic cloud system, from the job and user perspectives. 
Besides, we are the first to predict application (job) failures on the Google cluster dataset and to perform early predictions.

1.3 Thesis Structure

The rest of this thesis is organized as follows. Chapter 2 describes the framework of running applications in the cloud with containers, the target cloud system represented by the Google cluster dataset, and the machine learning techniques used for prediction. Chapter 3 characterizes the application failures from many perspectives. Chapter 4 presents a general failure prediction model, and evaluates the predictor using the Google dataset. Chapter 5 includes a detailed discussion of the implications and limits of this characterization study, and of the designs of the predictors. Chapter 6 surveys previous research related to this work. Chapter 7 concludes the thesis and proposes future directions.

Chapter 2

Background

This chapter first discusses the virtualization techniques deployed in cloud platforms such as the Google cluster (Chapter 2.1). Then the Google cluster trace is introduced, and the basic facts about the application failures are explained (Chapter 2.2). Finally, it discusses candidate machine learning techniques for building prediction approaches (Chapter 2.3).

2.1 Virtualization in the Cloud

A cloud platform for applications includes isolated layers from bottom to top: physical servers, middle layers (e.g., network), virtualization (e.g., hypervisor), and applications. A recent study has found that virtual machines have lower failure rates and lower recurrent failure probabilities than physical machines [6]. Normally, applications running in VMs benefit from the isolated structure and the reliability of VMs.

In a traditional VM, an entire guest OS is compulsory, which may weigh tens of GB. In comparison, the size of each virtualized application may range from only tens of MB to a few GB, which is much less than the size of the OS. To support the applications, the necessary binaries and libraries are also needed.
Containers are a form of lightweight virtualization which packages the application and its libraries without requiring a guest OS. At Google, various services from Search to Gmail are packaged and run in Linux containers, and more than 2 billion container instances are launched every week.¹ Figure 2.1 shows the general structures of VMs and containers.

Figure 2.1: General structures of VMs and containers. (a) VM (b) container

The container comprises just the application and its dependencies. For example, Docker [3] runs each container as an isolated process in userspace on the host OS, sharing the kernel with other containers. Thus, resource isolation and allocation are achieved, and this scheme is more portable and efficient. A container that adopts kernel-supported operating system virtualization is more efficient than typical hardware virtualization.

¹ http://googlecloudplatform.blogspot.ca/2014/06/an-update-on-container-support-on-google-cloud-platform.html

2.2 Google Dataset

The Google cluster workload traces [41] are one of the first and few publicly available traces from large cloud systems (about 12,500 compute nodes over 29 days). The jobs consist of production jobs (e.g., web services), and batch jobs that perform computations and finish. In the trace, every job contains the job name, its resource requirements and the number of tasks in it. A job consists of at least one task, and each task is also constrained by scheduling and resource usage limits. These constraints and limits are present in the trace. Resource isolation and usage measurement are achieved by setting up separate Linux containers for different tasks. Around 670,000 jobs and 26 million tasks are logged in the trace.
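As an illustration of working with such event logs, a terminal-event tally per job can be computed in a few lines. The CSV layout and named event types below are simplified stand-ins for illustration, not the trace's real schema (which encodes event types numerically):

```python
import csv
import io
from collections import Counter

# Simplified stand-in for a job-events file: job id, timestamp, event type.
sample = """\
job_id,time,event
1001,50,SUBMIT
1001,60,SCHEDULE
1001,900,FINISH
1002,55,SUBMIT
1002,70,SCHEDULE
1002,400,FAIL
1003,58,SUBMIT
1003,80,SCHEDULE
1003,300,KILL
"""

# The five terminal event types described in this section.
TERMINAL = {"FINISH", "FAIL", "KILL", "EVICT", "LOST"}

def termination_counts(f):
    """Count jobs by their terminal event type."""
    return Counter(row["event"] for row in csv.DictReader(f)
                   if row["event"] in TERMINAL)

counts = termination_counts(io.StringIO(sample))
print(counts)  # tally of terminal events
```

The same tally over the full trace would yield the per-status counts reported in this chapter.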
Figure 2.2 shows the infrastructure of the clusters.

Figure 2.2: General infrastructure

The dataset contains the following periodically profiled resource usage metrics: CPU usage (average and peak), memory usage (canonical, assigned, and peak), page cache (unmapped and total), disk I/O time (current and peak), disk usage, cycles per instruction, and memory accesses per instruction. All these measurements are normalized by the respective maximum values measured. Table 2.1 describes the metrics of collected resource usage in the dataset.

Table 2.1: Collected resources in the Google dataset.

mean CPU: mean CPU usage (core-seconds per second) in a 1s window
max CPU: maximum CPU usage (core-seconds per second) in a 1s window
mean memory: mean canonical memory usage measurement
assigned memory: memory usage based on the memory actually assigned to the container
unmapped page cache memory: Linux page cache (file-backed memory) not mapped into any userspace process
page cache memory: total Linux page cache
max memory: the maximum value of the canonical memory usage measurement observed
mean disk I/O time: mean of the sum across all disks (disk-time seconds per second) on the machine
max disk I/O time: maximum of the sum across all disks (disk-time seconds per second) on the machine
mean local disk space: mean runtime local disk capacity usage
CPI: cycles per instruction, collected from processor performance counters
MAI: memory accesses per instruction, collected from processor performance counters

A job or task has several possible termination statuses (called "event types" in the trace). These are: (1) evicted, (2) killed, (3) failed, (4) finished, and (5) lost. Evicted means that the system is unable to satisfy the job or task's resource requirements, and hence the job or task is not scheduled. Killed means that the job or task was killed either by the user or by the system administrator, or that a job on which it depends terminated abnormally. Failed means that the job or task did not finish execution, and was terminated by an exception or abnormal condition. Finished means that the job or task completed execution successfully. Lost means that the record indicating the job termination is missing. Table 2.2 shows the numbers of jobs and tasks with each termination status.

Table 2.2: Numbers of job and task events.

Event             Jobs    Tasks
Total terminated  667791  48737884
Failed            10118   13825994
Finished          385439  18207622
Killed            272190  10292068
Evicted           22      5752056
Lost              16      8754

Among the above five types, finished jobs are the most frequent (57.6%), followed by killed (40.7%) and failed jobs (1.7%). A job being evicted or lost is a very rare event. Therefore, we focus mainly on finished, killed and failed jobs in our study. In this study, we consider three kinds of failures, as shown in Table 2.3.

Figure 2.3 shows the number of job failures over the one-month period of the trace. An average of 14.6 jobs fail in an hour, and the minimum and maximum are 0 and 177 jobs, respectively. There is also a rough weekly pattern in the job failures, with failure counts dipping at roughly weekly intervals, probably due to weekly clean-ups.

Table 2.3: Definitions of job, task and node failures.

Job failure (trace event: job fail): A job is descheduled due to task failures.
Task failure (trace event: task fail): A task is descheduled due to a task failure (e.g., exceptions or software bugs).
Node failure (trace event: machine remove): A node failure leads to the machine being removed from the cluster.
Node failures are clubbed together with node maintenance, as they cannot be distinguished in the trace.

Figure 2.3: Failed jobs in the period of one month

2.3 Failure Prediction Methods

2.3.1 Recurrent Neural Networks (RNNs)

Traditional representative techniques, such as Hidden Markov Models (HMMs) and distribution-based methods, have been applied to time series data in other failure prediction problems [46]. Unlike that data, the Google cluster trace contains a large amount of high-dimensional, noisy data that can have dependencies on prior data segments. These properties make the above techniques a poor fit for the Google cluster data. For example, HMMs assume that no dependencies exist in the time domain. For distribution-based methods, the heterogeneity and the changing mean/variance characteristics make it difficult to model the data. In comparison, recurrent neural networks (RNNs) [45] can capture the temporal relations in the trace. Further, because RNNs are based on feedforward networks with connections between inputs and outputs, they can handle varying lengths in the time domain. Therefore, we use RNNs in our prediction algorithm.

Figure 2.4 shows a typical architecture of a recurrent neural network.

Figure 2.4: Recurrent neural network

Given an input sequence of resource usage x = (x_1, x_2, ..., x_T), the standard RNN calculates a sequence of hidden-layer states h = (h_1, h_2, ..., h_T) and a sequence of outputs y = (y_1, y_2, ..., y_T). The problem is considered as an instance of the general classification problem.
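Such a sequence classifier can be sketched as a NumPy toy. The dimensions, random weights, and the two-class (fail/finish) setup below are illustrative assumptions, not the thesis's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 2 input features (CPU, memory) per time step,
# 8 hidden units, 2 output classes (fail / finish).
n_in, n_hid, n_out = 2, 8, 2
W_xh = rng.normal(scale=0.1, size=(n_hid, n_in))
W_hh = rng.normal(scale=0.1, size=(n_hid, n_hid))
W_hy = rng.normal(scale=0.1, size=(n_out, n_hid))
b_h = np.zeros(n_hid)
b_y = np.zeros(n_out)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_classify(x_seq):
    """Run the recurrence over a (T, n_in) resource-usage sequence and
    return the class probabilities at the final time step."""
    h = np.zeros(n_hid)
    for x_t in x_seq:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)  # hidden-state update
        y_t = softmax(W_hy @ h + b_y)             # per-step class scores
    return y_t

# A toy 5-step sequence of normalized CPU/memory measurements.
probs = rnn_classify(np.array([[0.1, 0.2], [0.2, 0.2], [0.4, 0.3],
                               [0.7, 0.5], [0.9, 0.8]]))
print(probs)
```

With untrained random weights the output is of course meaningless; training adjusts the weight matrices to minimize the objective defined in this section.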
Then the computation follows these iteration equations [45]:

    h_t = H(W_{xh} x_t + W_{hh} h_{t-1} + b_h)        (2.1)

    y_t = softmax(W_{hy} h_t + b_y)                    (2.2)

where W_{xh}, W_{hh} and W_{hy} are weight matrices, b_h and b_y are bias vectors, H is the hidden-layer activation function (e.g., tanh), and softmax maps the output into a probability distribution over the classes.

The objective function of the RNN problem for a single pair (x, y) is f = L(ŷ, y), where L is a distance measure between the prediction ŷ and the target y. Examples of L include the squared error and the edit distance. The overall objective function is the average of the individual objective functions over all data points in the entire set (in practice, over the data of the same user):

    E = (1/N) \sum_{n=1}^{N} L(ŷ_n, y_n)               (2.3)

where N denotes the number of sequences, and ŷ_n and y_n are the prediction sequence and the corresponding target termination statuses.

2.3.2 Ensemble Methods

The Google trace is also extremely diverse in terms of the attributes of its programs, machines and users. Ensemble methods built on single estimators can capture such diversity with robustness. A common choice is to use a tree-based model as the single estimator, with a vector of features [14]. Each estimator can be trained with a random subset of the entire training data.

Representative ensemble methods fall into two families: averaging methods and boosting methods [22]. In the first family, the estimators are built independently and the outcomes of their predictions are averaged; examples are bagging methods and random forests. In the second family, the base estimators are built sequentially, further reducing the bias of the combined estimator; gradient boosting and stochastic gradient boosting are typical methods. Empirically, ensemble methods tend to produce better results when the data has significant diversity.
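The averaging family can be sketched with a tiny bagging ensemble of decision stumps. The data, stump estimator, and parameters below are hypothetical toys for illustration, not the thesis's predictor:

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_stump(X, y):
    """Fit a one-feature threshold classifier (decision stump) by
    exhaustive search over features, thresholds and polarities."""
    best = (0, 0.0, 1, np.inf)  # (feature, threshold, polarity, error)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, j] - t) > 0, 1, 0)
                err = np.mean(pred != y)
                if err < best[3]:
                    best = (j, t, pol, err)
    return best[:3]

def predict_stump(stump, X):
    j, t, pol = stump
    return np.where(pol * (X[:, j] - t) > 0, 1, 0)

def bagging_fit(X, y, n_estimators=25):
    """Averaging ensemble: each stump is trained on a bootstrap sample."""
    stumps = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(X), len(X))  # bootstrap resample
        stumps.append(fit_stump(X[idx], y[idx]))
    return stumps

def bagging_predict(stumps, X):
    votes = np.mean([predict_stump(s, X) for s in stumps], axis=0)
    return (votes > 0.5).astype(int)  # majority vote = averaged prediction

# Toy data: class 1 when the sum of two (hypothetical) usage features is high.
X = rng.random((200, 2))
y = (X.sum(axis=1) > 1.0).astype(int)
model = bagging_fit(X, y)
acc = np.mean(bagging_predict(model, X) == y)
```

Boosting methods differ only in that each successive estimator is fitted to the errors of the ensemble built so far, rather than to an independent bootstrap sample.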
Therefore, we use ensemble methods on top of RNNs for prediction.

Chapter 3

Failure Characterization Study

In this chapter, we present the failure characterization study on the Google cluster dataset. We first characterize the durations of failed, finished and killed jobs, followed by the distribution of job resource consumptions (Chapter 3.1). We then study the task submissions (Chapter 3.2), the effects of scheduling constraints on the jobs (Chapter 3.3), and the effect of node maintenance/removal on failures (Chapter 3.4). Further, we investigate correlations between resource usage and the termination status (success/failure) of a job (Chapter 3.5). Finally, we examine similarities among users and user-centric failure characteristics (Chapter 3.6). The findings are summarized in Chapter 3.7.

3.1 Basic Failure Distributions

We observe that the durations of failed, finished and killed jobs follow a heavy-tailed distribution, as shown in Figure 3.1. The majority of jobs terminate within 2000 seconds of starting (less than an hour), while the longest failed job lasts for 25 days (almost 29 days including the time to be scheduled). In addition to the terminated jobs, around 0.5% of the jobs do not terminate in the trace period, and they are not considered in this study.

Among the distributions we attempted to fit, we find that the log-normal distribution has the best fit for all three job termination types; the parameters are shown in Table 3.1. Our goodness-of-fit criterion is the Kolmogorov-Smirnov (KS) test. Based on this fitting, we find that finished jobs have shorter lengths than both failed jobs and killed jobs, on average. We speculate that there are three possible reasons for this phenomenon. First, many short jobs are consecutively executed by a few users, and the vast majority of these jobs finish successfully. Second, jobs may hang/freeze and thus get killed after running out of the allocated time or resources.
Third, some debug/test jobs can run for a long time before being killed during the development cycle.

Figure 3.1: Distribution of duration of failed, finished and killed jobs

Table 3.1: Fitting of log-normal distribution on job duration. µ and σ are the parameters of the log-normal distribution, and KS is the maximal distance between distributions in the test.

    Type      Mean Duration (Hour)   µ        σ       KS
    Failed    2.297                  -2.785   1.593   0.06
    Finished  0.181                  -3.454   1.555   0.11
    Killed    1.609                  -2.557   1.297   0.06

We also plot the CPU/memory usage of jobs in Figure 3.2. They also have a heavy-tailed distribution, and follow a log-normal distribution with the parameters in Table 3.2. The average CPU and memory consumptions of finished jobs (per second) are around half of those of failed and killed jobs. Overall, the amounts of CPU and memory consumed by failed jobs are 2.5 and 6.6 times those consumed by finished jobs. Therefore, we posit that effective failure prediction strategies to prevent resource wastage are needed.

Figure 3.2: Distribution of normalized CPU/memory usage of failed, finished and killed jobs. The original units of CPU and memory are core-seconds/second and bytes. Both measurements are normalized by the respective maximal measured value.

Table 3.2: Fitting of log-normal distribution on CPU/memory usage.

    Type               Mean Resources   µ        σ       KS
    Failed (CPU)       0.0080           -6.018   1.426   0.059
    Finished (CPU)     0.0047           -6.890   2.088   0.049
    Killed (CPU)       0.0089           -6.247   2.149   0.079
    Failed (memory)    0.0044           -7.203   1.751   0.118
    Finished (memory)  0.0017           -7.702   1.682   0.381*
    Killed (memory)    0.0035           -6.878   1.209   0.078

    * 0.381 is the KS value for the entire distribution; the KS value decreases to 0.082 after removing the biased beginning part.
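The fitting procedure behind Tables 3.1 and 3.2 can be sketched with SciPy; this is a plausible reconstruction of the method described above, not the thesis code:

```python
import numpy as np
from scipy import stats

def fit_lognormal(values):
    """Fit a log-normal by taking the mean/std of log(values), then measure
    goodness of fit with the Kolmogorov-Smirnov (KS) statistic, i.e. the
    maximal distance between the empirical and fitted distributions."""
    logs = np.log(values)
    mu, sigma = float(logs.mean()), float(logs.std())
    # scipy's lognorm is parameterized by shape s = sigma and scale = exp(mu)
    ks = stats.kstest(values, "lognorm", args=(sigma, 0, np.exp(mu))).statistic
    return mu, sigma, ks
```

Called on the job-duration samples of one termination type, this would yield one row of Table 3.1 (µ, σ, KS).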
3.2 Task Resubmissions

During the life cycle of a job, its tasks can be resubmitted and rescheduled multiple times after abnormal terminations, e.g., failures, evictions or being killed. A task can also be re-executed if the user so chooses. We examine the effects of task resubmission on the termination statuses of tasks and jobs.

In the entire dataset, the ratios of jobs with tasks that execute multiple times are 35.8%, 0.9% and 14.1% for failed, finished and killed jobs, respectively. As expected, this ratio is low for finished jobs, which rarely have failing tasks and hence do not need to re-execute them. Across all categories, we observe that around 76% of the jobs have tasks re-executed at most 4 times. Some systems such as Hadoop MapReduce have limits on the number of task resubmissions, or the user sets a limit on the resubmissions. However, we speculate that in the Google cluster there is no system-wide limit on resubmission, nor are users mandated to set such limits, and hence it is possible for resubmitted tasks to fail over and over again.

Figure 3.3 shows the CDF of task resubmissions on single-task and multi-task jobs for each of the job types. The percentages of failed, finished and killed jobs that consist of multiple tasks (i.e., multi-task jobs) are 17.8%, 4.24% and 53.3%, respectively. In terms of the average number of task resubmissions, the finished jobs have the smallest value, followed by killed jobs and failed jobs. In each category of jobs, we observe that multi-task jobs have more resubmissions on average than their single-task counterparts. Moreover, only 0.3% of finished jobs submit tasks more than 10 times, while around 9.5% of failed jobs do. We also observe that the maximum number of task resubmissions in a killed job can be as high as 9062 (single-task) and 1417 (multi-task). In comparison, the maximum numbers of task resubmissions in failed and finished jobs are around 400 and 150, respectively.
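The per-job resubmission counts discussed above (the x-axis quantity of Figure 3.3) can be derived with a simple aggregation; the (job_id, task_index) event format here is hypothetical:

```python
from collections import Counter

def max_resubmissions(submission_events):
    """Given one (job_id, task_index) pair per task submission, return, for
    each job, the maximum number of submissions over all of its tasks."""
    per_task = Counter(submission_events)       # submissions per (job, task)
    per_job = {}
    for (job_id, _task), n in per_task.items():
        per_job[job_id] = max(per_job.get(job_id, 0), n)
    return per_job
```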
We speculate that such excessive task re-executions are not useful (except when debugging or testing), and that lengthy failed or killed jobs can be preemptively stopped before more resources are wasted.

Figure 3.3: Numbers of failed, finished and killed jobs with task re-executions. The x-axis is the maximum number of resubmissions over all tasks in a job. A value of 1 means that all tasks in the job are executed once. The y-axis is the cumulative distribution function (CDF).

3.3 Scheduling Constraints

We also examine jobs by their scheduling criteria and constraints, as they may serve different purposes and have divergent behaviours. Jobs and tasks are assigned scheduling classes based on urgency, latency sensitivity, and resource access policies. Production jobs and latency-sensitive jobs are likely to be in a higher scheduling class, while non-production or non-latency-sensitive jobs are likely to be in the lowest class.

Figure 3.4 shows the distribution of failed, finished and killed jobs for different scheduling classes. The proportion of killed jobs varies across scheduling classes, with the non-latency-sensitive class 0 having the highest number of killed jobs (similar results have also been observed in prior work [30]). However, we find that the ratio of failed jobs to finished jobs is steadily low across most of the scheduling classes, and hence scheduling class does not correlate with job failures.
This implies the need to find factors other than scheduling class to characterize failures.

Figure 3.4: Failed, finished and killed jobs in different scheduling classes

A more fine-grained categorization of scheduling attributes is the task priority, which determines the nodes assigned to the task and the turnaround on the task. The priority is associated only with tasks, not with jobs. The priorities are grouped into five classes [40] ranging from 0 to 11, as described in Table 3.3.

Table 3.3: Task priorities.

    Priority Number   Task Purpose        Level     Note
    0-1               Free                Lowest    Resources are rarely charged.
    2-8               Batch               Middle    Mainly for batch jobs.
    9                 Normal production   High      Dominant in production priorities; usually latency-sensitive tasks.
    10                Monitoring          High      Monitor the health of other jobs.
    11                Infrastructure      Highest   Storage/disk I/O services.

Normally, all tasks of a job have the same priority. However, 14 out of 925 users have jobs with tasks of two or more priorities. We do not consider such jobs, and group the remaining tasks and their resubmissions by their priorities and termination statuses. The grouping is shown in Figure 3.5.

Figure 3.5: Task submissions by priority and termination status

As seen in Figure 3.5, a large number of low-priority tasks are evicted, possibly due to the overcommitment of resources. We also see a large number of task failures in the two lowest priorities. A noticeable percentage of tasks of the highest priority are killed, likely because either the requirements of the tasks are not easily fulfilled or because they have hard real-time constraints. In contrast, most of the middle (batch) priorities do not have many tasks that terminate abnormally, and they have the highest average ratio of finished jobs among the three priority groups.
This is because middle-priority jobs tend to be batch jobs, and we speculate that they often perform routine tasks, and are hence less likely to fail.

The above graph includes task resubmissions, and may hence be biased towards low-priority tasks that have a lot of submissions. To counteract the effect of resubmission, we plot the distribution of tasks after discarding resubmitted tasks in Figure 3.6. As before, we observe a high number of failed tasks in low and high priorities. The ratio of failed tasks in the low priorities is lower, but it is still three times the average failure ratio in the middle priorities. This shows that, even ignoring resubmissions, both low- and high-priority tasks are vulnerable to evictions, kills, and failures.

Figure 3.6: Task submissions by priority and termination status (excluding resubmissions)

3.4 Node Failures

Node dependability is an important aspect in understanding the overall dependability of the cloud. However, studying node dependability becomes quite complicated when virtualization allows multiple containers to share a common physical node. Failures may occur due to faults in the hardware, or in container software/instances. Unfortunately, the Google trace does not provide design details of the physical machines and containers, and hence it is not possible to trace the reliability issues to either physical nodes or containers. However, we can measure: (1) the availability of machines from the perspective of users, and (2) the effects of physical node/container reliability on overall task reliability.

In the Google cluster, each machine is identified by a unique ID. Machines may be added to and removed from the cluster due to either maintenance or failures. We call the period ranging from an "add machine" event to a "remove machine" event (or to the end of the trace) a machine cycle.
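The machine-cycle bookkeeping just defined, together with the availability computation of Section 3.4, might be sketched as follows; the (timestamp, machine_id, event) record format is an assumption, not the trace's actual schema:

```python
def machine_cycles(events, trace_end):
    """Build per-machine cycles from time-ordered "add"/"remove" events.
    A cycle spans an add event to the next remove event, or to trace_end."""
    open_since, cycles = {}, {}
    for ts, mid, ev in events:
        if ev == "add":
            open_since[mid] = ts
        elif ev == "remove" and mid in open_since:
            cycles.setdefault(mid, []).append((open_since.pop(mid), ts))
    for mid, ts in open_since.items():  # cycles still open at the end of the trace
        cycles.setdefault(mid, []).append((ts, trace_end))
    return cycles

def availability(cycles, trace_start, trace_end):
    """Uptime over total observed time, aggregated over all machines,
    treating the in-cycle periods as uptime and the rest as downtime."""
    uptime = sum(end - start for spans in cycles.values() for start, end in spans)
    total = (trace_end - trace_start) * len(cycles)
    return uptime / total
```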
A machine may have multiple cycles if it is repeatedly removed and added.

Figure 3.7 shows the CDF of the number of machine cycles in the trace. 59.1% of all the machines are never removed from the cluster, and 27% are removed exactly once. More than 99% of machines have fewer than 6 machine cycles. However, there are some machines that have a high number of cycles. For instance, one machine is removed and added 165 times, and hence has 165 cycles.

Figure 3.7: Distribution of machine cycles.

To calculate the availability of the cluster, the period within a cycle is regarded as uptime of the system, and the period between two cycles as downtime. We aggregate the total uptime E[total uptime] and total downtime E[total downtime] over all machines, and define the availability as the ratio of total uptime to the entire time:

    Availability = E[total uptime] / (E[total uptime] + E[total downtime])    (3.1)

We find that the Google cluster has an average availability of 99.82% across all nodes.

To better understand the influence of machine cycles on task failures, we compute the correlation between the average ratio of failed tasks and the number of machine cycles. Only 7 out of the 12,500 machines have more than 30 cycles, and we regard these as outliers. For the remaining machines, we plot the average ratio of failed tasks in each machine cycle in Figure 3.8. We calculate the Pearson correlation coefficient by comparing the average failed-task ratio with the number of machine cycles. The correlation coefficient is −0.52 (p-value = 0.003), suggesting a medium negative correlation between the ratio of failed tasks and the number of machine cycles. We speculate that machine rejuvenation (removals/additions) may be the cause of the lower ratio of failures.

Figure 3.8: Average ratio of failed tasks vs.
machine cycles. The x-axis is the machine cycle number. For cycle number k, data points in the kth cycle are chosen from the machines that have no fewer than k cycles. The y-axis is the ratio of failed tasks.

A machine can also update its available resources and configurations during its life cycle. We plot the average task failure rate in a machine cycle against the number of machine removals in Figure 3.9, with the machines with frequent updates plotted in red. As can be seen, frequent updates (more than 5 times) occur only on machines with few life cycles (fewer than 8). The task failure rate is at a relatively low level on these nodes despite the fact that removals are not frequent. This is likely because machines that were removed were updated while they were offline. We speculate that updating may be another strategy to enhance reliability, in addition to machine removals.

Figure 3.9: The failure rate in a machine life cycle. The y-axis is the mean failure rate in a machine life cycle, and the x-axis is the number of machine removals. Every machine life cycle is represented by a blue dot, and the machines having more than 5 updates are plotted in red.

3.5 Resource Usage

In this section, we explore the correlations between job resource usage and job failures. We perform the analysis at two levels, i.e., the job level and the task level.

3.5.1 Usage at the Job Level

We focus on CPU and memory usage in these experiments. Figure 3.10 shows the CPU and memory usage in failed and finished jobs under a combination of three factors, namely priority class, single/multi-task jobs, and job length. The priority classes considered are batch, free and production. The jobs are classified into single-task and multi-task jobs. For the multi-task jobs, the average resource usage is divided by the number of tasks, to ensure that the usage is not skewed by the number of tasks.
We classify jobs based on their length as follows: short jobs are shorter than 10 minutes; medium-length jobs are between 10 minutes and 1 hour; long jobs are longer than 1 hour. The combinations of the different options add up to a total of eighteen categories.

Figure 3.10: CPU and memory usage of jobs under different combinations of job and scheduling parameters ((a) batch priority jobs; (b) free priority jobs; (c) production priority jobs).

In all three priority groups, the multi-task jobs generally have higher average resource consumption per task than their single-task counterparts. Because we normalize this on a per-task basis, the difference is not due to the higher number of tasks. Further, the difference is most marked between single-task and multi-task jobs that are medium-length or long.

We also observe differences between the memory usage of failed versus finished jobs. In general, failed jobs consume slightly more memory than finished jobs in 14 of the 18 categories. Only 4 of the 18 categories are different: multi-task short batch jobs, single-task medium/long production jobs, and multi-task short production jobs. In contrast to memory usage, CPU usage does not vary as much between failed and finished jobs. Of the 18 categories, 12 have similar average memory usage, i.e., the ratios of the usage in failed jobs to that in finished jobs range from 0.5 to 2. Seventeen of the categories have similar average CPU usage.
An exception is the short batch group, which accounts for almost one-third of the total jobs; its failed jobs have lower CPU usage than its finished jobs.

To further remove the effects of the heterogeneity among jobs and priority groups, we study the variations among jobs that are resubmitted. Some resubmitted jobs fail, while others finish successfully, and we examine differences in their resource consumptions. The Google manual says that restarting a job will usually generate a new job ID but keep the same job name and user name [41]. So we select all unique jobs with the same job names, and compare the resources consumed by the failed executions and their finished counterparts. Figure 3.11 shows the differences between failed and finished jobs that are resubmitted. Failed jobs are found to generally have lower CPU and memory usage than their finished counterparts in all three priority groups. Specifically, around 60% to 83% of failed jobs have lower resource usage, i.e., fall to the left of the red line, in all priorities. For the rest of the jobs, the failed executions consume more resources, but the ratios are still close to 1. However, a tiny portion of jobs contradict this observation, contributing to the long tails in the distributions. One noticeable example is a series of jobs from one user repeating for more than 27 days, in which the resource consumption of the failed jobs is 1000 times that of the finished jobs.

3.5.2 Usage at the Task Level

At the task level, we mainly consider task executions (submissions). We separately gather the resource usage of failed task executions and finished task executions.

Figure 3.11: Ratios of resource (CPU and memory) consumed by failed jobs to that consumed by finished jobs in resubmitted jobs.
The resource is calculated by dividing the total job resource by the number of tasks. The x-axis shows the ratio of the average resource in failed jobs to that in finished jobs, and the y-axis represents the cumulative distribution function (CDF). The vertical red line shows where the ratio equals 1.

Different from comparing jobs, all task executions are compared within the same job. The resource usage data are normalized by dividing by the maximum value of each resource measurement. To determine whether resource usage samples from two kinds of task executions are significantly different, we use the Mann-Whitney U (rank-sum) test [32], and measure the p-values of the comparison. A p-value less than 0.05 indicates that the two samples differ significantly. The rank-sum test does not require samples to be normally distributed, and is hence more widely applicable than other statistical tests.

We select the jobs containing failed, finished and killed tasks, and study their CPU and memory consumption as inputs to the test. For example, all CPU usage samples in failed and finished executions are the inputs to the rank-sum test that checks whether the distributions of the executions differ significantly in CPU usage. We calculate the p-values of rank-sum tests on normalized resources between each pair of categories. Figure 3.12 shows the results.

Figure 3.12: p-values of normalized resource usage in the rank-sum tests of each pair of categories ((a) failed and finished executions; (b) failed and killed executions; (c) killed and finished executions).

Figure 3.12a shows the result of running the rank-sum test between failed and finished task executions, for each category. We find that 54.8%, 34.8%, and 93.2% of the p-values are smaller than 0.05 in the free, batch, and production priority classes, respectively.
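The per-job comparison described above can be sketched with SciPy's implementation of the Mann-Whitney U test (a sketch of the procedure, not the thesis code):

```python
from scipy.stats import mannwhitneyu

def usage_differs(failed_samples, finished_samples, alpha=0.05):
    """Rank-sum comparison of normalized resource-usage samples from failed vs.
    finished task executions of one job. Returns (p_value, significantly_different),
    using the same p < 0.05 criterion as the text."""
    _, p = mannwhitneyu(failed_samples, finished_samples, alternative="two-sided")
    return float(p), p < alpha
```

Running this once per job, for each resource measure, yields the p-value populations plotted in Figure 3.12.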
This result implies that most of the production jobs have vastly different resource usage between failed and finished task executions. The batch and free priority classes also have significant differences in the CPU/memory usage of tasks in a large portion of jobs.

Figures 3.12b and 3.12c depict how the killed task executions differ significantly from the failed and the finished tasks. The average ratios of low p-values in the failed/killed test and the finished/killed test decrease to around 34% and 28% over all priorities. Production jobs experience the largest drops, to 43% and 23%, respectively.

Early Failure Prediction: The above analysis shows that there are significant differences in resource consumption between failed and finished jobs, suggesting that failure prediction techniques can leverage these differences. We now examine how early in a job's lifetime these differences manifest, for jobs that exhibit such differences. To explore the possibility of early failure prediction, we examine the differences between failed and finished executions before termination. Only jobs longer than 10 minutes are selected, as we believe that shorter jobs are unlikely to benefit from early prediction. Figure 3.13 shows the ratios of tests that have a p-value of no more than 0.05, where samples are collected from the beginning to 50%, 80% and 90% of the total execution time of the job.

Figure 3.13: p-values of normalized resource usage in the rank-sum tests of failed and finished executions. The resources from the beginning to 50%, 80% and 90% of the running time in jobs longer than 10 minutes are collected.

We find that the ratio of jobs with small p-values (< 0.05) for all three priorities remains steady from 50% of the job's execution right until the very end.
The decrease in accuracy at half-time compared to the full execution time is negligible. Therefore, for jobs that exhibit differences in their resource consumption, the differences in consumption are significant even halfway into the job.

3.6 User-centric Analysis

The goal of this analysis is to identify user-specific features that may be correlated with job failures. In the trace, around 700 out of all 933 users have completed jobs and terminated tasks. Further, 334, 397, and 670 users execute failed jobs, finished jobs, and killed jobs, respectively.

We would like to gain more insight into how user behaviors are correlated with the failures, to understand possible reliability issues for different user classes. However, we do not consider the interactions between jobs from different users or job dependencies, as the Google dataset lacks information about the computation represented by jobs/tasks and the dependencies between jobs. Instead, to get an overall understanding, we cluster the users based on the characteristics and termination statuses of the jobs they submit.

We perform K-means clustering [22] on a user feature vector consisting of the ratios of failed, finished and killed jobs and tasks. Regarding a user as a data object, the objective is to find the division of the data such that a data object is similar to other objects in the same cluster but dissimilar to objects from other clusters. A common measure of dissimilarity s is the Euclidean distance between two data points. The Silhouette score [44], an average over all points of a per-point score based on s, evaluates how appropriate the clusters are; a Silhouette score close to 1 indicates an excellent clustering. We vary the number of clusters from 2 to 10, and find that the best Silhouette score of 0.75 is achieved with 6 clusters. The centroids of the 6 clusters and the statistics of jobs and resources in each cluster are shown in Table 3.4.

Table 3.4: K-means clustering on user profiles.
The features for clustering are the ratios of evict, fail, finish and kill events in both jobs and tasks. The statistics of job attributes and resource usage are averages, and # represents the number of a variable. In the user attributes, a user is called a production user if production jobs account for more than 20% of all its jobs.

    Cluster  Job ratios (Evict/Fail/Finish/Kill)   Task ratios (Evict/Fail/Finish/Kill)   #Job     #Tasks per Job   Length    CPU       Memory    #Production User   #User
    C1       0 / 0.0604 / 0.0633 / 0.8763          0.0705 / 0.0554 / 0.0644 / 0.8097      443      536.8            790.98    0.00076   0.00263   56                 224
    C2       0 / 0.0498 / 0.316 / 0.6341           0.0645 / 0.0398 / 0.6816 / 0.214       238.79   1525.85          1035.57   0.01376   0.0039    2                  184
    C3       0 / 0.425 / 0.0741 / 0.5009           0.0597 / 0.7307 / 0.0499 / 0.1597      281.18   227.81           4448.71   0.0034    0.00713   16                 63
    C4       0 / 0.0444 / 0.8367 / 0.1188          0.0439 / 0.0441 / 0.8012 / 0.1108      18.43    2755.82          705.96    0.00545   0.00526   2                  84
    C5       0.0079 / 0.1846 / 0.0664 / 0.741      0.7 / 0.0559 / 0.08 / 0.164            349.6    69.74            753.32    0.0019    0.00614   13                 126
    C6       0 / 0.0395 / 0.8127 / 0.1477          0.1221 / 0.2946 / 0.1818 / 0.4015      28.82    952.47           664.8     0.00176   0.00998   5                  193

In Table 3.4, the clusters have the following properties.

1. Users in Cluster 1 have more than 87% of their jobs killed, and the ratio of killed tasks is as high as 81%. Further, these users submit the largest average number of jobs among all 6 clusters.
2. The three clusters with many jobs executed are Clusters 2, 3 and 5. The corresponding ratios of killed jobs are greater than 50%, while the ratios of killed tasks are low in these clusters.
3. Users in Cluster 3 have the longest median job length of 4448 minutes, and 42.5% of these jobs fail at the end. This leads to potentially significant wastage of resources.
4. Cluster 4 has the highest ratio of finished jobs/tasks among all the clusters, at about 83%. Further, each job has a median number of 2755 tasks in this cluster.
5. Cluster 5 contains the most evicted jobs and tasks among all clusters.
6. The sixth cluster has a balanced ratio between failed jobs and finished jobs.

Table 3.4 also shows the resource usage attributes of each cluster. The resource usage statistics concern the average CPU and memory usage measurements per unit time. CPU usage is found to be positively correlated with the ratios of finished jobs/tasks, but no obvious pattern applies to memory usage. We also inspect user behaviour in executing production jobs and batch jobs, and select the users who run more than 20% production jobs over the entire period. The vast majority of these users are grouped into Clusters 1, 3 and 5; Cluster 1 in particular has about 60% of the production users. This implies strong correlations between user behaviours in submitting jobs and the outcomes of application reliability. This similarity can potentially be leveraged by anomaly detection and failure prediction systems.

3.7 Summary

Our major findings are as follows:

• In the Google cluster workload traces, there is a significant consumption of resources due to failed and killed jobs.
• Job and task failures manifest differently with respect to job and cloud attributes.
  – Task resubmissions in failed jobs are much higher than those in finished jobs on average.
  – Both low- and high-priority jobs experience on average 3 times as many failures as jobs of other priorities.
  – Node maintenance and updates are correlated with smaller ratios of task failures on nodes.
• Differences in resource consumption exist between task submissions of failed and finished jobs. For jobs with multiple task submissions, at least 34.8% of the jobs have significant differences between the resource consumptions of failed and finished tasks. In most cases, the differences exist just halfway into a job's execution (for long-running jobs).
• User profiles can be clustered into 6 dominant groups, and these groups are correlated with job failures.

Based on these findings, we propose implications for the design of reliable cloud systems (Chapter 5.1).
Along with the implications, we also discuss the threats to validity of the failure characterization study (Chapter 5.2).

Chapter 4

Failure Prediction

In this chapter, we introduce a general framework for application failure prediction in compute clouds (Chapter 4.1). We describe the design details of the predictor for the Google cluster trace (Chapter 4.2). Then we use the Google cluster trace to evaluate the proposed prediction method: we explain the experiments (Chapter 4.3) and analyze the results (Chapter 4.4). Finally, the prediction work is summarized in Chapter 4.5.

4.1 Failure Prediction Framework

In this section, we introduce the general framework for the prediction problem, as shown in Figure 4.1.

Figure 4.1: General framework of prediction

The framework consists of four stages: (1) monitoring and storing the system and application metrics, (2) processing the data into structured formats containing their spatial and temporal information, (3) predicting the failures using machine learning techniques, and (4) failure remediation management based on the prediction results. The monitoring module is not at the center of this reliability study; additionally, in existing cloud traces the system and application metrics are already provided. Therefore, we focus on the data processing (2nd) and prediction (3rd) stages. We defer failure remediation based on prediction results to future work.

Data Processing: The goal of this stage is to formulate the collected performance data into layered application-centric structures, which are required by the machine learning models. For example, in the Google cluster trace, the original data tables of system and application metrics cover task resource usage measures and various attributes of the jobs, tasks, nodes and users in separate files. To integrate the data, we join the table files of system and application metrics.
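The join just described might look as follows with pandas; the column names (job_id, task_index, and so on) are illustrative, not the trace's actual schema:

```python
import pandas as pd

def build_task_level_data(task_usage, task_attrs, job_attrs):
    """Join per-task usage rows with task attributes and job attributes, so
    that every usage row carries the job/task context needed for prediction."""
    merged = task_usage.merge(task_attrs, on=["job_id", "task_index"], how="left")
    return merged.merge(job_attrs, on="job_id", how="left")
```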
Each job is associated with the performance data of all its tasks, the job/task/node/user attributes, and the failure data (or termination status, as described in Chapter 2.2). The resulting spatial/temporal data have a two-level format: (1) job-level structured data with the job termination status as the classification target, and (2) task-level structured data with the task termination status as the classification target. At the task level, the resource usage data are organized in chronological order.

Failure Prediction: This stage predicts the termination statuses of tasks and jobs, taking the two-level temporal/spatial data as inputs. Figure 4.2 describes the modules in this stage. The job modelling module trains the predictor, which is composed of RNN-based estimators extracting temporal features at the bottom, and ensemble methods combining the single estimators at the top. In the test phase, the predictor can be trained on jobs from either all users or one user. Then, in the job-level prediction module, the termination statuses of a job and its tasks are predicted. After a certain period (e.g., 1 day), the model is retrained on all recent data in the parameter update module.

Figure 4.2: Prediction modules

Algorithm 1: Prediction Framework
    Input: Two-level data of jobs
    Output: Termination statuses of the jobs/tasks
    select the ensemble predictor;
    foreach job do
        select the predictors;
        foreach task in job do
            extract task features/usage time series;
            predict the task termination status;
        end
        generate job feature vector;
        predict the job termination status;
    end

Applying RNN: The traditional RNN has a serious drawback for data with long-term dependencies: the error signals can decay exponentially as they are back-propagated through time, which leads to long-term signals being effectively lost as they are overwhelmed by un-decayed short-term signals. To overcome this issue, we need to
We use Hessian-free optimization [31] to train the temporal connections between hidden states. In this way, we model the long-term dependencies of resource measurements on prior measurements, and better capture the temporal characteristics of resource usage within an application, particularly for long-running ones.

Prediction of Jobs  Prediction is conducted for each job, and the goal is to identify its termination status. Algorithm 1 describes the prediction algorithm.

4.2 Design of the Predictor

Requirements  To meet the goal of running the failure predictors online on production cloud clusters, predicting the failures cannot be time-consuming. More than that, a good predictor should leave enough time for the failure management module to take proactive actions that mitigate the effects of failures. The predictor should also guarantee adequate prediction accuracy, as many false positives are unacceptable in a failure predictor for a reliable cloud system. In addition, the predictor should automatically generate failure reports to improve efficiency. Moreover, the cloud providers and the users who run jobs on the cluster are expected to state their reliability expectations; for example, a user who expects conservative actions on mitigating or descheduling jobs should probably accept occasional job failures. Thus, the predictor should be customizable, and we provide the opportunities to fulfill this.

Finer-Grained Selection of Data  One of the challenges in the prediction is the heterogeneity of the workloads.
To reduce the variance between jobs in a category, we divide the entire trace into multiple categories based on the following criteria:

• priority: batch, free (i.e., separated from batch for reliability reasons), and production.
• job length: short (shorter than 10 minutes), medium (10 minutes to 1 hour), and long (longer than 1 hour).
• task number: single-task jobs, and multi-task jobs.

In terms of priority, the number of production jobs is much smaller than that of batch (priority free included) jobs. In addition, because some production jobs that consume a lot of resources do not terminate within the monitoring period, we only consider the batch jobs for the prediction. In terms of job length, the number of long/medium jobs is one quarter of the number of short jobs, but the long/medium jobs consume much more resources. In terms of task number, the number of single-task jobs is 3 times that of the corresponding multi-task jobs, while the multi-task jobs consume much more resources on average. Within the batch jobs, we select the four categories with the highest resource consumption as the candidates for evaluation: the multi-task, batch/free, medium-length/long jobs. They consume more than 95% of the resources of all batch (and free) jobs.

Neural Network Setup  In the training of the neural networks, we do not use all the resource usage measures as inputs, but limit ourselves to 5 popular measures: mean CPU usage, mean memory usage, unmapped page cache, mean disk I/O time, and mean disk usage. Each measure is represented by a class in the input sequences, so the inputs have 5 classes of measures at any single time point. The original sampling intervals range from a few seconds to a few minutes. Therefore, we choose time ranges of 15 seconds, 1 minute and 5 minutes, and average the resource usage measurements within these ranges. For the target sequences, we consider the task termination statuses in the failed and finished jobs.
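The selection criteria and the input resampling described here can be sketched as follows. The thresholds (10 minutes, 1 hour) follow the text; the function signatures are illustrative, not the trace's actual record layout:

```python
# Sketch of the finer-grained job categorization and of the fixed-interval
# averaging used to prepare the neural-network input sequences.
def categorize(priority, length_sec, num_tasks):
    """Return (priority class, length class, task-number class) for a job."""
    if length_sec < 600:            # shorter than 10 minutes
        length = "short"
    elif length_sec <= 3600:        # 10 minutes to 1 hour
        length = "medium"
    else:                           # longer than 1 hour
        length = "long"
    tasks = "single-task" if num_tasks == 1 else "multi-task"
    return priority, length, tasks

def resample(series, interval):
    """Average (timestamp, value) samples into fixed-width bins, as done for
    the 15 s / 1 min / 5 min input time ranges."""
    bins = {}
    for ts, value in series:
        bins.setdefault(ts // interval, []).append(value)
    return [sum(vs) / len(vs) for _, vs in sorted(bins.items())]
```

Each of the 5 resource measures would be resampled this way, yielding aligned per-measure sequences at the chosen granularity.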
To represent the severity of task events, we assign weights of 1, 2, 3, and 4 to the categories finished, evicted, killed, and failed, respectively. Note that task failures are labelled with 4, as they have the highest severity.

In the ensemble method, the termination statuses generated by the RNN models are taken as input features, and the outputs are the termination statuses of the jobs. In addition to randomly selecting the subset of data for training, we use the subsets of one user or of similar users.

4.3 Experimental Setup

The traces are originally stored in comma-separated-value files of approximately 200GB, and the data attributes are represented by key-value pairs. We read these traces into a MySQL database for ease of analysis. Due to the large scale involved, we deploy the databases on Amazon Web Services (AWS) [1] for queries. In the prediction experiments, we join these tables, and store the transformed two-level data of the job traces. The job failure prediction module leverages machine learning packages in Python [4, 5] for the prediction. We evaluate the performance overhead of the prediction on a 12-core machine that consists of two 6-core Intel Xeon E5-2630 processors running at 2.30 GHz each.

Evaluation Metrics  The predictors are evaluated in the following aspects:

Prediction coverage: The target jobs include the long jobs and part of the medium jobs, especially in the categories of heavy resource consumption.
Predicting failures of long jobs can yield higher benefits.

Prediction times: The prediction should be conducted early, so that proactive actions can be taken.

Prediction metrics: We define a good predictor as one generating a high true positive rate (TPR) and a low false positive rate (FPR):

TPR = (# successful failure predictions) / (# failures)    (4.1)

FPR = (# finished jobs predicted as failures) / (# finished jobs)    (4.2)

For two-class classification problems, sensitivity and specificity are the statistical measures of performance:

sensitivity = TPR    (4.3)

specificity = 1 − FPR    (4.4)

Resource savings: To estimate the potential resource savings from the prediction, we consider a simple proactive resource-saving strategy, i.e., killing the jobs that are predicted to fail (as permitted by the users). Assuming that a job can be killed at most once, we use the following metrics:

R+: resources saved by stopping failed jobs
R−: resources wasted by stopping finished jobs
Rall: resources consumed by failed and finished jobs

Rratio, which represents the relative resource savings, is therefore calculated as:

Rratio = (R+ − R−) / Rall    (4.5)

We select the multi-task/single-task batch long jobs, and simulate the outcomes of predicting at the halfway points of the job lengths. Figure 4.3 plots the potential relative resource (CPU usage) savings with respect to TPR and FPR.

Figure 4.3: Approximate relative savings in CPU usage in the predictor designs. (a) Multi-task batch long jobs; (b) Single-task batch long jobs.

Given the same values of TPR and FPR, the relative resource savings can be rather different in the two categories. In the best case, about 32% and 13% of the CPU usage could be saved, respectively; in the worst case, about 15% and 35% of the CPU usage could be wasted. To save more resources, TPR is more important for multi-task long batch jobs, while FPR is more important for single-task long batch jobs.
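Equations 4.1 to 4.5 can be implemented directly. The sketch below is a simplification of the simulation in the text: it treats each job's full consumption as the amount at stake when the job is stopped, and the input shapes (status pairs, per-job resource list) are illustrative:

```python
# Direct implementations of Equations 4.1-4.5. `outcomes` pairs each job's
# true termination status with its predicted status; `resources` gives each
# job's consumption (e.g., CPU-hours) in the same order.
def tpr_fpr(outcomes):
    tp = sum(1 for true, pred in outcomes
             if true == "failed" and pred == "failed")
    fp = sum(1 for true, pred in outcomes
             if true == "finished" and pred == "failed")
    failures = sum(1 for true, _ in outcomes if true == "failed")
    finished = sum(1 for true, _ in outcomes if true == "finished")
    # TPR is the sensitivity (Eq. 4.3); 1 - FPR is the specificity (Eq. 4.4)
    return tp / failures, fp / finished

def resource_ratio(outcomes, resources):
    """Relative savings (Eq. 4.5) from killing jobs predicted to fail."""
    r_plus = sum(r for (true, pred), r in zip(outcomes, resources)
                 if true == "failed" and pred == "failed")    # R+
    r_minus = sum(r for (true, pred), r in zip(outcomes, resources)
                  if true == "finished" and pred == "failed")  # R-
    return (r_plus - r_minus) / sum(resources)                 # / Rall
```

Because R− is subtracted, a predictor with good TPR can still produce negative savings if its FPR is high, which is exactly the trade-off Figure 4.3 visualizes.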
In the predictor design, we can trade off TPR against FPR by varying the classification thresholds, and come up with a conservative predictor (low TPR/FPR) and an aggressive predictor (high TPR/FPR). Separate predictors can be used for different categories to maximize the resource savings.

Experiments on the Workloads  Considering the large size of the original data, we conduct the following tests on the first half of the data:

1. We select the failed and finished medium/long jobs, and partition them into training and test sets in chronological order. At different time slots (the quarter, halfway and end points) within a job, we make the predictions at the task level and the job level, and calculate the prediction metrics.

2. We conduct early prediction at the quarter and halfway points, and then calculate the relative resource savings. The target jobs of high resource consumption are the multi-task, batch/free, medium-length/long jobs, as discussed in the previous section.

4.4 Experiment Results

Task Level  At the task level, we classify the termination statuses of task submissions based on the attributes and performance data. In all the target classes, the status finished is considered as one class, and the other three statuses, i.e., evicted, killed and failed, are considered as a single class due to their reliability implications and severity. We evaluate the task-level classification in Figure 4.4.

Figure 4.4: Task level results of metrics

We observe that the classification achieves around 84% accuracy, 86% sensitivity and 80% specificity. With this high true positive rate and low false positive rate, the task-level classification serves as the foundation of the job-level prediction.

Job Level  At the job level, we classify the termination statuses of jobs into two classes: failed and finished.
Figure 4.5 shows the prediction results of the conservative and aggressive predictors at different time slots of the jobs.

Figure 4.5: Job level results of metrics. (a) Conservative predictor; (b) Aggressive predictor.

We observe distinct prediction results from the two predictors at the end of the jobs. The conservative predictor has a low FPR of less than 10%, while its TPR stays above 40%. In comparison, the aggressive predictor has around 72% TPR and 56% FPR.

The Effects of Selection on Predictors and Predicting Times  In Figure 4.5, the metrics, particularly sensitivity, gradually increase as the prediction time advances, and they do not reveal significant differences across the quarter, halfway and end points. This indicates that jobs have a high probability of being correctly predicted at the halfway point if they can be predicted at the end.

We further evaluate the resource savings using the two predictors at the quarter and halfway points. Figure 4.6 shows the relative resource savings of CPU usage, memory usage and task hours in the three job categories with the heaviest resource consumption: multi-task batch long jobs, multi-task batch medium-length jobs, and multi-task free long jobs.

We find that the overall savings in CPU usage, memory usage and task hours are around 6% to 9% for the conservative predictor at the halfway point in batch jobs. In comparison, the aggressive predictor either saves or wastes more resources. For example, the aggressive predictor saves about 4.3% and 10% more resources than the conservative predictor at the halfway point in the multi-task batch long and medium-length jobs.
However, it wastes an additional 17% of resources in the multi-task batch medium-length jobs.

In all three job categories, the conservative predictor at the halfway point is the only predictor that generates positive savings, and can hence be regarded as a stable predictor. Meanwhile, conservative predictors are friendlier to users and job schedulers, as they do not kill jobs unless they are absolutely certain of the job's failure.

Figure 4.6: Relative savings of resources (CPU usage, memory usage and task hours) in the groups of high resource consumption. (a) multi-task batch long; (b) multi-task batch medium; (c) multi-task free long.

4.4.1 Prediction Overheads

In this section, we evaluate the performance overhead of running our predictor. While we perform offline prediction in this study, in reality our technique would be deployed online, and hence it is important to assess its performance overhead.

The time taken by the prediction module for the first 50% of the jobs in the trace is around 20.4 hours. Not accounting for the time spent loading data from disk, the times taken are 17.08 hours for the training phase and 9.52 minutes for the test phase. In real-time prediction, pre-trained models are normally used, and hence we do not count the training time in our analysis. On average, the method takes 1 second to process around 2268 seconds of job data after the model has been trained. Thus, it is possible to deploy the method for online prediction. Note that this result was obtained on a single machine; it could be made even faster by using more cloud resources when deploying the predictor on a production cloud.

4.4.2 User-Based Optimization

In this experiment, to reduce the heterogeneity of the training data, we use the previous jobs from the same user to build the model.
Only users with more than 1000 jobs are considered for this optimization, while the other users continue to use the model derived from the entire set of users. Figure 4.7 shows the resource savings of the user-based optimization, compared with the original conservative predictor at the halfway point.

Figure 4.7: Resource savings of the original predictor and the user-based optimization

The overall savings of the user-based optimization, i.e., in CPU usage, memory usage and task hours, are around 7% to 10.7% for this predictor at the halfway point in batch jobs. The extra resources saved are achieved through an additional 11% increase in the true positive rate at the job level. Since jobs from the same user tend to be more similar than two random jobs, finer-grained categorization of the data may yield better results.

4.5 Summary

In summary, we propose a prediction framework for cloud failures, and in particular for cloud application failures. Using the Google cluster trace, we find that the prediction method achieves an accuracy of around 84% at the task level. Furthermore, at the job level, around 6% to 10% of resources are saved when predicting failures at the halfway points. Further discussion of the prediction is presented in Chapter 5.3.

Chapter 5

Discussion

In this chapter, we discuss the implications of the failure characterization results (Chapter 5.1), followed by the threats to validity of the failure characterization study (Chapter 5.2). Then we discuss the advantages and limitations of the prediction approach and the threats to validity of the prediction (Chapter 5.3).

5.1 Implications of the Characterization Results

Our analysis results are useful for failure-aware resource provisioning [24] and failure prediction. Such policies have also been used in failure-aware scheduling and energy-aware scheduling [33] to mitigate the effects of failed and killed jobs.
In Chapter 3.1, we find that finished jobs have much shorter running times and consume fewer resources than failed and killed jobs. This implies that many resources may be wasted on jobs that do not finish, except for those run for debugging or testing purposes, and it indicates the need for early failure prediction at the infrastructure-provider level.

In Chapter 3.2, we also found that the termination statuses of jobs are influenced by the jobs' pre-launch attributes (namely, priority and resubmission rule). For example, failed and killed jobs have high numbers of resubmissions. To save resources, it may be a good idea to limit the number of job resubmissions (for some classes of jobs) if a job is predicted to fail or terminate unsuccessfully, especially for automated resubmissions.

Another issue is that low-priority jobs contend for resources with high-priority jobs, making it more likely for high-priority jobs to fail and thus waste resources. Further, both low- and high-priority jobs experience high failure ratios (Chapter 3.3), and hence there is a need for a scheduler that can adjust job priorities based on their failure histories.

Although Google does not disclose how they maintain and update the machines in the cluster, we find that machines and containers that experience removals or updates are less prone to failures (Chapter 3.4), suggesting that these operations improve reliability. This is similar to the idea of software rejuvenation [8], but at the container level.

In Chapter 3.5, we also observed correlations between the resource consumption of jobs and their propensity for failure.
While these correlations depend on the job's priority class and on whether the job is single- or multi-task, there are significant differences between the resource consumption of failed and finished tasks, in both CPU and memory consumption, so a good failure predictor could help the resource scheduler allocate resources differently between predictably faulty and successful jobs. Further, when these correlations manifest, they do so as early as 50% into the job's run time, indicating the potential for early failure prediction for long jobs. We define a threshold of 1 hour to filter long jobs and apply the failure prediction algorithm. Such jobs can last anywhere from a few hours to a few days, so one can wait until the threshold and still obtain significant resource savings.

Finally, in Chapter 3.6, we find that job failure behaviour can be clustered into six categories based on the users submitting the jobs, and that each category has distinctive patterns in terms of job attributes and resource consumption. This information can be used in anomaly detection, for example to detect jobs that deviate significantly from the characteristics of their categories and perhaps terminate them early. This would allow more efficient resource utilization in the cluster.

5.2 Threats to Validity of the Characterization

Our study focuses on the Google cluster, and hence may not generalize to other cloud infrastructures. This is an external threat to validity. One way to mitigate this threat is to study failures of other cloud infrastructures; however, there is no publicly available failure data from real-world cloud deployments on the same scale as the Google cluster.

The main internal threat to validity is that the Google dataset is both incomplete and anonymized (out of privacy concerns). In particular, there are four limitations:

1.
It is not clear who the users are, what their workflows are, and why the users were running the jobs. Therefore, it is difficult to say anything about the effect of failures on the overall user experience.

2. A job can fail because of performance reasons (e.g., lack of resources), reliability reasons (hardware/software/network failures), or simply for testing/debugging purposes. The traces do not have enough information to infer the causes of job failures.

3. The dataset does not have program or application information, such as whether the programs were MapReduce jobs. It does not have any information about the job schedulers, or about other software running on the nodes.

4. The resource consumption is normalized by the corresponding maximum values, and the raw values are not provided. Hence, we cannot understand the reasons why certain consumption patterns are correlated with failures.

To mitigate the above threat, we would need more information about the traces, but unfortunately this is not publicly available.

There is a construct threat to validity in that we have assumed that resource conservation is a desired goal for the users of the cluster. However, this need not be the case, as the cluster may be used purely for debugging or testing tasks, where job failures are the expected behaviour. This threat could be mitigated if we knew what the cluster is used for, but this is not the case.
The target cloud stack has a layered structure of physical machine, OS, container and application, from bottom to top. Given explicit failure labels, the method predicts the failures; without failure labels, the method can be supervised by detecting anomalies induced by fault injection. Moreover, the method can easily capture the status of the bottom layer when the monitoring metrics of the physical machines are provided.

Third, the outcomes of the failure predictors can be used by further strategies such as failure/energy-aware scheduling [33], mitigation, and resource provisioning [24].

5.3.2 Limitations of the Predictor

There are three kinds of limitations: the first with regard to the trace itself, the second with regard to our prediction and mitigation strategy, and the last with regard to the model selection.

The Trace  First, the resource consumption is normalized by the maximum values of resource consumption, and hence some of the original features are lost. Second, although job failures are identified, the underlying reasons, i.e., performance reasons versus hardware/software-related reasons, are not distinguished in the trace. As a result, we cannot further separate the dataset to provide finer-grained predictions.

Mitigation Strategy  First, the basic proactive fault management we propose is to simply kill the jobs that are predicted to fail. However, if the prediction is wrong, this wastes resources, as the killed jobs would probably be restarted. The strategy is therefore sensitive to the false positive rate, as killing a job is a controversial action for the scheduler. Second, the failure prediction may not work when failures happen soon after the faults manifest; in these cases, it is difficult to predict early enough to avoid the failure.

Model Selection  First, we choose the conservative predictor at the halfway point for maximum resource savings. The selected prediction time is not provably optimal, but the solution is empirically good.
Second, the TPR, for example, ranges only from 0.25 to 0.4 when the FPR has to be kept at a low rate. Although the low TPR helps reduce the variance in the learning, improving it calls for deeper structures in the RNN and a larger set of features.

5.3.3 Threats to Validity of the Prediction

Internal Threats  Since it uses the Google cluster trace, the prediction shares the same internal threats to validity as the characterization study (discussed in Chapter 5.2). In addition, three further internal threats to validity are described below.

One internal threat comes from the method itself and the features/attributes we use. We cannot prove that the RNN-based method is necessarily the best. However, we already select features that enlarge the differences between different kinds of jobs (failed, finished, etc.) to attempt to minimize this threat. There are two additional solutions: (1) comparing the results of the multiple machine learning algorithms in Chapter 2.3 instead of only one algorithm, and (2) using deep RNNs to generate more and better features for prediction.

A second internal threat is that failed and finished tasks may have similar properties and resource usage measures inside a job. There is no guarantee that these properties and measurements can identify failures and generate correct predictions.

To deploy the prediction modules in real-time clusters, it is necessary to estimate how long a job will run, because the potential resource savings depend on predicting well before the failed terminations. The failure prediction should therefore be combined with techniques such as job completion time prediction [25]. The third internal threat is thus that the benefits of prediction (e.g., resource savings) also depend on completion time prediction; the resource savings may be reduced when the time prediction is not accurate.

External Threats  There are two external threats. First, we do not use the entire dataset.
Only part of the long and medium-length batch jobs are selected for evaluating the relative resource savings, since they consume the majority of the resources. However, some production jobs (e.g., critical web services) are supposed to keep running and never finish, and thus there is no meaning in judging whether such a job will successfully finish. In this situation, the job prediction method cannot be directly applied for validation. Second, only the Google cluster traces are used. We plan to evaluate the predictor on other cloud clusters, e.g., the cloud trace from IBM Research [48].

Chapter 6

Related Work

This chapter provides an overview of related work in the areas of failure analysis and prediction methods, studies on the Google dataset, and the machine learning techniques, and describes how this work differs from them.

Failure analysis.  Prior studies characterize failures in supercomputers and clouds from the perspective of system failures [13, 16, 49] and application failures [25, 42]. El-Sayed et al. [16] perform a comprehensive statistical analysis of supercomputer logs from Los Alamos National Labs, presented in the computer failure data repository (CFDR) [2]. They also explore the impact of environmental issues on failures. Vishwanathan et al. [49] explore the hardware reliability of clouds, and find that disks are the main culprit in node failures. Unlike our work, these studies focus on hardware reliability rather than job failures, which can be caused by hardware, software and configuration failures.

Kavulya et al. [25] analyze logs from Hadoop applications and characterize their job patterns and failure causes. Ren et al. [43] study logs collected from Hadoop clusters running e-commerce applications. In contrast to these workloads, the Google dataset has a more diverse workload, and hence our findings are applicable to a broader range of cloud applications.

Using workload traces from The Grid Workload Archive project [23], Fadishei et al.
[17] find correlations between job failures and attributes including CPU intensity, memory usage, CPU utilization, queue utilization, exit hour, and the migration of jobs. Williams et al. [50] empirically analyze fault-free and faulty performance data from a replicated middleware-based system, and find that unstable performance is a precursor of failures. While these works have all investigated the relationship between resource consumption and job failures, they have been confined to particular classes of jobs. In contrast, our work is the first to explore such correlations in a diverse workload in a production cloud.

Failure Prediction.  Online failure prediction based on runtime monitoring is a popular research area. There is a variety of models and methods that use the current state of a system and, frequently, past experience as well, for example the work by Salfner et al. [46]. The methods can be broadly categorized into 4 groups: (1) rule-based approaches, including expert knowledge, (2) classifiers, such as Bayesian and fuzzy classifiers, (3) statistical tests, and (4) time series analysis. Rather than restricting itself to a single category, our prediction method builds on a combination of the last three categories.

Prior results on failure analysis and characterization have been applied to failure diagnosis and prediction in supercomputers and cloud clusters [17, 29, 35, 50]. Liang et al. [29] use tagged logs from the BlueGene machine to discover failure recurrences and correlations between fatal and non-fatal events, and thus predict failures. Pan et al. [35] use the differences in behavior between faulty and normal nodes in a MapReduce environment to identify failures; however, problems arise when the nodes are heterogeneous or when few similar nodes can be treated as references. Williams et al. [50] empirically extract features of the unstable performance that is a precursor of failures.
They build a black-box method, and predict failures within a window ahead of impending crash failures. In summary, these works predict system failures, or are confined to particular classes of jobs. In contrast, our work is the first to predict application failures in a diverse workload in the cloud.

The results of failure prediction can be applied to improve the performance of the entire system. Oliner et al. [34] demonstrate that failure-aware scheduling can be effective even with modest prediction accuracy; they show that improved scheduling of parallel jobs has a significant impact on job response time and overall system utilization. Liu et al. [28] focus on adjusting the placement of active or running jobs in response to failure prediction, and propose an application-level job migration and processor swapping approach to diminish the impact of failures. Our work is orthogonal to the above techniques, as it deals with failure prediction, but it can facilitate failure-aware scheduling and placement.

Machine Learning for Time Series Data.  Modelling time series data has been a subject of active research in the past decades [27]. Classical time series problems include video/speech recognition, stock market prediction, motion capture data recognition, and physiological data (e.g., EEG) recognition. Similar to these topics, the performance data in a cloud cluster have a large volume and high heterogeneity, and are certainly a problem suited to machine learning.

Previous studies apply recurrent neural networks to speech recognition [19, 20] and to polyphonic music prediction [36] and generation [7]. Graves et al. [19] propose a method to train RNNs to label unsegmented sequences; the results outperform the previously dominant method, HMMs [38]. Graves et al. [20] combine multiple levels of representation in the networks to build an end-to-end approach for recognition.
Different from these well-studied areas in machine learning, we adapt the methods to performance data.

Studies of the Google Cluster Dataset.  There have been a number of studies of the Google cluster dataset focusing on workload characterization and machine utilization. Liu et al. [30] perform a statistical analysis of the node-, job- and task-level workload with respect to resource utilization. Reiss et al. [39] study the heterogeneity of tasks in the Google dataset, and find that the resources and the tasks executed vary widely. Khan et al. [26] propose an accurate characterization that can faithfully reproduce the performance of historical workload traces in terms of key performance metrics, such as task wait time and machine resource utilization. Zhang et al. [52] propose a model of runtime task resource usage that is able to reproduce aggregate resource usage and scheduling delays. They find that using the mean and coefficient of variation within each task can generate synthetic workload traces that reproduce accurate resource utilizations and task waiting times. Di et al. [12] compare the Google data center with a Grid system, and find that the Google dataset exhibits finer-grained resource allocation with respect to CPU and memory than the selected Grid system. The main difference between these papers and ours is that none of them study failures or failure-related attributes.

Recently, there have been a few studies on understanding failures in the Google dataset. We explain the differences between these papers and ours below.

Di et al. [11] use job-specific information and the termination statuses of tasks, and apply K-means clustering to characterize the jobs. However, their analysis is based on logical job names, which are not guaranteed to be unique. The application properties could be dominated by a few jobs, and thus the applications exhibit fewer characteristics than if all jobs were included separately.
In contrast, we use job IDs, which are guaranteed to be unique, and provide higher coverage in characterizing jobs and tasks. Further, rather than simply clustering task events to obtain cluster centroids, we correlate the clusters of failures with user profiles, and we consider job events as well. Finally, job attributes such as priorities, resubmissions and run time are not considered in their paper.

Guan et al. [21] use principal component analysis on the task resource consumption to identify the features most likely to influence failures. They find that the average correlations of the raw resource usage with failures are around 0.07 across all tasks. In contrast, we perform a finer-grained analysis on different classes of jobs and resources, and we find much higher correlations and more significant differences between failures and successful terminations. For example, in our analysis, at least 34.8% of the jobs have significant differences between the resource consumption of failed and finished tasks. They further propose a principal-component-analysis-based algorithm to identify anomalies (failures) by monitoring performance metrics. Their algorithm is essentially built on dimension reduction, which is oriented to their self-collected data with hundreds of dimensions, but shows much less accuracy on the Google trace with only 12 dimensions of resource measures. Their goal is equivalent to the task-level classification in our algorithm, while we have higher accuracy. More importantly, we predict job failures and propose applying the early prediction results to save resources.

In very recent work, Garraghan et al. [18] study the statistical distributions of node and task failures, including the mean time between failures (MTBF) and the mean time to repair (MTTR). However, distributions are not enough to characterize machine and task failures, as the workload is highly diverse.
In contrast, we use job and cloud system attributes to understand the correlations between job failures and these attributes. They also label node maintenance as failures; however, Reiss et al. [39] have shown that node maintenance is mostly planned downtime, and hence different from failures. Finally, they do not consider the correlations between the resource consumption of the jobs and their failures.

Chapter 7
Conclusion and Future Work

This thesis presents a characterization study of cloud failures and a failure predictor for cloud applications.

We investigate the characteristics of failed and killed jobs in Google's production cloud system. We characterize job failures with respect to their attributes, and study the effects of attributes, such as priority, task submissions, and resource consumption, on job failures. Failed and finished jobs and tasks have different resource usage characteristics, and these differences have a high probability of manifesting well before the jobs' end. The study points to the importance of failure prediction for resource provisioning and scheduling in compute clouds.

We then present a prediction approach, built on recurrent neural networks and ensemble methods, for predicting failures from various attributes and performance time-series data. We successfully predict the termination statuses of tasks and jobs in the Google cluster traces. Experiments show a true positive rate of more than 84% and a false positive rate of 20% at the task level. At the job level, 6%-10% of resources are saved using early prediction at the halfway points of job executions.

The work in this thesis will be extended in the following directions:

1. To extend the characterization study to a wider range of cloud systems. A comprehensive study of typical cloud clusters would benefit the building of future reliable clouds.

2. To improve the accuracy of the failure prediction algorithms for better cloud utilization
and reliability. One solution is to fully implement the parameter update model and to expand the set of features in the learning module. A lower false positive rate can make proactive failure management based on the prediction results more effective and less controversial.

3. To extend the prediction framework to general cloud clusters beyond the Google cluster.

Bibliography

[1] Amazon Web Services (AWS) - cloud computing services. http://aws.amazon.com/. Accessed: 2014-10-21.

[2] The Computer Failure Data Repository (CFDR). http://www.usenix.org/cfdr. Accessed: 2014-10-21.

[3] Docker. https://www.docker.com/. Accessed: 2014-10-21.

[4] scikit-learn: Machine learning in Python. http://scikit-learn.org/stable/. Accessed: 2014-10-21.

[5] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010.

[6] Robert Birke, Ioana Giurgiu, Lydia Y. Chen, Dorothea Wiesmann, and Ton Engbersen. Failure analysis of virtual and physical machines: Patterns, causes and characteristics. In Dependable Systems and Networks (DSN), 44th Annual IEEE/IFIP International Conference on, pages 1–12. IEEE, 2014.

[7] Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. arXiv preprint arXiv:1206.6392, 2012.

[8] Dario Bruneo, Salvatore Distefano, Francesco Longo, Antonio Puliafito, and Marco Scarpa. Workload-based software rejuvenation in cloud systems. IEEE Transactions on Computers, 62(6):1072–1085, 2013.

[9] Xin Chen, Charng-Da Lu, and Karthik Pattabiraman. Failure analysis of jobs in compute clouds: A Google cluster case study. In the International Symposium on Software Reliability Engineering (ISSRE).
IEEE, 2014.

[10] Xin Chen, Charng-Da Lu, and Karthik Pattabiraman. Failure prediction of jobs in compute clouds: A Google cluster case study. In the International Workshop on Reliability and Security Data Analysis (RSDA). IEEE, 2014.

[11] Sheng Di, Derrick Kondo, and Franck Cappello. Characterizing cloud applications on a Google data center. In Parallel Processing (ICPP), 2013 42nd International Conference on, pages 468–473. IEEE, 2013.

[12] Sheng Di, Derrick Kondo, and Walfredo Cirne. Characterization and comparison of cloud versus grid workloads. In Proceedings of the 2012 IEEE International Conference on Cluster Computing, CLUSTER '12, pages 230–238, Washington, DC, USA, 2012. IEEE Computer Society.

[13] Catello Di Martino. Lessons learned from the analysis of system failures at petascale: The case of Blue Waters.

[14] Thomas G. Dietterich. Ensemble methods in machine learning. In Multiple Classifier Systems, pages 1–15. Springer, 2000.

[15] Florin Dinu and T. S. Ng. Understanding the effects and implications of compute node related failures in Hadoop. In Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing, pages 187–198. ACM, 2012.

[16] N. El-Sayed and B. Schroeder. Reading between the lines of failure logs: Understanding how HPC systems fail. In Dependable Systems and Networks (DSN), 2013 43rd Annual IEEE/IFIP International Conference on, pages 1–12, 2013.

[17] H. Fadishei, H. Saadatfar, and H. Deldari. Job failure prediction in grid environment based on workload characteristics. In Computer Conference, 14th International CSI, pages 329–334, 2009.

[18] Peter Garraghan, Paul Townend, and Jie Xu. An empirical failure-analysis of a large-scale cloud computing environment. In High-Assurance Systems Engineering (HASE), 2014 IEEE 15th International Symposium on, pages 113–120. IEEE, 2014.

[19] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber.
Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369–376. ACM, 2006.

[20] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), IEEE International Conference on, pages 6645–6649. IEEE, 2013.

[21] Qiang Guan and Song Fu. Adaptive anomaly identification by exploring metric subspace in cloud computing infrastructures. In Reliable Distributed Systems (SRDS), International Symposium on, pages 205–214. IEEE, 2013.

[22] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer Series in Statistics. Springer, 2009.

[23] Alexandru Iosup, Hui Li, Mathieu Jan, Shanny Anoep, Catalin Dumitrescu, Lex Wolters, and D. H. J. Epema. The Grid Workloads Archive, 2008.

[24] Bahman Javadi, Jemal Abawajy, and Rajkumar Buyya. Failure-aware resource provisioning for hybrid cloud infrastructure. Journal of Parallel and Distributed Computing, 72(10):1318–1331, 2012.

[25] S. Kavulya, J. Tan, R. Gandhi, and P. Narasimhan. An analysis of traces from a production MapReduce cluster. In Cluster, Cloud and Grid Computing (CCGrid), 2010 10th IEEE/ACM International Conference on, pages 94–103, 2010.

[26] A. Khan, X. Yan, Shu Tao, and N. Anerousis. Workload characterization and prediction in the cloud: A multiple time series approach. In Network Operations and Management Symposium (NOMS), 2012 IEEE, pages 1287–1294, 2012.

[27] Martin Längkvist, Lars Karlsson, and Amy Loutfi. A review of unsupervised feature learning and deep learning for time-series modeling. Pattern Recognition Letters, 42:11–24, 2014.

[28] Yawei Li, Prashasta Gujrati, Zhiling Lan, and Xian-he Sun. Fault-driven re-scheduling for improving system-level fault resilience. In Parallel Processing, 2007.
ICPP 2007. International Conference on, pages 39–39. IEEE, 2007.

[29] Y. Liang, Y. Zhang, M. Jette, Anand Sivasubramaniam, and R. Sahoo. BlueGene/L failure analysis and prediction models. In International Conference on Dependable Systems and Networks (DSN), pages 425–434, 2006.

[30] Zitao Liu and Sangyeun Cho. Characterizing machines and workloads on a Google cluster. In Parallel Processing Workshops (ICPPW), 2012 41st International Conference on, pages 397–403, 2012.

[31] James Martens and Ilya Sutskever. Learning recurrent neural networks with Hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning, pages 1033–1040, 2011.

[32] Joseph McKean and Thomas Hettmansperger. Robust Nonparametric Statistical Methods. CRC Press, 2011.

[33] Ramesh Mishra, Namrata Rastogi, Dakai Zhu, Daniel Mossé, and Rami Melhem. Energy aware scheduling for distributed real-time systems. In Proceedings of the 17th International Symposium on Parallel and Distributed Processing, pages 21–2. IEEE Computer Society, 2003.

[34] Adam J. Oliner, Ramendra K. Sahoo, José E. Moreira, Manish Gupta, and Anand Sivasubramaniam. Fault-aware job scheduling for BlueGene/L systems. In Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International, page 64. IEEE, 2004.

[35] Xinghao Pan, Jiaqi Tan, Soila Kavulya, Rajeev Gandhi, and Priya Narasimhan. Ganesha: Blackbox diagnosis of MapReduce systems. SIGMETRICS Perform. Eval. Rev., 37(3):8–13, January 2010.

[36] Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. How to construct deep recurrent neural networks. arXiv preprint arXiv:1312.6026, 2013.

[37] Cuong Pham, Phuong Cao, Zbigniew Kalbarczyk, and Ravishankar K. Iyer. Toward a high availability cloud: Techniques and challenges. In Dependable Systems and Networks Workshops (DSN-W), 2012 IEEE/IFIP 42nd International Conference on, pages 1–6. IEEE, 2012.

[38] Lawrence Rabiner.
A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

[39] Charles Reiss, Alexey Tumanov, Gregory R. Ganger, Randy H. Katz, and Michael Kozuch. Heterogeneity and dynamicity of clouds at scale: Google trace analysis. In Proceedings of the Third ACM Symposium on Cloud Computing, page 7. ACM, 2012.

[40] Charles Reiss, Alexey Tumanov, Gregory R. Ganger, Randy H. Katz, and Michael A. Kozuch. Towards understanding heterogeneous clouds at scale: Google trace analysis. Technical Report ISTC-CC-TR-12-101, Intel Science and Technology Center for Cloud Computing, Carnegie Mellon University, Pittsburgh, PA, USA, April 2012.

[41] Charles Reiss, John Wilkes, and Joseph L. Hellerstein. Google cluster-usage traces: format + schema. Technical report, Google Inc., Mountain View, CA, USA, November 2011. Revised 2013.05.06.

[42] Kai Ren, YongChul Kwon, Magdalena Balazinska, and Bill Howe. Hadoop's adolescence: An analysis of Hadoop usage in scientific workloads. Proc. VLDB Endow., 6(10):853–864, August 2013.

[43] Zujie Ren, Xianghua Xu, Jian Wan, Weisong Shi, and Min Zhou. Workload characterization on a production Hadoop cluster: A case study on Taobao. In Workload Characterization (IISWC), International Symposium on, pages 3–13. IEEE, 2012.

[44] Peter J. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, 1987.

[45] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.

[46] Felix Salfner, Maren Lenk, and Miroslaw Malek. A survey of online failure prediction methods. ACM Comput. Surv., 42(3):10:1–10:42, March 2010.

[47] Taghrid Samak, Dan Gunter, Monte Goode, Ewa Deelman, Gideon Juve, Fabio Silva, and Karan Vahi. Failure analysis of distributed scientific workflows executing in the cloud.
In Proceedings of the 8th International Conference on Network and Service Management, pages 46–54. International Federation for Information Processing, 2012.

[48] B. Sharma, P. Jayachandran, A. Verma, and C. R. Das. CloudPD: Problem determination and diagnosis in shared dynamic clouds. In Dependable Systems and Networks (DSN), 2013 43rd Annual IEEE/IFIP International Conference on, pages 1–12, June 2013.

[49] Kashi Venkatesh Vishwanath and Nachiappan Nagappan. Characterizing cloud computing hardware reliability. In Proceedings of the 1st ACM Symposium on Cloud Computing, pages 193–204. ACM, 2010.

[50] A. W. Williams, S. M. Pertet, and P. Narasimhan. Tiresias: Black-box failure prediction in distributed systems. In Parallel and Distributed Processing Symposium, IEEE International, pages 1–8, 2007.

[51] Zhengzheng Xing, Jian Pei, and Eamonn Keogh. A brief survey on sequence classification. ACM SIGKDD Explorations Newsletter, 12(1):40–48, 2010.

[52] Qi Zhang, J. Hellerstein, and Raouf Boutaba. Characterizing task usage shapes in Google's compute clusters. Proceedings of LADIS, pages 2–3, 2011.

