Open Collections

UBC Theses and Dissertations
Support for configuration and provisioning of intermediate storage systems. Costa, Lauro Beltrão, 2014.


Full Text


Support for Configuration and Provisioning of Intermediate Storage Systems

by

Lauro Beltrão Costa

B.Sc. Computer Science, Universidade Federal de Campina Grande, 2003
M.Sc. Computer Science, Universidade Federal de Campina Grande, 2005

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Doctor of Philosophy

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Electrical and Computer Engineering)

The University of British Columbia (Vancouver)

November 2014

© Lauro Beltrão Costa, 2014

Abstract

This dissertation focuses on supporting the provisioning and configuration of distributed storage systems in clusters of computers that are designed to provide a high-performance computing platform for batch applications. These platforms typically offer a centralized persistent backend storage system. To avoid the potential bottleneck of accessing the platform's backend storage system, intermediate storage systems aggregate resources allocated to the application to provide a shared temporary storage space dedicated to the application execution.

Configuring an intermediate storage system, however, becomes increasingly complex. As a distributed storage system, intermediate storage can employ a wide range of storage techniques that enable workload-dependent trade-offs over interrelated success metrics such as response time, throughput, storage space, and energy consumption. Because it is co-deployed with the application, it offers the user the opportunity to tailor its provisioning and configuration to extract the maximum performance from the infrastructure.
For example, the user can optimize the performance by deciding the total number of nodes of an allocation, splitting these nodes, or not, between the application and the intermediate storage, and choosing the values for several configuration parameters for storage techniques with different trade-offs.

This dissertation targets the problem of supporting the configuration and provisioning of intermediate storage systems in the context of workflow-based scientific applications that communicate via files – also known as many-task computing – as well as checkpointing applications. Specifically, this study proposes performance prediction mechanisms to estimate the performance of the overall application or of storage operations (e.g., an application's turnaround time, an application's energy consumption, or the response time of write operations). By relying on the target application's characteristics, the proposed mechanisms can accelerate the exploration of the configuration space. The mechanisms use monitoring information available at the application level, requiring neither changes to the storage system nor specialized monitoring systems.

The effectiveness of these mechanisms is evaluated in a number of scenarios, including different system scales, hardware platforms, and configuration choices. Overall, the mechanisms provide accuracy high enough to support the user's decisions about configuring and provisioning the storage system, while being 200x to 2000x less resource-intensive than running the actual applications.

Preface

During my PhD studies, I have conducted the research presented in this dissertation and collaborated with other researchers on various projects. This preface presents the publications related to these projects. Acceptance rates are provided when available.

I was the main author of the research described in this dissertation, which is also the result of collaboration with several people, mostly from the Networked Systems Laboratory (NETSYSLAB) at the University of British Columbia (UBC).
Additionally, part of the material included in this dissertation has been either published or submitted for publication. The list below presents, for each chapter, the corresponding publications and my role in each one. Appendix A briefly describes other research projects [26, 53, 78–81] in which I was involved during my PhD studies.

Chapter 3, "Predicting Performance for I/O Intensive Workflow Applications", has been partially published according to the list below. I was the main contributor to the research presented in this chapter, including the initial idea, system development, and evaluation. I had the collaboration of Hao Yang, Samer Al-Kiswany, Abmar Barros, and Matei Ripeanu to discuss the project, execute experiments, or make some edits to the text. The following publications present parts of the research behind Chapter 3 of this dissertation.

Supporting Storage Configuration for I/O Intensive Workflows [57]. Lauro Beltrão Costa, Samer Al-Kiswany, Hao Yang, and Matei Ripeanu. In Proceedings of the 28th ACM International Conference on Supercomputing, ICS '14, pages 191–200. Acceptance rate: 20%. ACM, June 2014.

Predicting Intermediate Storage Performance for Workflow Applications [55]. Lauro Beltrão Costa, Samer Al-Kiswany, Abmar Barros, Hao Yang, and Matei Ripeanu. In Proceedings of the 8th Parallel Data Storage Workshop, PDSW '13, pages 33–38. ACM, November 2013.

Energy Prediction for I/O Intensive Workflows [171]. Hao Yang, Lauro Beltrão Costa, and Matei Ripeanu. In Proceedings of the 7th Workshop on Many-Task Computing on Clouds, Grids, and Supercomputers, MTAGS '14, pages 1–6. ACM, November 2014.

Chapter 4, "Using a Performance Predictor to Support Storage System Development", was partially published in the report below.
I led this project, including the initial idea, system development, and evaluation. I also had the collaboration of João Arthur Brunet, Lile Palma Hattori, and Matei Ripeanu to discuss the project and edit the text.

Experience with Applying Performance Prediction during Development: a Distributed Storage System Tale [59]. Lauro Beltrão Costa, João Brunet, Lile Hattori, and Matei Ripeanu. In Proceedings of the 2nd International Workshop on Software Engineering for High Performance Computing in Computational Science and Engineering, SE-HPCCSE '14, pages 13–19. IEEE, November 2014.

Chapter 5, "Automatically Enabling Data Deduplication", was partially published in the papers listed below. I led the work presented in this chapter, including the initial idea, system development, and evaluation. I also had the collaboration of Raquel Vigolvino Lopes, Samer Al-Kiswany, and Matei Ripeanu to discuss the project, execute experiments, or make some edits to the text.

Assessing Data Deduplication Trade-offs from an Energy and Performance Perspective [54]. Lauro Beltrão Costa, Samer Al-Kiswany, Raquel Vigolvino Lopes, and Matei Ripeanu. In Proceedings of the 2011 International Green Computing Conference and Workshops, pages 1–6. IEEE, July 2011.

Towards Automating the Configuration of a Distributed Storage System [52]. Lauro Beltrão Costa and Matei Ripeanu. In Proceedings of the 11th ACM/IEEE International Conference on Grid Computing, Grid '10, pages 201–208. Acceptance rate: 23%. IEEE, October 2010.

MosaStore Storage System

This research is part of a larger project related to storage systems that includes MosaStore (http://www.mosastore.net), a versatile storage system that harnesses resources from network-connected machines to build a high-performance, yet low-cost, storage system. A more detailed description of MosaStore appears in Chapter 2.
In the MosaStore project, I have collaborated with Samer Al-Kiswany, Emalayan Vairavanathan, Abmar Barros, Hao Yang, and Matei Ripeanu. As is common in Computer Systems research, part of the research project's contribution to the community is a working software system. In this case, the MosaStore storage system is a working prototype based on a set of proposed research ideas, which is used as a research platform for the research presented in this dissertation. As a result of this project, in addition to the working software system, we have published, or submitted for publication, materials that describe the research ideas implemented in MosaStore, as described below.

Chapter 2, "Research Platform and Background", is based on part of my contribution to this project, which includes an extended opportunity study that I led on optimizations for workflow-aware storage systems [60].

A Software Defined Storage for Scientific Workflow Applications [18]. Samer Al-Kiswany, Lauro Beltrão Costa, Hao Yang, Emalayan Vairavanathan, and Matei Ripeanu. Under review. Pages 1–14.

The Case for Workflow-Aware Storage: An Opportunity Study using MosaStore [60]. Lauro Beltrão Costa, Hao Yang, Emalayan Vairavanathan, Abmar Barros, Kethan Maheshwari, Gilles Fedak, Daniel Katz, Michael Wilde, Matei Ripeanu, and Samer Al-Kiswany. Journal of Grid Computing, pages 1–19. Accepted in June 2014. Available online.

A Workflow-Aware Storage System: An Opportunity Study [159]. Emalayan Vairavanathan, Samer Al-Kiswany, Lauro Beltrão Costa, Zhao Zhang, Daniel Katz, Michael Wilde, and Matei Ripeanu. In Proceedings of the 12th International Symposium on Cluster, Cloud, and Grid Computing, CCGrid '12, pages 326–334. Acceptance rate: 27%. IEEE, May 2012.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
Dedication

1 Introduction
  1.1 Why Tuning is Hard
  1.2 Solution Requirements
  1.3 Contributions
  1.4 Dissertation Outline

2 Research Platform and Background
  2.1 Research Platform: MosaStore Storage System
    2.1.1 MosaStore Architecture
    2.1.2 MosaStore Configuration
    2.1.3 Current Implementation
    2.1.4 Related Systems and Approaches
  2.2 Target Execution Environment: Platform, Metrics, and Applications
    2.2.1 Execution Platform and Intermediate Storage Systems
    2.2.2 Success Metrics
    2.2.3 Applications
    2.2.4 Integrating Applications and the Storage System: Cross-Layer Communication
  2.3 Related Work on Supporting Provision and Configuration
    2.3.1 Measuring System Activity
    2.3.2 Predicting Performance
    2.3.3 Placing this Dissertation in Context

3 Predicting Performance for I/O Intensive Workflow Applications
  3.1 Motivation
  3.2 The Design of a Performance Prediction Mechanism
    3.2.1 Requirements
    3.2.2 Object-based Storage Systems
    3.2.3 System Model
    3.2.4 Model Seeding: System Identification
    3.2.5 Workload Description
    3.2.6 Model Implementation: The Simulator
    3.2.7 Modeling Methodology Remarks
    3.2.8 Summary
  3.3 Evaluation
    3.3.1 Synthetic Benchmarks: Workflow Patterns
    3.3.2 The Pipeline Benchmark at Scale
    3.3.3 Supporting Decisions for a Real Application
    3.3.4 Increasing Workflow Complexity and Scale: Montage on TB101
    3.3.5 Predictor Response Time and Scalability
  3.4 Predicting Energy Consumption
    3.4.1 Energy Model Extension
    3.4.2 Energy Evaluation
    3.4.3 Energy Extension Summary
  3.5 Related Work
  3.6 Predictor Development
  3.7 Summary and Discussion

4 Using a Performance Predictor to Support Storage System Development
  4.1 Motivation
  4.2 Case Study
    4.2.1 Object System
    4.2.2 Performance Predictor
  4.3 Experience Using the Predictor during Development
    4.3.1 Case 1: Lack of Randomness
    4.3.2 Case 2: Lock Overhead
    4.3.3 Case 3: Connection Timeout
  4.4 Problems Faced
  4.5 Discussion
  4.6 Related Work
  4.7 Concluding Remarks

5 Automatically Enabling Data Deduplication
  5.1 Motivation
  5.2 Architecture
  5.3 Control Loop for Configuring Similarity Detection
    5.3.1 Prediction Model for the Controller
    5.3.2 Implementation
    5.3.3 Evaluation
  5.4 Data Deduplication Trade-offs from an Energy Perspective
    5.4.1 Assessing Performance and Energy Consumption
    5.4.2 Evaluation of the Energy Consumption Results
    5.4.3 Modelling Data Deduplication Trade-offs for Energy Consumption
  5.5 Related Work
    5.5.1 Data Deduplication
    5.5.2 Energy Optimized Systems
  5.6 Summary and Discussion

6 Concluding Remarks
  6.1 Contributions and Impact
    6.1.1 Performance Prediction Mechanisms: Models and Seeding Procedures
    6.1.2 Energy Trade-offs Assessment and Extension of the Prediction Mechanisms
    6.1.3 Supporting Development
    6.1.4 Storage System and Prediction Tool Prototypes
  6.2 Limitations and Future Work
    6.2.1 Complete Automation of User Decisions
    6.2.2 Prediction Mechanism Support for Heterogeneous Environments
    6.2.3 Support for Virtual Machines and More Complex Network and Storage Device Interactions
    6.2.4 Support for GPU and Content-Based Chunking for Data Deduplication
    6.2.5 Study on the Use of Performance Predictors to Support Development

Bibliography

A Research Collaborations

List of Tables

Table 2.1 MosaStore's currently supported configuration.
Table 2.2 Common workflow patterns.
Table 2.3 Examples of MosaStore API calls.
Table 3.1 Model parameters describing the platform.
Table 3.2 Summary of the workload description's variables.
Table 3.3 Summary of the operations modeled.
Table 3.4 Characteristics of Montage workflow stages.
Table 3.5 Energy parameters and values.
Table 3.6 Characteristics of the Montage workflow stages for the energy evaluation.
Table 3.7 Summary of the main sources of inaccuracies in the prediction mechanism.
Table 5.1 Terms of the data deduplication prediction model.

List of Figures

Figure 1.1 Different configurations deliver widely different application turnaround time.
Figure 1.2 Typical configuration loop.
Figure 2.1 MosaStore deployment.
Figure 2.2 MosaStore components.
Figure 2.3 Integration of cross-layer communication with workflow execution system.
Figure 2.4 A Workflow Optimized Storage System (WOSS) deployment.
Figure 3.1 The predictor's input and possible use-cases.
Figure 3.2 A queue-based model of a distributed storage system.
Figure 3.3 Pipeline, Reduce, and Broadcast benchmarks.
Figure 3.4 Actual and predicted average execution time for the pipeline benchmark with a medium workload.
Figure 3.5 Actual and predicted average execution time for the reduce benchmark for the medium and large workloads, and per stage for the large workload.
Figure 3.6 Actual and predicted performance for the broadcast benchmark with a medium workload.
Figure 3.7 Actual and predicted average execution time for the pipeline benchmark: weak scaling on TB101.
Figure 3.8 The BLAST application.
Figure 3.9 BLAST application runtime for a fixed-size cluster of 20 nodes.
Figure 3.10 Allocation cost and application turnaround time for BLAST.
Figure 3.11 Montage workflow.
Figure 3.12 Montage time-to-solution on TB101.
Figure 3.13 Montage running cost in CPU allocated time vs. time-to-solution.
Figure 3.14 Montage time-to-solution on TB20.
Figure 3.15 Predictor response time for 20 nodes with increasing amount of data (varying the workload) in the system (log-scale for both axes).
Figure 3.16 Response time in a weak scaling scenario.
Figure 3.17 Prediction time for 20 nodes and increasing amount of data in the system.
Figure 3.18 Integration of the performance predictor and the energy model for workflow applications.
Figure 3.19 Actual and predicted average energy consumption for the synthetic benchmarks.
Figure 3.20 Actual and predicted average energy consumption for the real applications.
Figure 3.21 Actual and predicted average energy consumption and execution time for BLAST for various CPU frequencies.
Figure 3.22 Actual and predicted average energy consumption and execution time for the pipeline benchmark for various CPU frequencies.
Figure 3.23 Actual and predicted performance for the reduce benchmark on spinning disks.
Figure 3.24 Actual and predicted performance for the pipeline benchmark using Ceph.
Figure 4.1 The use of the performance predictor as part of the development cycle.
Figure 4.2 Impact of fixing performance issues.
Figure 5.1 Control loop based on the monitor-control-actuate architectural pattern.
Figure 5.2 Average time to write a snapshot of 256MB.
Figure 5.3 Average relative utility to write a snapshot of 256MB.
Figure 5.4 Average energy consumed and time for different similarities in the 'new' testbed.
Figure 5.5 Average energy consumed and time for different similarity levels in the 'old' testbed.
Glossary

API Application Programming Interface
BLAST Basic Local Alignment Search Tool; finds regions of local similarity between DNA sequences.
CART Classification And Regression Trees
DAG Directed Acyclic Graph, a directed graph with no directed cycles.
DSS Default Storage System configuration
DVFS Dynamic Voltage and Frequency Scaling
EXT3 Third extended filesystem, a journaled file system commonly used by the Linux kernel.
FUSE Filesystem in Userspace
FIFO First In, First Out
GFS Google File System
GIS Grid Information Service
GPFS General Parallel File System
GPU Graphics Processing Unit
HDF5 Hierarchical Data Format, version 5
MD5 Message Digest Algorithm
NDBM New Database Manager, a 1986 Berkeley version of the AT&T dbm database.
NETSYSLAB Networked Systems Laboratory
NFS Network File System
NIC Network Interface Controller, also known as a Network Interface Card.
NS2 Network Simulator 2
POSIX Portable Operating System Interface
PVFS Parallel Virtual File System
QOS Quality-of-Service
QN Queuing Network-Based
RAID Redundant Array of Independent Disks
RRS Recursive Random Search
SAI System Access Interface
SHA Secure Hash Algorithm
SLA Service Level Agreement
SLO Service Level Objective
SPE Software Performance Engineering
SSH Secure Shell
SVN Apache Subversion, a software versioning and revision control system.
TDP Thermal Design Power, also known as Thermal Design Point; the maximum amount of heat that needs to be dissipated.
UBC The University of British Columbia
UFCG Universidade Federal de Campina Grande (in English, Federal University of Campina Grande)
UML Unified Modeling Language, a visual language for modeling the structure of software artifacts.
VFS Virtual Filesystem
WOSS Workflow Optimized Storage System

Acknowledgments

My PhD journey has involved support from many people, and, without such support, this work would not be possible. It is hard, perhaps impossible, to acknowledge everyone properly here.
My attempt to do so follows:

After thanking God, I would like to thank my advisor Matei Ripeanu, who always gave me freedom to choose the – sometimes sinuous – path that I wanted to take. I am also grateful for him making time whenever I knocked on his door for two quick questions, or to go upstairs for a cup of coffee and a longer discussion.

I thank the members of my examining committee. I thank Sathish, Karthik, and Ali for the enriching feedback they provided during the conversations we had, and Mike Feeley for the valuable points raised and pleasant discussion. I deeply appreciate the detailed feedback that Angela Demke Brown provided.

Raquel Lopes, Jacques Sauvé, and Walfredo Cirne played a mixed role of external advisors and gurus at different moments during the past few years. I would like to thank them for the technical discussions and, mainly, for the personal conversations we enjoyed.

My friends from TOTEM, Abdullah Gharaibeh and Elizeu Santos-Neto, thank you very much for making the environment fun and supportive in the NetSysLab and beyond.

To those who came as visitors and contributed to my work and lab environment, I thank you very much – especially João Brunet, Marcus Carvalho, Nigini Oliveira, and Thiago "Manel" Perreira.

To the MosaStore team – Abmar Barros, Hao Yang, Emalayan Vairavanathan, Samer Al-Kiswany, and Bo Fang – thank you for providing the base layer on which I could build my work, and for our conversations and enriching technical discussions. Scott Sallinen, thank you for your feedback on my presentations.

I thank my family, always available to give me support, words of comfort, and love in different forms throughout these years. Particularly, I deeply appreciate the never-ending patience I have received from my parents, Zé, and my wife.
My gratitude to you is ineffable.

I also would like to thank my "Vancouverite" friends, Alyssa Satterwhite and Mohammad Afrasiabi, for our brunches and dinners; Guilherme Germoglio for a great internship experience; Jean and Priscila for giving me shelter; and those who were physically far, but always close: Cássio Rodrigues, Eloi Rocha, Eulália Nogueira, Fábio Leite, and Karina Rocha. Finally, I thank my little fluffy creatures, Nori and Pipa, for the joy you show and bring every day.

For Lile, Zé, and Lourdes.

Chapter 1

Introduction

Distributed storage systems emerged to replace centralized storage solutions for large computing systems [10]. Indeed, aggregating available storage space from network-connected nodes to provide a distributed storage system has several appealing benefits: (i) low cost – it is cheaper than a dedicated storage solution; (ii) high performance – applications benefit from a wider I/O channel, obtained by distributing data across several nodes; (iii) high efficiency – it improves resource utilization; and (iv) incremental scalability – with distributed storage systems, it is possible to increase system capacity at a fine grain to match the demand.

These benefits, however, come at a price: decisions about resource provisioning and storage system configuration become increasingly complex [14, 27, 52, 122, 152, 156]. First, coordination of the distributed components makes managing data across several nodes more complex than in a centralized solution. Second, several distributed storage techniques present trade-offs that rarely exist in centralized solutions. Finally, different applications have different requirements, which makes these decisions about provisioning and configuration hard to make.

To illustrate these points, consider the following trade-offs of several storage techniques employed in distributed storage systems.
Data deduplication may save storage space and network bandwidth when there is high similarity across write operations, yet deduplication implies higher computational costs to detect similar data blocks [12, 52]. Higher redundancy levels (replication and erasure codes) may accelerate data access and increase reliability [93, 164], yet they may require more complex consistency protocols, as well as additional storage space, network bandwidth, and time to create the replicas or encode the data [10]. Specific caching and prefetching policies may improve an application's performance, yet the wrong choice for these parameters can cause extra, unnecessary usage of resources, degrading performance [95, 134]. Different data-placement policies benefit different workloads depending on the data access pattern [14]. Large chunk sizes can reduce overhead and better use the network, yet small chunk sizes can avoid wasted data transfers and provide more opportunities for parallel data transfers [23].

To avoid the complexity of such a large decision space, storage system designers typically opt for a design (e.g., GFS [82], NFS, and PVFS [85]) with few configuration options, which are chosen at system deployment time (e.g., replication level, cache size, data placement policy, and chunk size). The rationale, for these designers, is to address the common use case and simplify system management, while keeping the system useful for a broad range of applications.
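To make the deduplication trade-off above concrete, the toy sketch below stores each unique fixed-size chunk exactly once, keyed by its hash: duplicate chunks cost a hash computation (the CPU overhead) but no extra space or transfer (the saving). All names here (DedupStore, CHUNK_SIZE) are hypothetical illustrations, not MosaStore's implementation, and fixed-size chunking is used instead of content-based chunking for brevity.

```python
import hashlib

CHUNK_SIZE = 256 * 1024  # fixed-size chunks (a configurable parameter)

class DedupStore:
    """Toy content-addressable store: keeps one copy per unique chunk."""
    def __init__(self):
        self.chunks = {}  # chunk hash -> chunk bytes

    def write(self, data):
        """Split data into chunks; store only chunks not seen before.
        Returns (recipe, bytes_actually_stored)."""
        recipe, stored = [], 0
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            h = hashlib.sha1(chunk).hexdigest()  # CPU cost paid per chunk
            if h not in self.chunks:             # space saved on duplicates
                self.chunks[h] = chunk
                stored += len(chunk)
            recipe.append(h)
        return recipe, stored

    def read(self, recipe):
        """Reassemble a file from its chunk-hash recipe."""
        return b"".join(self.chunks[h] for h in recipe)

store = DedupStore()
snapshot1 = b"A" * CHUNK_SIZE + b"B" * CHUNK_SIZE
snapshot2 = b"A" * CHUNK_SIZE + b"C" * CHUNK_SIZE  # 50% similar to snapshot1
store.write(snapshot1)
recipe2, stored2 = store.write(snapshot2)
assert stored2 == CHUNK_SIZE       # only the new "C" chunk is stored
assert store.read(recipe2) == snapshot2
```

With high inter-write similarity, `stored` shrinks while the number of hash computations stays fixed, which is exactly the space-versus-CPU tension the paragraph above describes.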
This is known as a "one size fits all" approach [10, 14].

As an alternative to this "one size fits all" approach, previous work [10, 14] proposes a versatile storage system approach whose goal is to allow specialization of the storage system to each application in a broad range of applications. Versatility in these systems means the ability to provide a set of storage techniques that can be activated or configured at compile time, deployment time, or even at runtime, depending on the application's requirements and generated workload.

This versatility benefits the storage system user since it improves application execution, due to its ability to tailor the storage system to meet application-specific requirements. Thus, the user can analyze the application workload to determine the configuration that is most suitable for the workload present in her system. Such tailoring is especially appealing in the context of the intermediate storage systems approach, which is the focus of this research. This approach consists of aggregating resources of the application's allocation to provide a shared temporary storage space co-deployed with the application execution [14, 31, 60, 112, 174]. As a result, the application avoids the potential bottleneck of accessing the platform's backend storage system, the initialization of the storage system is coupled with the application's execution, and, more importantly for this research, the storage system is dedicated to a specific application.

In this environment, the user can, for example, configure the storage system to use more nodes, use replication only for files that are anticipated to have high read demand [10, 14, 156], enable or disable data deduplication [12] to save storage space and/or writing time, or choose the most suitable data-placement policy to improve performance [95].

In practice, however, benefiting from a versatile storage system is a challenging task.
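The per-application tailoring described above boils down to a handful of deployment-time choices. The fragment below is purely illustrative – every key and value is hypothetical, not MosaStore's actual configuration interface – but it shows the shape of the decision tuple a user must fill in:

```python
# Illustrative only: a hypothetical per-run configuration for an
# intermediate storage deployment (not MosaStore's actual format).
STORAGE_CONFIG = {
    "allocation": {
        "total_nodes": 20,   # size of the cluster/cloud allocation
        "app_nodes": 14,     # the remaining 6 nodes run storage services
    },
    "storage": {
        "chunk_size_kb": 256,
        "replication": 3,       # e.g., only for high-read-demand files
        "dedup": False,         # little inter-write similarity expected
        "placement": "local",   # write chunks to the producer's node first
    },
}

# Sanity check: the partition must leave at least one storage node.
assert STORAGE_CONFIG["allocation"]["app_nodes"] < STORAGE_CONFIG["allocation"]["total_nodes"]
```

Each entry maps to one of the trade-offs discussed earlier; the difficulty, as the following sections argue, lies in choosing good values without exhaustive trial runs.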
Properly setting the broad range of configuration options requires a careful workload analysis and a strong understanding of the storage techniques' trade-offs on the optimization criteria (e.g., response time, storage footprint, energy consumption).

Manual configuration is undesirable for several reasons [20, 22, 27, 152, 156]. First of all, the user may lack the necessary knowledge about the application and its generated storage workload. Also, variations in the workload, changes in the infrastructure, or new application versions can make one-time tuning meaningless. Finally, performance tuning is time-consuming, due to the application runtime and the potentially large space of configurations to be considered.

The Problem. In this scenario of running applications on top of intermediate storage systems, the role of the application administrator, or user, is non-trivial: if a user wants to extract maximum performance, in addition to being in charge of running the application, the user has to make a set of important decisions to achieve the desired performance (e.g., in terms of application turnaround time, energy consumption, storage footprint, network usage, or financial cost).
This involves allocating resources and configuring the storage system appropriately (e.g., chunk size, stripe width, data placement policy, and replication level).

Thus, the decision space revolves around: (i) provisioning the allocation – setting the total number of nodes, and deciding on node type(s) (or node specification) for cloud environments [49]; (ii) partitioning the allocation – splitting these nodes, or not, between the application and the storage system; and (iii) configuring the storage system parameters – choosing the values for several configuration parameters for the storage system, e.g., chunk size, replication level, cache/prefetching and data placement policies.

Consequently, provisioning the system entails searching a complex multi-dimensional space to determine the set of choices that delivers the user's ideal utility [152, 167] (e.g., fastest turnaround time or cost vs. time balance point). Figure 1.1 shows an example of this scenario for the Basic Local Alignment Search Tool (BLAST) application [19].

Formally,

    D_desired = argmax_D U(D)    (1.1)

where U(D) is a utility function representing the user's goal (e.g., lowest financial cost, fastest turnaround time, or some financial cost vs. performance balance), and D is a tuple representing the set of decisions related to (i) allocation provisioning, (ii) allocation partitioning, and (iii) storage system configuration parameters.

Research Goal. This work addresses the following high-level question: How can one support provisioning and configuration decisions for a distributed storage system with minimal human intervention?

The end goal is to provide mechanisms to support provisioning and configuration decisions for versatile storage systems to meet application-specific requirements, delivering results for the success metrics (e.g., response time, storage footprint, energy consumption) close to the user-defined optimization criteria.
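Under the formalization of Equation 1.1, a brute-force search over a small, enumerable decision space can be sketched as follows. The parameter values, the toy runtime predictor, and the cost-based utility below are all hypothetical; in practice the space is far too large to explore by running the real application at each point, which motivates the fast prediction mechanisms this dissertation proposes.

```python
# Hypothetical decision space D = (total_nodes, app_nodes, chunk_kb);
# the values and the utility are illustrative, not experimental settings.
TOTAL_NODES = [10, 20, 40]
CHUNK_KB = [256, 512, 1024]

def utility(total, app_nodes, chunk_kb, predict_runtime):
    """Example utility: negative cost, with cost = nodes * predicted runtime."""
    return -(total * predict_runtime(total, app_nodes, chunk_kb))

def best_decision(predict_runtime):
    """Argmax over the enumerable decision space (Equation 1.1)."""
    candidates = (
        (total, app, chunk)
        for total in TOTAL_NODES
        for app in range(1, total)   # partition: the rest become storage nodes
        for chunk in CHUNK_KB
    )
    return max(candidates, key=lambda d: utility(*d, predict_runtime))

# A stand-in predictor: runtime falls with more application nodes and more
# storage nodes, and grows mildly with chunk size. The dissertation's
# prediction mechanisms would play this role for a real system.
def toy_predictor(total, app_nodes, chunk_kb):
    storage_nodes = total - app_nodes
    return 1000.0 / app_nodes + 500.0 / storage_nodes + chunk_kb / 256.0

print(best_decision(toy_predictor))  # → (10, 6, 256) under this toy predictor
```

Replacing `toy_predictor` with a mechanism that estimates runtime without executing the application is precisely what makes such a search affordable.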
This research proposes performance-prediction mechanisms in two contexts: (i) workflow applications, the main target of this work, in which the processing flow is structured as several computing tasks communicating via temporary files on a shared storage system; and (ii) data deduplication operations. Specifically, a performance-prediction mechanism in this context consists of a model, a procedure to seed the model (i.e., perform system identification), and an implementation of these two, providing a software tool to predict the behaviour of intermediate storage systems.

[Figure 1.1: Different configurations deliver widely different application turnaround times, and the choice of the optimal configuration is not intuitive. The figure shows BLAST workflow [19] execution time (log-scale on Y-axis) on top of an intermediate storage system with different configurations: different ways of partitioning the nodes allocated to the application among storage and application nodes (X-axis, with storage nodes = 20 - application nodes), and different chunk sizes (256KB, 512KB, 1024KB). As the number of nodes allocated to the application increases (from left to right), the overall response time decreases until an optimal point when the I/O pressure on the storage nodes cannot increase without application performance loss. The chunk size has a high impact (up to 2x speedup) between the default configuration (1024KB) and its optimal value (256KB).]

The mechanisms are useful in various scenarios, including solutions that can automate user choices. Chapter 2 presents the target context in detail. More concretely, the solution described in this dissertation supports the user in answering questions in different domains. For example:

Storage Configuration. What should the storage system configuration be to provide the highest application performance?
Would data deduplication, for an average data similarity of 40%, reduce energy consumption on a given platform?

Resource Allocation. Given a fixed-size cluster, how should the nodes be partitioned between the application and the storage, and what should the storage system configuration be to yield the most cost-efficient scenario (i.e., the lowest cost per unit of performance)?

Provisioning. In a cloud or cluster environment, what is the best allocation size, and how should it be partitioned and configured to best fit my requirements?

Purchasing. What is the impact of purchasing a specific set of machines on the performance of a given application?

Development. What performance should an implementation deliver to be considered efficient enough to stop a debugging effort? What is the performance impact of implementing a specific feature (e.g., a new data placement policy)?

Chapter structure. The rest of this chapter argues why tuning storage system configurations is a difficult and time-consuming task (Section 1.1), describes the requirements for a solution to support configuration (Section 1.2), summarizes the contributions of this research (Section 1.3), and presents the structure of this dissertation (Section 1.4).

1.1 Why Tuning is Hard

A versatile storage system enables applications to harness existing storage resources more effectively by exposing several configuration parameters to facilitate per-application tuning or even per-stored-object (e.g., per-file) tuning [14, 95, 156]. Although versatility can improve an application's performance, it gives the user the task of deciding on the configuration of the storage system.

This task requires that the user go through a process, summarized in Figure 1.2, that involves the following steps:

1. Identify the storage techniques available and their parameters.
2. Choose the objective of the tuning operation – identify key success metrics (e.g., execution time, storage space), and possibly the desired target level (i.e., a specific value) for each metric.

3. Identify the storage techniques to be enabled and, if needed, their parameters to be tuned. Enable these techniques and set their parameters.

4. Request an allocation, wait for the allocation, and run the application.

5. Measure the performance impact of the configuration changes by analyzing the system activity.

6. Repeat steps (2) to (5) until the desired performance level is obtained.

[Figure 1.2: Typical configuration loop – identify parameters, define a target performance, configure the parameters, run the application, and analyze system activity.]

This tuning process is potentially complex, time-consuming, and error-prone for several reasons [20, 22, 27, 52, 54, 91, 121, 152, 156] – especially if the process is not automated.

First, the user may not be familiar with the deployment environment, the requirements of the running applications, or their workload. Such knowledge is essential to identify the most frequent and costly operations in order to understand which storage techniques and parameters have a larger impact on the success metrics.

Second, defining the desired target values for the success metrics is complex. There are situations in which the user does not have a specific target level in mind. In fact, she may be interested in finding the configuration that can provide the fastest response time for a given combination of application and storage platform, rather than a specific value for response time. In more complex scenarios, the user has more than one target metric to consider, and changing the desired level of one metric affects the others. For example, consider a case in which the user wants the shortest possible execution time, but the application produces a large amount of data, and she also has constraints on the storage space used.
Enabling data deduplication can save storage space, but it introduces an additional computational overhead that may lead to a longer response time. Similarly, using more machines tends to decrease execution time while increasing the financial cost or energy consumption of running the application.

Third, the workload and platform characteristics may change. The user may go through the whole effort of configuring the system and have it perfectly tuned for a given platform, application version, and workload. However, the workload or the platform can change in an unpredictable manner. For example, the application needs to process a different dataset, or the platform changes as a result of purchasing new machines, a network upgrade, or even an operating system upgrade. In these cases, the configuration may not attain the desired objective and the user would need to restart the tuning of the system.

Finally, the storage system may have a large configuration space to be explored. In the case of manual tuning, users can spend a long time in the tuning loop to evaluate different configurations and handle the aforementioned difficulties. Even in cases where automation is possible, the process can take too long, leading to a waste of resources [27, 28].

1.2 Solution Requirements

Based on the scenario described in Section 1.1, "Why Tuning is Hard", the target solution to support the configuration of a storage system should meet the following requirements:

Reduce human intervention. The proposed solution should not impose a burdensome effort on its users and should make the user's job easier. Thus, the goal of this research is to design a solution that requires minimal human intervention to enable or disable various storage techniques and to choose their configuration parameters.
Specifically, any system identification (also known as model seeding) procedure needed should be simple: it should not require a storage system redesign or a particular initial design to collect performance measurements, and it should be automated. Additionally, the solution should capture the behaviour of a generic object-based distributed storage design, and using it should not require in-depth knowledge of storage system protocols and architecture.

Allow identification of a satisfactory configuration. The proposed solution should provide accuracy high enough to allow the detection of a configuration that brings the performance of the system close to the user's intention. Note that the goal is not an optimal configuration, as this might be too costly to determine or simply not feasible to achieve, but a configuration that brings the system 'close' to the user-defined success criteria. For instance, if two configurations offer close performance, making a precise decision is less important as long as the proposed solution treats choosing one of them as similar to choosing the other.

Have a low exploration cost. The overhead of using the proposed solution should be low. In other words, the cost required to use the proposed solution, to explore the decision space, and to tune the system should be small when compared to the cost of running the application or its I/O operations several times. For example, assume that the goal is to minimize the application execution time. As the cost of discovering parameters that minimize runtime increases, the proposed solution becomes less desirable. Consequently, the proposed mechanism should scale with: (i) the system size; and (ii) the I/O intensity of the applications.

1.3 Contributions

This study proposes a solution to support the user in making decisions about the provisioning and configuration of intermediate storage systems, while meeting the requirements described in Section 1.2.
Specifically, this dissertation shows that performance prediction mechanisms can leverage the target application's characteristics to accelerate the exploration of the provisioning and configuration space. Additionally, it demonstrates that such mechanisms can rely on a model of a generic distributed storage system architecture and on monitoring information available at the high-level operations of the system, requiring neither changes to the underlying storage system nor specialized monitoring systems for the target context. This research presents the following contributions:

Performance Prediction Mechanisms: Models and Seeding Procedures. This research proposes performance prediction mechanisms that rely on some characteristics of workflow and checkpointing applications to reduce the model's complexity and enable a fast exploration of the decision space for traditional performance metrics (e.g., application turnaround time). As a result, this model relies on a system-identification procedure to provide its parameters (i.e., seed the model) that has two key features: (i) it relies on application-level operations, instead of detailed intrusive measurements of the internal execution path, and (ii) it requires only a small deployment of the system, despite the goal of predicting various scenarios of different scales.

Additionally, this dissertation demonstrates the effectiveness of the proposed solution in a broad range of scenarios, including several different hardware platforms, scales, and possible user decisions. It also identifies a solution space where the proposed model, despite being based on a generic object-storage system architecture and a coarse-grained view of the internal behaviour of system components, can be effective in predicting the application's behaviour for the target context.

Energy Trade-offs Assessment and Extension of the Prediction Mechanisms.
This dissertation presents an assessment of energy consumption, demonstrating how the differences in power proportionality of hardware platforms impact the relation between energy consumption and response time as part of the optimization criteria. Additionally, this study demonstrates that the proposed prediction mechanisms – the model and the system identification procedure – can be augmented to estimate energy consumption while remaining simple, lightweight, and non-intrusive.

Storage System Development Support. This dissertation presents a successful use-case of a performance prediction mechanism, beyond the original design goal of supporting user choices, where it is incorporated as part of the development process of distributed storage systems. To this end, this research presents an experience of using its prediction mechanism to support the development process by applying it during the development of a distributed storage system at NetSysLab.

Prediction Tool and Storage System Prototypes. This work offers a prototype tool that serves to verify the feasibility of the proposed solution, to validate it, and to evaluate it according to the requirements described in Section 1.2, using MosaStore – a distributed storage system that the NetSysLab group has developed [14, 159]. As a result of this research, MosaStore's usability, performance, and stability have improved.

1.4 Dissertation Outline

This dissertation is organized as follows:

Chapter 2, "Research Platform and Background", presents the target execution environment, applications, and success metrics of this research in detail. Chapter 2 briefly describes the MosaStore storage system, which is used as a research platform to verify the feasibility of, validate, and evaluate this work. MosaStore is also a system in which I was actively involved; this involvement led to several improvements [56, 60] as a consequence of the work described in this dissertation.
Chapter 2 also presents an overview and the evolution of past work that aimed to support the configuration of computing systems, with a focus on storage systems. Related work is also presented in more detail in each of the following three chapters.

Chapter 3, "Predicting Performance for I/O Intensive Workflow Applications", addresses the following questions: How can one leverage the characteristics of I/O intensive workflow applications to build a prediction mechanism for traditional performance metrics (e.g., time-to-solution, network usage)?, and Which extensions are needed to capture energy consumption behaviour in addition to traditional performance metrics? Chapter 3 presents the focus of this research: a low-cost performance predictor that estimates the total execution time of I/O intensive workflow applications in a given setup based on a simple seeding procedure. This chapter presents an evaluation of the prediction mechanism for a number of combinations of storage configuration parameters, execution platforms, synthetic benchmarks, and real applications. Chapter 3 also presents an extension and evaluation of the performance predictor for estimating the energy consumption of I/O intensive workflow applications, from an effort led by Hao Yang [171].

Chapter 4, "Using a Performance Predictor to Support Storage System Development", sheds light on the following question: Which, if any, are the benefits that this proposed performance predictor brings to the software development process of a storage system?
To this end, Chapter 4 discusses the experience of using the prediction mechanism beyond its original design goal: the use of the performance predictor to better understand and debug MosaStore, the distributed storage system that NetSysLab has developed.

Chapter 5, "Automatically Enabling Data Deduplication", focuses on a preliminary experience in this research that explored data deduplication in the context of repetitive write operations (e.g., checkpointing applications) to enable a simple predictor for an online configuration, targeting the following question: What are the challenges of an automated solution for the online configuration of an intermediate storage system? Chapter 5 also addresses the question: Is energy consumption subject to different trade-offs than response time, or are optimizing for energy consumption and response time coupled goals? It does so by exploring how online monitoring can enable online adaptation, and presents this research's initial experience of applying a power-profile-based approach to estimate the energy consumption of a data deduplication technique – used as the basis for the energy extension presented in Chapter 3.

Chapter 6, "Concluding Remarks", summarizes the results of this research, discusses its limitations, the challenges, and the lessons learned, and suggests directions for future work.

Chapter 2
Research Platform and Background

This chapter contextualizes the research presented in this dissertation. First, Section 2.1 presents the MosaStore storage system prototype used in this research. This prototype is the result of research in which I was actively involved. Then, Section 2.2 describes the target deployment scenario, applications, and success metrics on which this work focuses.
Finally, Section 2.3 presents an overview of past research efforts towards supporting the configuration of computer systems, with a focus on storage systems, and highlights the relationship between those past research efforts and this research.

2.1 Research Platform: MosaStore Storage System

This section describes MosaStore (http://mosastore.net) [14], an experimental distributed storage system that can be configured to enable workload-specific optimizations. Its design leverages unused storage space from network-connected machines, and offers this space as a high-performance, yet low-cost, shared storage system across these same connected machines (see Figure 2.1).

[Figure 2.1: MosaStore deployment. Application tasks on compute nodes 1..n access, through a POSIX API, a shared storage system that aggregates the nodes' local storage. MosaStore aggregates resources from network-connected machines to provide a high-performance, yet low-cost, shared storage system across these same connected machines.]

As is common in computer systems research, part of a research project's contribution to the field is a working prototype. In this case, the MosaStore storage system is a prototype based on a set of research ideas initially proposed by Al-Kiswany et al. [14], and, despite being a prototype, it has been used in several different projects from multiple institutions [11, 14, 17, 52, 54, 60, 76, 115, 159, 171]. I was actively involved in the research [60, 159] that led MosaStore to enable optimization opportunities for workflow applications (described in Section 2.2.3 and the focus of Chapter 3), including an extension of an opportunity study that I led.

The research presented in this dissertation is part of a larger project related to storage systems that includes MosaStore. The goal is to support the configuration of a versatile storage system specifically for the environment described in Section 2.2.
MosaStore is the versatile storage system used as a research platform for the evaluation of the research ideas presented in this dissertation.

(The opportunity study mentioned above is "Case for Workflow-Aware Storage: An Opportunity Study using MosaStore" [60]: Lauro Beltrão Costa, Hao Yang, Emalayan Vairavanathan, Abmar Barros, Kethan Maheshwari, Gilles Fedak, Daniel Katz, Michael Wilde, Matei Ripeanu, and Samer Al-Kiswany. Journal of Grid Computing. Accepted for publication in June 2014.)

The rest of this section is organized as follows: it describes the architecture of MosaStore (Section 2.1.1), its configuration versatility (Section 2.1.2), and details of its implementation (Section 2.1.3), and then it briefly compares this system with other related systems and approaches (Section 2.1.4). More details about MosaStore can be found in NetSysLab's and my past work [14, 17, 60, 143, 159].

2.1.1 MosaStore Architecture

MosaStore employs a widely-adopted object-based storage system architecture, similar to that adopted by the Google File System (GFS) [82], the Parallel Virtual File System (PVFS) [85], and UrsaMinor [10]. To speed up data storage and retrieval, the architecture relies on data striping [88]: files are split into fixed-size chunks stored across several storage nodes. This architecture includes three main components: a centralized metadata manager, storage nodes, and a client-side System Access Interface (SAI) (see Figure 2.2). The list below describes the role of each component:

The metadata manager keeps track of the metadata for the entire system: the status of the storage nodes, the mapping of file chunks to storage nodes, access control information, and data object attributes.
The metadata service and the requests to access data chunks are completely decoupled to provide high scalability: once the client obtains the metadata from the manager, all subsequent access to the data itself is performed directly between the client and the storage nodes, decreasing the chances of the metadata manager becoming a bottleneck. A New Database Manager (NDBM)-compatible library stores the metadata in the manager node. Additionally, the metadata manager node is responsible for running housekeeping tasks such as garbage collection and storage node failure detection.

The storage nodes allocate part of their local storage space to the shared storage system that MosaStore builds. The storage nodes serve clients' file-chunk store/retrieve requests, and also interact with the manager by publishing their status using a soft-state registration process. Finally, the storage nodes also participate in the replication process and in the garbage collection mechanism.

The client-side SAI uses a Virtual Filesystem (VFS) via the Filesystem in Userspace (FUSE) [2] kernel module to implement a user-level filesystem that provides a Portable Operating System Interface (POSIX) API to MosaStore. Specifically, FUSE provides a set of callback functions that should be implemented by the underlying storage system. For example, consider a write operation called by the application. FUSE receives the write call and its parameters, and forwards the call to the SAI implementation. The SAI implementation contacts the metadata manager to reserve storage space and obtain a list of storage nodes to be used; it then stripes the data into chunks and connects to the storage nodes with space allocated for that operation. Once all the chunks are stored, the callback implementation returns to FUSE, which in turn returns to the application.
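To make the write path concrete, the following toy Python sketch models the striping and allocation steps just described. It is a simplification under stated assumptions: MosaStore itself is implemented in C, the chunk size is shrunk for illustration, and the `Manager`/`StorageNode` classes are stand-ins for the real components and their network protocol.

```python
CHUNK_SIZE = 4  # toy chunk size in bytes; real deployments use far larger chunks

class StorageNode:
    """Toy storage node: keeps its chunks in memory."""
    def __init__(self, name):
        self.name = name
        self.chunks = {}

    def store(self, chunk_id, payload):
        self.chunks[chunk_id] = payload

class Manager:
    """Toy metadata manager: allocates nodes and records the chunk mapping."""
    def __init__(self, nodes):
        self.nodes = nodes
        self.metadata = {}  # filename -> list of node names, one per chunk

    def allocate(self, filename, n_chunks):
        # Default round-robin placement across the registered storage nodes.
        chosen = [self.nodes[i % len(self.nodes)] for i in range(n_chunks)]
        self.metadata[filename] = [n.name for n in chosen]
        return chosen

def sai_write(filename, data, manager):
    # FUSE forwards the application's write() here: the SAI asks the manager
    # for storage nodes, stripes the data into fixed-size chunks, ships each
    # chunk to its node, and then returns (to FUSE, then to the application).
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    nodes = manager.allocate(filename, len(chunks))
    for i, (chunk, node) in enumerate(zip(chunks, nodes)):
        node.store((filename, i), chunk)
    return manager.metadata[filename]
```

With three nodes and a 10-byte write, `sai_write("f1", b"0123456789", mgr)` produces three chunks placed round-robin on nodes s0, s1, s2 – mirroring the decoupling above: the manager is consulted once, and the data itself flows directly to the storage nodes.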
Additionally, the SAI implementation is the part of the system responsible for providing the client-side optimizations: caching, prefetching, and data deduplication. Client SAI implementations also support configurable data access policies:

• Data placement. The default data placement generally adopted in this type of architecture is round-robin: when a new file is created on a stripe width of n nodes, the file's chunks are placed in a round-robin fashion across those nodes. Additionally, application-driven data-placement policies, which optimize for specific application access patterns, have seen increasing adoption [159, 174]. For instance, both the local and the co-locate data placement policies can optimize workflow applications' data access patterns (detailed in Section 2.2.3).

• Replication. Data replication is often used to improve reliability or access performance. However, while a higher replication level reduces contention on the node storing a popular file, it increases the file write time and the storage space consumption.

[Figure 2.2: MosaStore components as described by Al-Kiswany et al. [17]]

2.1.2 MosaStore Configuration

As explained in Chapter 1, distributed storage systems offer a wide set of storage techniques and trade-offs in terms of response time, throughput, energy, or cost when compared to centralized solutions. In this set-up, each application may obtain peak performance at a different configuration point as a consequence of different I/O access patterns or target metrics [10, 49, 54, 95, 152, 156].

MosaStore offers storage configuration versatility since it provides storage techniques that can be turned on or off, or whose parameters can be set at deployment time or at runtime, allowing applications to improve their performance according to the workload.
The configuration at deployment time is given via configuration files, as typically happens in many systems. The runtime settings are provided according to a cross-layer communication solution proposed by Santos-Neto et al. [143]. My collaborators and I [60, 159] have evaluated the feasibility and assessed the benefits of this approach. Al-Kiswany et al. [17] present a full design and implementation of this approach for MosaStore.

Enabling Per-File Configuration Versatility

MosaStore employs a hint-based mechanism to provide a specific optimization per file by applying different storage techniques according to the hint. The clients provide these hints via POSIX extended attributes, tagging the files with key-value pairs. For example, the application can inform the storage system – that is, give a hint – that a set of files will be consumed by the same client and, therefore, should be placed on the machine that will consume those files.

MosaStore has a flexible design that supports exploration and facilitates adding new custom metadata and the corresponding functionality. Because of the storage system's distribution, enabling a hint-based mechanism requires support in each of the main components: the manager, the client SAI, and the storage nodes. The metadata manager is responsible for keeping the extended attributes; any client can set or retrieve the values of the extended attributes. Additionally, the actual functionality associated with a specific key-value pair is spread among the components and depends on the functionality, which may include the storage nodes. For example, data placement resides on the manager, the client SAI implements caching and prefetching, and the storage nodes handle replication.

The overall flow of storage operations in MosaStore works as follows: the client SAI initiates file-related operations (e.g., create file), obtaining the file's metadata.
The first time an application gets the metadata of a file (e.g., when opening a file), it caches all the metadata information – including the application hints in the form of the file's extended attributes. The SAI tags all subsequent inter-component communication related to that file with the file's extended attributes, which can trigger the corresponding callbacks at each component.

To enable the optimizations in the storage system, each component has a dispatcher following a design similar to the strategy design pattern [73], allowing a set of implementations to be selected on-the-fly at runtime (see Figure 2.3). Whenever a request arrives, the dispatcher checks the tags of that request and triggers the associated functionality (strategy) by forwarding the request to the specific optimization module(s) associated with the hint. If no tags are provided, or there is no specific functionality associated with the given tags, the component uses a default implementation.

[Figure 2.3: Integration of cross-layer communication with the workflow execution system as described by Al-Kiswany et al. [17]: (i) the solid lines originating from the SAI represent the path followed by a client chunk allocation request: the request is processed by a pattern-specific data placement 'DP' module based on the corresponding file tags/hints, (ii) the solid lines going to storage nodes represent the data path as data is produced by the clients, and (iii) the dashed lines represent the path of a request to retrieve file location information.]

To add new functionality to the system (e.g., a new optimization for a specific operation), the developer needs to decide on the key-value pair that will specify the application hint and trigger the optimization. Then, the developer implements a callback function on the needed system components. Every callback function can access the storage component's internal information, the metadata manager, and stored blocks through an API. Al-Kiswany et al.
[17] discuss the architecture's extensibility in more detail.

The tagging mechanism used to give hints about the application is a two-way cross-layer communication mechanism: (i) it allows the application to pass information to the storage system by tagging files with key-value pairs, and (ii) it allows applications – including a scheduler – to retrieve the tags (the GetAttrib module in Figure 2.3). Section 2.2.3 describes an integration of a workflow runtime environment with the MosaStore tagging mechanism to improve workflow performance via data-aware scheduling [60, 142].

2.1.3 Current Implementation

The MosaStore implementation follows the extensible storage system design principles described in the previous sections. The prototype is mainly implemented in C, and has approximately 10,000 lines of code. It has been used in several different projects from several institutions [11, 14, 17, 52, 54, 60, 76, 115, 159, 171]. Although to date more than fifteen developers have been involved in different versions of its implementation, most of the development was carried out by Samer Al-Kiswany, Emalayan Vairavanathan, Hao Yang, and myself. MosaStore is open-source, has an Apache Subversion (SVN) repository available online, and has been using code reviews, which are also available online.

The currently available configuration options are described in Table 2.1. As noted in Section 2.1.2, however, adding new functionality and parameters to optimize a specific workload should be a simple process. Currently, the MosaStore implementation has two main limitations. First, the data placement tags are effective at file creation only and, thus, changing the data placement tag for existing files does not have any impact.
Second, it utilizes a centralized metadata manager, which can be a potential bottleneck at large scale, though such a bottleneck has not been observed [17].

Configuration Parameter            | Configuration Space                                    | Per File | System Wide | Per Node
-----------------------------------|--------------------------------------------------------|----------|-------------|---------
Number of Nodes                    | Integer, up to the number of available nodes           |          | X           |
Client SAI and Storage Collocation | Boolean                                                |          | X           | X
Chunk Size                         | Integer, typically a multiple of 64KB                  | X        | X           |
Cache Size                         | Integer, number of chunks                              | X        | X           | X
Data deduplication                 | Two values {on, off}; can be combined with chunk size  | X        | X           |
Data placement policy              | Three options {local, striped, collocated}             | X        | X           |
Prefetching                        | Integer, number of chunks                              | X        | X           | X
Replication Level                  | Integer, typically up to 4                             | X        | X           |
Replication Policy                 | Two values {chain, parallel}                           |          | X           |
Stripe width                       | Integer, up to the number of storage nodes             | X        | X           |

Table 2.1: MosaStore's currently supported configuration. Note that some parameters may be affected by others. For example, a different data placement policy affects how much a larger cache improves performance.

2.1.4 Related Systems and Approaches

In storage systems, the term versatility is used to describe the ability to provide techniques that can be configured at either deployment time or runtime, which allows applications to improve their performance. As with MosaStore [14], other systems provide some degree of versatility.

Other systems, however, have in the past proposed a smaller set of configuration parameters that can be tuned for specific applications. For example, Huber et al. [95] offer an API that allows the application to choose
Similarly, ADIOS [113]allows the selection of an optimal set of I/O routines for a given platform.In fact, several systems proposed in the storage area in the past two decadeshave offered similar approaches and have been partially incorporated intosystems in production (e.g., pNFS [9], PVFS [85], GPFS [144], BAD-FS [30]and Lustre [145]); this corroborates the importance of this research.Perhaps the closest system to MosaStore is UrsaMinor [10], which is a222.2. TARGET EXECUTION ENVIRONMENT: PLATFORM, METRICS, AND APPLICATIONSsimilar distributed storage system that also applies an object system archi-tecture. UrsaMinor was the first system to propose the term versatility forreferring to enabling specific optimization storage techniques for differentapplications. Examples of techniques provided include erasure coding,consistency models, and data placement policies. Compared to MosaStore,the main difference is that the optimization modules in UrsaMinor are nottriggered via a POSIX API, and it has been proposed as a persistent storage,rather than an intermediate storage system. For both systems - as well asthose listed earlier - the user is responsible for setting the configurationparameters.In this context, MosaStore’s approach has three main advantages overpast work. The first two are interrelated: (i) application-agnostic mechanism- it requires annotating files with arbitrary <key, value> pairs via POSIXextended attributes, (ii) incremental - enables evolving applications andstorage-systems independently while maintaining the current POSIX in-terface. The advantage is that MosaStore provides an extensible storagesystem architecture that can easily accommodate new application-specificoptimizations.These advantages of MosaStore, a wide set of configurable storagetechniques (i.e., a high degree of versatility), and its focus on interme-diate storage make it an excellent platform for the research goal of thisdissertation. 
Additionally, these same advantages and the success of the performance improvements on target applications [60] have led us to extend the research project, including the proposal of a software-defined storage system approach [18].

2.2 Target Execution Environment: Platform, Metrics, and Applications

There is a wide range of applications that can leverage the optimization techniques provided by distributed storage systems on cluster platforms, and a range of metrics that can be used as optimization criteria [10, 14, 159]. This research focuses on a subset of applications and metrics to drive the effort of supporting the configuration of versatile storage systems. This section describes the target execution platform (Section 2.2.1), success metrics (Section 2.2.2), and applications (Section 2.2.3), discussing why each is relevant.

2.2.1 Execution Platform and Intermediate Storage Systems

This research focuses on versatile storage systems in the context of clusters of computers (from small clusters in university labs to large parallel supercomputers) designed to deliver high-performance computing for batch applications that run on a dedicated subset of allocated machines. These computing platforms have storage systems composed of a global persistent file system (e.g., the Network File System (NFS) in small clusters or the General Parallel File System (GPFS) [144] in large supercomputers), which is hosted on just a few I/O servers and accessible from all nodes of the cluster. They also have a local file system per compute node, which is accessed directly by tasks running on that compute node and allows better I/O performance than the global file system.

To avoid accessing the platform's backend storage system (e.g., NFS or GPFS), recent proposals [14, 169] advocate using some of the nodes allocated to the application to deploy an intermediate storage system.
That is, they aggregate (some of) the resources of an application's allocation to provide a shared storage system dedicated to (and co-deployed with) the application [14, 30, 39, 108, 159, 174], to be used as a temporary scratch space. The trade-off is the loss of computing resources, since some compute nodes are used as a staging or temporary scratch area, which may pay off given the performance improvement. As a result, this intermediate storage system has a lifetime coupled to the application's lifetime, can be optimized for the application's usage patterns, and is accessible from all compute nodes running the same application with better performance than the backend storage. The approach described here as an intermediate storage system is also known as burst buffers [31, 112].

This research uses MosaStore [14] as an application-optimized storage system intended to be configured and deployed together with the distributed application, making MosaStore's lifetime and performance requirements coupled with the application lifetime.

2.2.2 Success Metrics

The main metrics that this study addresses as the optimization criteria are a primary concern for many systems: traditional performance metrics. Specifically, the target success metrics are response time and data throughput for storage system operations, and turn-around time for workflow applications. This research also enables better reasoning about network bandwidth use and resource allocation cost. Another metric is the storage footprint, since it is a typical constraint for users and is involved in the trade-offs of some optimization techniques (e.g., data compression and replication).

This study also targets energy consumption as part of the optimization criteria.
Taking energy consumption into account when deciding on system design and configuration has received increasing attention, since energy cost is a growing concern in current supercomputers and data centers [25, 43, 71]. While the impact of traditional metrics and the existence of their trade-offs are well understood, previous studies leave a gap in energy consumption analysis. Energy consumption therefore offers a new challenge in terms of modeling and understanding such trade-offs to determine the balance between a system's performance and its energy bill [54, 146].

2.2.3 Applications

This research focuses on supporting the storage configuration of high-performance computing applications. Specifically, the target applications are workflow-based scientific applications and checkpointing applications, chosen based on their popularity, on their characteristics, and on the focus of the NetSysLab and MosaStore research [11, 12, 14, 52, 53, 76, 143].

The rest of this section briefly describes the domain of the target applications and how these applications can leverage an intermediate storage system.

Workflow Applications

Assembling workflow applications by putting together standalone binaries has become a popular approach to support large-scale science applications (e.g., BLAST [19], modFTDock [124], or Montage [32]). Many scientific applications use the workflow paradigm [6, 33, 37, 38, 147, 169], in which the processing flow is structured as several computing tasks spawned from these binaries and communicating via temporary files stored on a shared storage system.

This approach is also known as many-task computing [136], and the execution dependencies of the tasks in the workflow form a Directed Acyclic Graph (DAG).
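The way such a DAG emerges from file-based communication can be sketched as follows: task B depends on task A whenever B reads a file that A writes. The task and file names below are hypothetical; real engines such as Swift derive the same structure from the workflow description.

```python
# Infer a task-dependency DAG from each task's declared input/output files.
tasks = {
    "split":  {"reads": ["input.txt"],        "writes": ["p1.tmp", "p2.tmp"]},
    "work1":  {"reads": ["p1.tmp"],           "writes": ["r1.tmp"]},
    "work2":  {"reads": ["p2.tmp"],           "writes": ["r2.tmp"]},
    "reduce": {"reads": ["r1.tmp", "r2.tmp"], "writes": ["result.txt"]},
}

def build_dag(tasks):
    # Map each file to the task that produces it, then derive dependencies:
    # a task depends on the producers of every file it reads.
    producer = {f: t for t, io in tasks.items() for f in io["writes"]}
    return {t: sorted({producer[f] for f in io["reads"] if f in producer})
            for t, io in tasks.items()}

dag = build_dag(tasks)
# dag["reduce"] == ["work1", "work2"]; "split" has no dependencies
```

Files read but never written by any task (here, input.txt) are workflow inputs staged in from the backend storage; files never read downstream (result.txt) are the final outputs staged out.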
In this setup, the user allocates a set of dedicated machines and runs a workflow runtime engine - basically a scheduler that builds and manages a task-dependency graph based on the tasks' input/output files (e.g., Pegasus [67] and SWIFT [166]) - to launch each task's execution.

In the context of many-task computing, a shared storage system abstraction brings key advantages: simplicity, support for legacy applications, and support for fault tolerance. The workflow development, deployment, and debugging processes are simpler, since the workflow application can be developed on a workstation and then deployed on a cluster without changing the environment. New stages or binaries can be easily integrated into the workflow, since communication via files using the POSIX API keeps the tasks loosely coupled. Finally, task faults can be tolerated by simply keeping a task's input files in the shared storage and launching a new execution of the task, potentially on a different machine.

The performance drawback of using the backend storage system as the shared file system is addressed by the intermediate storage system approach described in Section 2.2.1. Workflow applications, however, still offer opportunities for performance improvement that vary depending on the workload and could benefit from a versatile configuration [60, 159], especially in a scenario where cross-layer communication, like that described in Section 2.1.2, is available.

Cross-layer communication enables the exchange of information between the storage system and the workflow execution engine. In this context, a workflow execution engine can, for example, guide the data placement of a set of files, and use data-location information to enable a data-aware scheduling approach. Such optimization is not possible in traditional file systems.

Workflow Data Access Patterns.
A typical task in a workflow application progresses as follows: (i) each node brings the input data from the intermediate storage to memory (likely through multiple I/O operations), then (ii) the processor loads the data from memory and processes it, and finally, (iii) the output is pushed back to the intermediate storage. Additionally, all files generated during the workflow execution are temporary, except the files containing the final results of the application.

The structure of the tasks' data-dependency graph creates a set of common data access patterns. Table 2.2 describes the main common workflow data-access patterns, lists the storage configuration parameters that affect each pattern's performance, and explains how the workflow execution engine can be used to enable optimizations. These are among the most used patterns uncovered by studying over 20 scientific workflow applications by Wozniak and Wilde [169], Shibata et al. [147], and Bharathi et al. [33]. More importantly, the typical structure of a workflow application is a combination of these patterns (see Section 3.3.4 for an example).

My collaborators and I [60] have presented a more detailed description of these patterns, as well as how a workflow-aware storage system can leverage them, and have evaluated their performance when applying the optimizations described in Table 2.2.

Pattern: Pipeline
Description: A set of compute tasks are chained in a sequence such that the output of a task is the input of the next task in the chain.
Details and optimizations: Node-local data placement (if possible), caching, and data location-aware scheduling. An optimized system stores an intermediate output file on the storage node on the same machine that executes the task that produced the file, if space is available, to increase access locality and efficiently use local caches. Ideally, the location of the data is exposed to the workflow scheduler, so that the task that consumes this data is scheduled on the same node.

Pattern: Broadcast
Description: A single file is used by multiple tasks.
Details and optimizations: Optimized replication taking into account the data size, the fan-out, and the network topology. An optimized storage system can create enough replicas of the shared file to eliminate the possibility that the node(s) storing the file become overloaded.

Pattern: Reduce
Description: A single task uses input files produced by multiple tasks from a previous stage.
Details and optimizations: Reduce-aware data placement: co-placement (also known as co-location) of all output files from a stage on a single node, and data location-aware scheduling. An optimized storage system can place all these input files on one node and expose their location. Then, the scheduler places the task that takes these files as input on the same node, increasing data-access locality.

Table 2.2: Common workflow patterns as described by Al-Kiswany et al. [17] and Costa et al. [60]. Circles represent computing tasks, outgoing arrows indicate that data is written to a temporary file, and an incoming arrow indicates that data is consumed from a temporary file.

Checkpointing Applications

Checkpointing is the process of persistently storing snapshots of an application's state. These snapshots (or checkpoint images) may be used to restore the application's state in case of a failure, or as an aid to debugging. Checkpointing is widely adopted by long-running applications in the field of high-performance computing [12].

Depending on the checkpointing technique used, the application characteristics, and the time interval between checkpoint operations, checkpointing may result in large amounts of data being written to the storage system in bursts, and successive checkpoint images may have a high degree of similarity.
A previous study by NetSysLab has collected and analyzed [12] checkpoint images produced using VM-supported checkpointing (using Xen), process-level checkpointing (using the BLCR checkpointing library [86]), and application-based checkpointing. Depending on the checkpointing technique, the time interval between checkpoints, and the similarity-detection technique used, the detected similarity between consecutive files varied from no similarity to 82% similarity (for the BLAST bioinformatics application checkpointed using BLCR at 5-minute intervals).

These bursty writes can take advantage of the intermediate storage system (also known as a burst buffer) to improve performance. Moreover, the data similarity can leverage the data deduplication techniques of the file system to reduce the storage footprint and speed up writes at the cost of extra compute cycles [11, 12, 125, 135, 139]. Whether or not to enable data deduplication is a user's decision that depends on her optimization criteria, and it is one of the subjects of study in this research (details in Section 2.2.4 and in Chapter 5).

2.2.4 Integrating Applications and the Storage System: Cross-Layer Communication

As explained in Section 2.1.2, an application can tag a file to express a hint of its data access pattern and, thus, have the storage system optimized for it. The integration between application and storage system that follows varies in complexity.

Workflow Applications

For workflow applications, as described in this section, integrating the application with the storage system may involve more interactions than enabling a single optimization.

Currently, the MosaStore prototype is integrated with Swift - a popular language and workflow runtime system [166] - and pyFlow - a similar, yet much simpler, workflow runtime that NetSysLab has developed.
Such integration allows us to demonstrate the end-to-end benefits of MosaStore and the cross-layer communication approach while not requiring any modification to the application tasks. Specifically, the integration enables two main functionalities:

Location-aware scheduling. Swift and pyFlow did not support location-aware scheduling, since most shared storage systems do not expose data location [159]. The modification consists of querying the metadata manager for the location of a given task's input files, then attempting to schedule the task on the node storing those files.

Smart data placement. The runtime engine adds extra calls to explicitly indicate the data access hints (see Table 2.3). The storage system then uses this information to change the data placement policy. The solution described in Chapter 3 can also be used to evaluate different data placement policies and, when it is worthwhile, pass these hints along as part of the workflow. Alternatively, the runtime engine could receive the hints as input, or build a task-dependency graph [162] that could be used to insert these hints, in an approach that does not necessarily improve performance [17, 57].

Al-Kiswany et al. [17] present a detailed description of the integration of MosaStore with workflow applications, including its limitations and a performance evaluation.
Action | API Call | Description
Broadcast hint / Replication | set("Replication", level) | Replicate the chunks of the file level times.
Enable/Disable deduplication | set("deduplication", boolean) | Enable or disable deduplication for a specific file.
Get Location | get("location", null) | Retrieve the location information of a specific file.
Manage Cache Size Per File | set("CacheSize", size) | Suggest a cache size for a specific file.
Manage Chunk Size Per File | set("ChunkSize", size) | Suggest a chunk size for a specific file.
Pipeline hint / Locality | set("DP", "local") | Indicate preference to allocate the file chunks on the local storage node.
Reduce hint / Collocation | set("DP", "collocation | groupName") | Indicate preference to allocate chunks for all files within the same groupName on one node.
Scatter hint | set("DP", "scatter | size") | Place every group of contiguous size chunks on a storage node.

Table 2.3: Examples of MosaStore API calls used to integrate the applications and the storage system. Most of the calls represent interactions between the workflow runtime schedulers and the storage system. Note that these calls occur on a per-file basis and have an additional parameter specifying the target file, which is omitted to simplify the presentation. "DP" means data placement.

Figure 2.4 depicts the architecture of a Workflow Optimized Storage System (WOSS) in this context.

Checkpointing Applications and Data Deduplication

For checkpointing applications, the integration is as simple as using a tag to instruct the storage system to enable an optimization: in this case, data deduplication. Data deduplication is a method to detect and eliminate similarities in the data, which affects the time of storage operations, energy consumption, and storage usage. Enabling data deduplication, or not, is a user's decision, which depends on her optimization criteria. Supporting the user in this decision is a subject of study in this research (see Chapter 5).
Figure 2.4: A WOSS, as described by Al-Kiswany et al. [17], aggregates the storage space of the compute nodes and is used as an intermediate file system. Input/output data is staged in/out from the backend storage. The workflow scheduler queries WOSS for data location to perform location-aware scheduling. The scheduler submits tasks to individual compute nodes and includes hints that indicate future data usage patterns and are used by WOSS for data placement decisions.

The rest of this section summarizes how data deduplication works, and how its integration with the storage system affects the target metrics of this dissertation.

Briefly, deduplication works as follows: when a new file is stored, the storage system divides the file into chunks, computes an identifier for each chunk based on its content by hashing the data, compares the newly obtained identifiers with the identifiers of the chunks already stored, and persistently stores only the new chunks (i.e., those with different identifiers). Identical chunks are, hence, stored only once, saving storage space, reducing the I/O load, and also reducing the network load.

Chunking. The process of detecting data similarity (i.e., data redundancy) involves breaking the data into chunks and giving a content-based identifier to each chunk. The content-based chunk identifier is typically based on a strong hash function, e.g., the Message Digest Algorithm (MD5) or the Secure Hash Algorithm (SHA), to avoid collisions, and this can be computationally costly.
The deduplication process then uses these identifiers to verify whether a matching chunk is already stored in the system.

The chunking strategy plays an important role in deduplication. Chunking is the process of breaking the data into chunks, and it can be performed according to two main approaches: (i) fixed-size chunking, where every chunk has the same pre-specified size, and (ii) content-based boundaries, where specific patterns in the data determine a chunk boundary.

The first approach, fixed-size chunking, creates a new chunk for every chunk-size amount of bytes. It has the advantage of being computationally cheap, since the chunking scheme does not need to analyze the data contents. It is, however, not robust to data insertions or deletions that are not a multiple of the chunk size, since a single byte insertion can change the detectable similarity of the data. Content-based boundaries [105], on the other hand, are more robust to data insertions and deletions, since a chunk boundary is not bound to a specific size, but to specific data patterns. The disadvantage of content-based boundaries is the high cost of analyzing all the data content in order to define the chunks.

Inline data deduplication. When a system performs inline data deduplication, the storage system deduplicates the data while the application is writing it to the storage system. This can reduce the number of bytes sent over the network and written to storage (i.e., reduce the I/O operations). Note, however, that it does not necessarily render faster write operations: creating the identifiers and chunking the data is expensive, so it is not clear that the performance cost pays off [12, 52]. Similarly, it is not clear when energy consumption can be reduced [54].
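The store path described above can be sketched in a few lines. This is a minimal illustration of fixed-size chunking with content-based identifiers, not MosaStore's implementation; the tiny chunk size and the dictionary standing in for the chunk store are illustration-only choices.

```python
import hashlib

CHUNK_SIZE = 4  # tiny, for illustration; real systems use e.g. 64KB and up

def dedup_store(data, store):
    """Split data into fixed-size chunks, identify each by a strong hash,
    and persist only chunks not already in `store` (chunk id -> bytes).
    Returns the file recipe (list of chunk ids) and the count of new chunks."""
    recipe, new = [], 0
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        cid = hashlib.sha1(chunk).hexdigest()  # content-based identifier
        if cid not in store:
            store[cid] = chunk
            new += 1
        recipe.append(cid)
    return recipe, new

store = {}
r1, new1 = dedup_store(b"AAAABBBBCCCC", store)  # 3 chunks, all new
r2, new2 = dedup_store(b"AAAABBBBDDDD", store)  # only the last chunk is new
```

The second file shares its first two chunks with the first, so only one new chunk is persisted. The fragility of fixed-size chunking is also visible here: inserting a single byte at the front of the data would shift every chunk boundary, and none of the previously stored chunks would match.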
2.3 Related Work on Supporting Provisioning and Configuration

Improving the efficiency of computing systems via tuning their configuration is a common goal of users, developers, and system administrators. Consequently, overcoming the difficulties of manually tuning computer systems has been the focus of many research efforts over many years, and has become more important as systems increase in scale, complexity, and maintenance costs [5, 35, 36, 43, 63, 64, 68, 137, 151, 152, 156, 158].

This section describes different approaches employed in past work to facilitate the process of tuning the performance of computer systems, and compares them with the proposed solution described in this dissertation. Chapters 3, 4 and 5 discuss related work in more detail, in connection with the material presented in each chapter.

2.3.1 Measuring System Activity

The most basic way to support the user in the task of tuning a system is to measure system activity for later analysis. In this approach, one can use code instrumentation or simple scripts to collect system activity. Some systems already have instrumentation in the code and provide more sophisticated monitoring tools (e.g., Stardust [158]), allowing the user to specify the type or granularity of the activity to be measured. This approach, however, can often lead to a burdensome configuration process in which the user is in charge of verifying and understanding the impact of configuration changes via new measurements, and it does not scale well with system size, complexity, and load - requiring specialized tools to analyze the data and support the user.

2.3.2 Predicting Performance

Knowing the level of activity in the system helps to understand the workload characteristics and the impact of configuration changes (Section 2.3.1). The user, however, still has to change the parameters and verify the impact of such changes, as described in the loop in Figure 1.2. To avoid these drawbacks, several approaches to estimating the performance of a computing system have been proposed.

Modeling a System's Behaviour

One approach still uses direct measurements, but reduces the actual execution cost by using just the application's kernel (a small portion of the system) and/or a representative part of the workload to infer the impact of configuration changes on the performance of the complete application and workload. Although this approach can be effective for small systems, it becomes less effective, and harder to handle, for larger and more complex systems.

As solutions move away from solely direct measurements, they rely on models of the system that capture the most important behaviour while abstracting the details of the system's internal workings and of the target workload. The two typical approaches to modeling the behaviour of computing systems in order to predict performance based on some previous knowledge of the system's internals, also known as white-box approaches, are analytical models and simulations. A third approach, machine learning, relies on direct measurements to build a model of the system's behaviour; it is also known as a black-box approach.

Analytical models are traditional mechanisms for predicting the performance of computing systems without requiring an actual execution [20, 117]. They are represented by mathematical formulas that have a closed-form solution providing an approximation of the capacity of system components under a different configuration or workload, without any implementation effort needed.

Analytically modeling a system, however, can be a complex task. It is difficult to capture the impact of the several configuration parameters, so it becomes necessary to make unrealistic assumptions regarding the characteristics of the workload, the system, and the platform. As a result, analytical models often include simplifications that limit their usability in realistic settings.
Simulation is another alternative for predicting system performance [8, 70, 122, 123]. Compared to analytical models, simulations allow us to capture the behaviour of more complex systems and, since a simulator can handle a near-real workload, they do not require the unrealistic assumptions about the system that analytical modeling does.

For example, Network Simulator 2 (NS2) [8] is a discrete-event simulator for computer networks, typically used to support research and provide some approximation of the actual behaviour of the system. NS2 provides wide support for the simulation of several protocols in different layers of the network stack. Also used to support research, PeerSim [123] simulates peer-to-peer systems, which commonly require simulation to scale to a large number of nodes. DiskSim [1] simulates hard disk behaviour. Other approaches [70, 122] target the simulation of parallel file systems.

Despite its advantages over analytical models, properly capturing the necessary details of the system to produce useful output is challenging, especially in distributed environments, due to the complexity and scale of the system. As a result, the cost of implementing the simulator and executing simulations can be a drawback. PeerSim, for example, provides two operation modes that vary in simulation detail: the simpler one is less accurate, but faster; the more complex one is more realistic, but may take longer, making it less scalable.

The machine learning approach tries to bridge the gap between needing to know the system's internal details and still providing useful information about the system's behaviour. To this end, machine learning relies on direct measurements of the system to derive the expected behaviour using different techniques (e.g., linear regressions, decision trees, and neural networks).

For example, Crume et al.
[61] use a machine learning approach to define a function capturing hard disk access times. Mesnier et al. [118, 120] rank different types of machines, based on a set of benchmarks, to describe them in terms of relative performance among the different types (say 'A' and 'B'); their approach then tries to derive the behaviour of a workload on platform 'A', given the actual behaviour on platform 'B' and a description of their relative performance. IronModel [155] uses a Classification And Regression Trees (CART) technique to derive black-box models of storage systems.

Machine learning is an attractive approach, given its promise of reducing the mismatch between the performance given by the model and the actual performance. Its various techniques, however, may require a large set of measurements to train the models, and typically need more data for each scenario that they try to model. A model therefore needs data covering a wide range of scenarios (e.g., a large number of possible configurations), which makes it harder to use machine learning to reason about the behaviour of new system functionality, configurations, or deployment scenarios.

Building Prediction Mechanisms

Part of this past work (e.g., NS2 [8] and PeerSim [123]) is research oriented, having the goal of obtaining a first approximation of performance, rather than the more accurate predictions needed in configuration-oriented scenarios.

Towards supporting configuration, some systems rely on one of the aforementioned modeling approaches, or combine several of them, to provide tools that support "what...if..." queries. These tools allow users to estimate the impact of configuration changes by submitting probe queries such as "What would the response time be if the number of storage nodes changed from 3 to 5?".

Some database management systems (DBMS) also support "what...if..." queries (see past work [45–47, 157, 163]).
Examples of supported decisions are the choice of cache size [126] and the choice of indices [44, 148].

For storage systems that provide file- or object-oriented access, Thereska et al. [156] add support for "what...if..." queries to aid in the configuration of the software layer of their UrsaMinor storage system [10], focusing on response time and network usage metrics. Such queries help the user make decisions regarding data encoding (e.g., parameters for erasure codes). The prediction tool models storage nodes as CPU, network interface, buffer cache, and disks. Each of these components is associated with a prediction mechanism that relies on a specific technique: analytical modeling (network and disks), simulation (buffer caches), or direct measurements (CPU). The system relies on a monitoring system [158] to feed the prediction tool with detailed measurements based on the requests' paths through the system, including kernel instrumentation. Other approaches also target storage systems to automate decisions, but focus on different metrics; for example, Keeton and Merchant [101], Gaonkar et al. [74], and Keeton et al. [102] target how configuration can impact reliability.

Chapter 3 (Section 3.5) presents additional discussion of different approaches to predicting performance for storage systems.

Automating the Configuration Choice

A naïve approach to automating configuration is to automatically vary the values of the configuration parameters in an exhaustive search: execute the application, measure system activity, and choose the parameters that best match the optimization criteria. The drawback of this approach is the time it takes.

Most prediction mechanisms can aid in configuration by reducing the time required to explore the configuration space, but they do not automate the choice of parameters.
The user has to decide on the best configuration parameters by exploring several possible configurations, which can be easily handled, or automated, if the cost of prediction is low. The rest of this section gives an overview of different approaches that rely on prediction mechanisms to automate the configuration choice.

Exploring the configuration space with direct measurements is feasible when the cost of executing the application is low, or when it is possible to extract just the application's kernel and try different parameters. For example, Datta et al. [64] use this approach to optimize the data-structure decomposition of stencil computations at compile time in order to better fit the memory bus and cache mechanism of a given computer architecture.

Analytical performance models can provide the best possible configuration for a given parameter. The drawback, however, besides the complexity of analytically modeling the system, is that solving the necessary equations may require a closed-form expression with a well-known method to calculate its roots. Even when the equations are provided in this form, solving them can take a long time, or even be intractable, depending on the complexity of the analytical model.

Another approach is to use control theory to perform online adaptation, which is effective if the system has a target value as a goal, instead of an optimization goal. This model does not provide the final value for a parameter, but it can indicate the expected behaviour of varying a given parameter. In this scenario, a feedback control loop continuously adjusts a configuration parameter until the system delivers a specific performance level. For example, Storage@desk [94] controls network bandwidth: the user specifies a target value for bandwidth, and the control loop tries to keep the bandwidth close to this value by delaying write or read requests.

Some solutions use a model-driven approach.
For example, Oracle9i [63] and DB2 [151] use a model to predict performance before the system acts, to determine the buffer size allocated to execute different SQL queries in order to serve new requests with no abrupt impact on performance. Mian et al. [121] use a set of heuristics to search for a cost-effective provisioning decision in the domain of data analytics workloads, based on empirical response times for the target queries.

In the context of distributed services, dynamic provisioning [35, 36, 43, 68, 137] has also used model-driven or control-theory approaches for applications that run as services in data centers. Online monitoring and historical workload models provide information about the workload characteristics. The main goal of dynamic provisioning is to reduce the cost of running the system by dynamically determining the minimum amount of resources (e.g., memory, or the number of web and application servers) that can meet Service Level Objectives (SLOs) predefined in a Service Level Agreement (SLA).

Applying heuristics to explore the configuration space is one more approach towards automation, since exploring the whole configuration space can be time-consuming. In storage systems, this approach has been used with analytical models or simulations. For example, Strunk et al. [152] describe a tool that helps administrators in the task of buying a cost-effective storage solution. The tool uses prediction mechanisms similar to those described by Thereska et al. [156]. It receives the datasets to be used in the future system and a utility function as the optimization criterion, and explores the potential configurations.

To reduce the number of possible configurations to evaluate, this solution uses genetic algorithms to guide the configuration choice.
Similar approaches (e.g., Hippodrome [21], Minerva [20], and Ergastulum [22]) target systems based on Redundant Array of Independent Disks (RAID).

2.3.3 Placing this Dissertation in Context

The work presented in this dissertation differs from past work in several ways.

The typical target context in past work on storage systems is based on the work environment of a company with an enclosed storage solution, or on predicting the performance of a small-scale deployment from the perspective of a single client, or a few clients. This differs from a cluster-based [14] context with tens of machines acting as storage servers.

This work proposes prediction mechanisms for different optimization techniques, with a focus on distributed storage systems - specifically intermediate storage systems in the context of workflow and checkpointing applications. As a result, it targets a wider set of configuration parameters for storage systems, and different combinations of possible optimization criteria. Specifically, it targets predicting the performance of a complete execution of workflow applications - the focus of this dissertation - as well as the performance of storage operations when employing data deduplication.

One of the main challenges for past approaches has been seeding the model, based on simulation or analytical modeling, so it can be used to predict performance while still offering accuracy good enough to support configuration. One common assumption in these past approaches is that human intervention is needed to seed the models or provide details about the workload. This assumption prevents some of the past approaches from being useful in practice, since it assumes that the user knows the details of the performance of the different components in the system, or has an approach to perform system identification (i.e., to identify the proper values for a model's parameters for a given system); without it, the model is useless.
Other approaches rely on a fine-grained monitoring system in order to identify all of a model's parameters, which can lead to changes in the target system itself or the underlying platform (e.g., to instrument the code, the system, or the kernel), and to more overhead to collect the needed measurements (e.g., Stardust [158]). Sections 1.2 and 3.2.1 present a further discussion of these problems.

This work is based on models that have a number of attractive properties: simple - by not capturing the details of the internal behaviour of each system component; uniform - by modeling each system component similarly; and generic - by relying on a high-level representation of object-storage systems.

To provide these properties, this work proposes performance-prediction mechanisms that leverage the target application's characteristics to accelerate the exploration of the provisioning and configuration space. Specifically, a performance-prediction mechanism consists of a model, a procedure to seed the model, and an implementation of these two, providing a software tool to predict the behaviour of intermediate storage systems.
In this way, these predictors can be used by the user to answer her specific "What...if..." questions, or as building blocks for an automated configuration solution that applies some heuristic to explore the configuration space.

Additionally, these mechanisms rely on monitoring information available from application-level measurements, requiring neither changes to the storage systems nor specialized monitoring systems, making it possible for the seeding information to be extracted by the simple benchmarks described.

In providing these solutions, this study also identifies a solution space, for the target context, where it is possible to provide a prediction mechanism with the characteristics described above while still providing predictions with enough accuracy to support the user's decisions about provisioning and configuration of the storage system.

Chapter 3
Predicting Performance for I/O Intensive Workflow Applications

This chapter presents a solution to address the problem of supporting storage provisioning, allocation, and configuration decisions for workflow applications in the context of many-task computing and an intermediate storage system (see Section 2.2). It focuses on answering the following two high-level questions: "How can one leverage the characteristics of I/O intensive workflow applications to build a prediction mechanism for traditional performance metrics (e.g., time-to-solution and network usage)?" and "Which extensions does the performance predictor need in order to capture energy consumption behaviour, in addition to traditional performance metrics?"

To enable selecting a good choice in a reasonable time, the proposed approach accelerates the exploration of the configuration space using a low-cost performance predictor that estimates, among other metrics, the total execution time of a workflow application in a given setup [1].
The predictor is lightweight (200x to 2000x less resource intensive than running the application itself) and accurate (80% of the evaluated scenarios have prediction errors smaller than 10%, and in the worst-case scenario the prediction error is still within 20%) [2][3].

[1] Predicting Intermediate Storage Performance for Workflow Applications [55] Lauro Beltrão Costa, Samer Al-Kiswany, Abmar Barros, Hao Yang, and Matei Ripeanu. In Proceedings of the 8th Parallel Data Storage Workshop, PDSW '13, pages 33-38. ACM, November 2013.

This chapter is organized as follows: Section 3.1 presents a summary of the problem. Section 3.2 presents the requirements and the design of the proposed prediction mechanism. Section 3.3 presents experience with using this prediction mechanism when evaluated independently with synthetic benchmarks, and in the context of making configuration choices for real applications focusing on turn-around time and cost. Section 3.4 presents an extension of the initial model presented in Section 3.2.3 that takes energy predictions into consideration, and summarizes a preliminary energy evaluation led by Hao Yang [4]. Section 3.5 describes further related work. Finally, Section 3.7 summarizes the chapter and its limitations, and discusses some lessons learned during the exercise of designing the performance predictor.

3.1 Motivation

As discussed in Section 2.2.3, assembling workflow applications by putting together standalone binaries has become a popular approach to support large-scale science [33, 147, 169] (e.g., modFTDock [124], Montage [32], or BLAST [19]). The processes spawned from these binaries communicate via temporary files stored on a shared storage system.
In this setup, the workflow runtime engines are basically schedulers that build and manage a task-dependency graph based on the tasks' input/output files (e.g., SWIFT [166], Pegasus [67]).

To avoid accessing the platform's backend storage system (e.g., NFS, GPFS, or Amazon S3), an approach using what is called an intermediate storage system has been proposed [14, 169]. This approach advocates using some of the nodes allocated to the application to provide a shared storage system; that is, aggregating (some of) the resources of an application allocation to provide a shared temporary storage system dedicated to (and co-deployed with) the application.

[2] Supporting Storage Configuration and Provisioning for I/O Intensive Workflows [58] Lauro Beltrão Costa, Samer Al-Kiswany, Hao Yang, and Matei Ripeanu. Under Review.
[3] Supporting Storage Configuration for I/O Intensive Workflows [57] Lauro Beltrão Costa, Samer Al-Kiswany, Hao Yang, and Matei Ripeanu. In Proceedings of the 28th ACM International Conference on Supercomputing, ICS '14. ACM, June 2014.
[4] Energy Prediction for I/O Intensive Workflows [171] Hao Yang, Lauro Beltrão Costa, and Matei Ripeanu. In Proceedings of the 7th Workshop on Many-Task Computing on Clouds, Grids, and Supercomputers, MTAGS '14, pages 1-6. ACM, November 2014.

Aggregating node-local resources to provide the intermediate storage offers a number of advantages discussed in Chapter 1, and in more detail in Section 2.2, for example: higher performance, higher efficiency, and incremental scalability. This scenario also opens the opportunity to optimize the intermediate storage system for the target workflow application, since a storage system used by a single workflow, and co-deployed on the application nodes, can be configured specifically for the I/O patterns generated by that workflow [159].

Configuring the intermediate storage system, however, becomes increasingly complex.
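As an aside, the task-dependency graph that workflow engines derive from the tasks' input/output files (mentioned above) can be sketched as follows. The four-task pipeline and all names here are hypothetical, chosen only to illustrate the inference:

```python
def build_dependency_graph(tasks):
    """tasks: dict task_id -> (set of input files, set of output files).
    Returns dict task_id -> set of task_ids it depends on, i.e., the
    producers of its input files."""
    producer = {}
    for tid, (_ins, outs) in tasks.items():
        for f in outs:
            producer[f] = tid  # assumes single-writer files, typical for workflows
    deps = {}
    for tid, (ins, _outs) in tasks.items():
        deps[tid] = {producer[f] for f in ins
                     if f in producer and producer[f] != tid}
    return deps

# A tiny split -> two parallel stages -> merge pipeline:
tasks = {
    "split":  (set(),                {"a.in", "b.in"}),
    "dock_a": ({"a.in"},             {"a.out"}),
    "dock_b": ({"b.in"},             {"b.out"}),
    "merge":  ({"a.out", "b.out"},   {"result"}),
}
deps = build_dependency_graph(tasks)
```

Here deps["merge"] contains both dock tasks, so the engine knows "merge" cannot be scheduled until they finish.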
Users have to tune the configuration of storage techniques, and these techniques embody trade-offs that affect different applications, and different metrics of the optimization criteria, in different ways. Further complicating this scenario, the user faces resource allocation decisions that often entail trade-offs between cost and application turn-around time [49, 91, 100]. Typical allocation choices involve deciding on the number of nodes to be used for the application in batch-computing environments, and specifying the nodes' type in cloud-computing environments (i.e., the per-node hardware capabilities in terms of compute, memory, storage, and network).

This involves allocating resources and configuring the storage system (e.g., chunk size, stripe width, data placement policy, and replication level). Thus, the decision space revolves around: provisioning the allocation - deciding on the total number of nodes, and deciding on node type(s), or node specification, for cloud environments; allocation partitioning - splitting these nodes or not between the application and the intermediate storage system; and setting storage system configuration parameters - choosing the values for several configuration parameters (e.g., chunk size, replication level, cache/prefetching and data placement policies) for the intermediate storage system. In this context, manually fine-tuning the storage system configuration parameters and allocation decisions is undesirable for multiple reasons (see Section 1.1).

In this space, the user's goal is typically to optimize a multi-objective problem in at least two dimensions: to maximize performance (e.g., reduce application execution time) while minimizing cost (e.g., reduce the total CPU-hours or dollar amount spent).
The user can commonly describe her goal in the form of a utility function [152, 167], which reduces this multi-dimensional space (e.g., performance, cost, and energy consumption) to just one dimension: the utility [5].

[5] This dissertation does not discuss the mapping of predicted performance metrics to utility in order to support configuration, which can be found in past work [152, 167].

More concretely, the user is often interested in answering specific questions. For example:

• How should the storage system be configured to achieve the fastest execution?
• What is the allocation that can achieve the lowest total cost?
• How should I partition the allocation among application and storage nodes to achieve the highest performance?
• What is the allocation that is most cost efficient (i.e., has the lowest cost per unit of performance)?
• Which configuration can lead to energy savings?
• What is the time-to-solution impact of energy-centric tuning?

3.2 The Design of a Performance Prediction Mechanism

This section presents the design of a performance prediction mechanism for object-based storage systems in the context of workflow applications by targeting the following question: "How can one leverage the characteristics of I/O intensive workflow applications to build a prediction mechanism for traditional performance metrics (e.g., time-to-solution and network usage)?" Specifically, given a particular storage system configuration, a workload description, and a characterization of the deployment platform based on a simple system identification process (e.g., storage nodes' service time, network characteristics), the mechanism predicts the total application turnaround time.

Making accurate performance predictions for distributed systems is challenging.
Since purely analytical models cannot provide adequate accuracy in most cases, simulation is the most commonly adopted solution. At one end of the design spectrum, current practice (e.g., the NS2 simulator [8]) suggests that while simulating a system at fine granularity (e.g., packet-level simulation in NS2) can provide high accuracy, the complexity of the model and the seeding process, as well as the number of events generated, make accurately simulating large-scale systems infeasible, and may reduce the applicability of this approach to small systems and/or short traces. At the other end of the spectrum, coarse-grained simulations (e.g., PeerSim [123] or SimGrid [42]) scale well, but at the cost of lower accuracy.

Two observations were key to enable a solution that reduces simulation complexity and increases scalability. First, as the goal is to support configuration choice for a specific workload, achieving perfect accuracy is less critical, as long as the predictions can support the user's configuration decisions (see Section 3.3). Second, this solution takes advantage of the workload characteristics generated by workflow applications: relatively large files, distinct I/O and processing phases, single-write-many-reads, and specific data access patterns. These observations enable a reduction in simulation complexity by not simulating in detail some of the control paths that do not significantly impact accuracy. For example, the chunk transfer time is dominated by the time to send the data; hence, not accounting for the acknowledgment messages and all the metadata messages does not tangibly impact accuracy.

The proposed solution uses a queue-based storage system model. The model requires three inputs from the user: (i) the storage system configuration for the scenario to be explored, (ii) a workload description, and
(iii) the performance characteristics of each storage system component (i.e., system identification based on seeding scripts described in Section 3.2.4). The predictor instantiates the storage system model with the specific component characteristics and configuration, and simulates the application run as described by the workload description. Figure 3.1 shows how these different components interact with the predictor.

[Figure 3.1: Predictor's input and possible use-cases. To predict an application's performance, the predictor (Section 3.2.3 and Section 3.2.6) receives the platform (Section 3.2.4) and workload (Section 3.2.5) descriptions. The use-cases include (i) "What...If..." analysis for supporting provisioning and configuration decisions - the focus of this chapter, and (ii) performance tests for supporting the development of a storage system, as described in Chapter 4.]

The remainder of this section discusses the requirements for a practical performance prediction mechanism (Section 3.2.1), and briefly presents the key aspects of the object-based storage system architecture modeled in this study (Section 3.2.2). Then, it focuses on the proposed solution: it presents the model (Section 3.2.3), the system identification process to seed the model (Section 3.2.4), an overview of the workload description (Section 3.2.5), and, finally, its implementation (Section 3.2.6).

3.2.1 Requirements

A practical performance prediction mechanism should meet the following, partially conflicting, requirements that bind the solution space:

Accuracy (R1). The mechanism should provide adequate accuracy.
Although higher accuracy is always desirable, in the face of practical limitations to achieving perfect accuracy, there are decreasing incremental gains for improved accuracy. For example, to support configuration decisions, a predictor only needs to correctly estimate the relative performance or trends resulting from changing a configuration parameter. Similarly, if two configurations offer similar performance, perfect accuracy is less important as long as the prediction mechanism places their performance as similar. In fact, Pereira et al. [131] show that even a perfect replay of the storage system operations on the actual system, dedicated to the application, does not predict the performance of the storage system with 100% accuracy. (See the evaluation presented in Sections 3.3 and 3.4.2.)

Response Time and Scalability (R2). The predictor should enable exploration of the configuration space at a low cost. The mechanism should offer performance predictions quickly, have low resource usage, and scale with both (i) the system size, and (ii) the I/O intensity of workflow applications. (See the evaluation presented in Section 3.3.5.)

Usability and Generality (R3). The predictor should not impose a burdensome effort to be used. Specifically, the bootstrapping/seeding process should be simple, and it should not require a storage system redesign (or a particular initial design) to collect performance measurements. Additionally, the predictor should model a generic object-based distributed storage design, and using it should not require in-depth knowledge of storage system protocols and architecture. A discussion of usability is presented in the remaining part of this section as well as in Section 3.4.1.

Ability to explore "What...if..." scenarios (R4). A prediction mechanism should be able to support exploring hypothetical scenarios, such as scenarios that assume a new or simply different platform (e.g., more machines, or faster machines).
For example, "What would be the turnaround time of application A if it uses 20 instead of 10 machines?" (See the evaluation presented in Sections 3.3 and 3.4.2.)

3.2.2 Object-based Storage Systems

The predictor focuses on the widely adopted object-based storage system architecture (e.g., UrsaMinor [10], PVFS [85], and MosaStore [14]) described in detail in Section 2.1. This architecture includes three main components: a centralized metadata manager, storage nodes, and a client-side system access interface (SAI). The manager maintains the stored files' metadata and system state. Files are split into chunks stored across several storage nodes, and the client SAIs implement the data access protocols.

Currently, the prediction mechanism assumes that the parameters provided by MosaStore (see Section 2.1.3) are configurable - e.g., chunk size, stripe width, replication level, cache size, and data placement policies. This approach can be extended to support other configuration parameters.

3.2.3 System Model

The proposed solution uses a queue-based storage system model for the system components' operations and their interactions, as summarized in Figure 3.2.

All participating machines are modeled similarly, regardless of their specific role (Figure 3.2): each machine hosts a network component, and can host one or more system components - each modeled as a service with its own service time per request type and a First In, First Out (FIFO) queue.

Each system component and its infinite queue represent a specific functionality: The manager component is responsible for storing files' and storage nodes' metadata. The storage component is responsible for storing and replicating data chunks. Finally, the client component receives the
read and write operations from the application, implements the storage system protocols at a high level by sending control or data requests to other services, and communicates again with the application driver once a storage operation finishes. Each of these storage system components is modeled as a service that takes requests from its queue and uses the network service to send requests and responses (see Table 3.1). The application driver can also issue requests directly to the client service queue.

Table 3.1: Model parameters describing the platform. To instantiate the storage system model, one needs to specify these parameters. The number of components in the system can be freely specified, the only restriction being Nsm > 0 and Ncli > 0. The service times are part of the platform performance description, and their identification is described in Section 3.2.4. For simplicity, part of the description in this chapter uses simply µsm to refer to the service time of a storage operation (either read or write) in the storage module; similarly, µnet refers to µremNet or µlocNet.

  System Deployment
    Number of Storage Nodes                      Nsm
    Number of Client Nodes                       Ncli
    Collocation of Storage and Client Modules    Colloc
  Performance
    Manager Service Time                         µman
    Storage Module Read Service Time             µsmRead
    Storage Module Write Service Time            µsmWrite
    Client Service Time                          µcli
    Remote Network Service Time                  µremNet
    Local Network Service Time                   µlocNet

The model captures the four main storage operations: open, close, read, and write. As a rule, the prediction mechanism accurately models the data paths of the storage system at chunk-level granularity, and the control paths at a coarser granularity: it models only one control message to initiate a specific storage function, while an implementation may have multiple rounds of control messages.

The network component and its in- and out- queues model the network-related activity of a host.
Key here is to model network-related contention while avoiding modeling the details of the transport protocol (e.g., dealing with packet loss and re-transmission, or connection establishment and tear-down details). Modeling these details could improve the accuracy of the predictor, but at the cost of longer simulations; the additional accuracy is not needed to guide the configuration for the applications that this research targets (the evaluation in Section 3.3 provides evidence for this observation).

The requests in the out-queue of a network component are broken into smaller pieces that represent network frames, and sent to the in-queue of the destination host. Once the network service processes all the frames of a given request in the in-queue, it assembles the request and places it in the queue of the destination service.

In addition to the storage system modeling, the prediction mechanism captures part of the execution behaviour of the application. The predictor also performs scheduling according to the input describing the application, which includes a task-dependency graph (capturing the workflow execution plan) used for scheduling and data placement purposes (see Figure 3.1 and Section 3.2.5 for more details).

The system components can be collocated on the same host (e.g., the client and storage components running on the same host). In this situation, requests between collocated services also go through the network, but have a faster service time than remote requests - representing a loopback data transfer (Section 3.2.4).

3.2.4 Model Seeding: System Identification

To instantiate the storage system model, one needs to specify the number of storage (Nsm) and client components (Ncli) in the system, the service times for the network (µnet, which currently captures latency and bandwidth together) and for the system components (storage - µsm, manager - µman, and client - µcli), and the storage system configuration.
The number of components in the system and the storage system configuration (Section 2.1.3) can be freely specified and depend on the scenario currently under "What...if..." analysis. The service times are part of the platform description, and their identification is described in this section [6].

[6] In fact, the storage module has two main operations with different service times (µsmRead and µsmWrite), which are seeded in a similar way. For simplicity, this section uses µsm to describe the seeding procedure; similarly, it uses µnet to refer to µremNet or µlocNet.

[Figure 3.2: A queue-based model of a distributed storage system. Each component (manager, client component, and storage component) has a single system service that processes requests from its queue. Additionally, each host has a network component with an in- and an out- queue. The network core connects and routes the messages between the different components in the system, and can model network latency and contention at the aggregate network fabric level. Solid lines show the flow going out from a storage system component, while dashed lines show the in-flow path. The application driver and the scheduler are responsible for deciding which client will execute a specific task and for issuing the operations of that task. Table 3.1 lists the model parameters describing the platform performance.]

Compared to past work, this approach focuses on making the system identification process simple, by not being intrusive: no changes are required to the storage system or kernel modules, in order to satisfy R3.
Additionally, the seeding process relies on application-level calls and on a deployment of up to only three machines, regardless of the size of the system simulated, to keep the process simpler and less costly.

The system identification process is automated with a script as follows. To identify the network service time per chunk/request (µnet - either local or remote), the script runs a network throughput measurement utility (e.g., iperf) to measure the throughput of both remote and local (loopback) data transfers. Then, the script measures the time to read/write a number of files to identify the client and storage service times per data chunk. To this end, the system identification script deploys one client, one storage node, and the manager on different machines, and writes/reads a number of files. For each file read/write, the benchmark records the total operation time, and the script computes the average read/write time (T_tot). The number of files read/written is set to achieve 95% confidence intervals with ±5% relative errors.

The operation total time (T_tot) includes the client-side processing time (T_cli), the storage node processing time (T_sm), the total time related to the manager operations (T_man), and the network transfer time (T_net):

    T_tot = T_cli + T_sm + T_man + T_net    (3.1)

The network service time (µnet) uses a simple analytical model based on network throughput, and is proportional to the amount of data to be transferred in a packet.

To isolate just T_cli + T_man, the script runs a set of zero-sized reads and writes. This forces a request to go through the manager, but does not touch the storage modules. Since decomposing T_cli and T_man is not possible without probes in the storage system code, the script assumes T_cli = 0 and associates the whole cost of the zero-sized operations with the manager to obtain µman from T_man.
While the network measurement utility can estimate T_net, the script can infer T_cli + T_man, and therefore

    T_sm = T_tot - T_net - T_man    (3.2)

To obtain the service time per chunk, the times are normalized by the number of chunks. Therefore,

    µsm = T_sm / (dataAmount / chunkSize)    (3.3)

In addition to seeding the service times based on the averages obtained by the seeding scripts, the service times can also be provided as a statistical distribution. The current implementation of the seeding scripts extracts an empirical distribution.

3.2.5 Workload Description

The predictor takes as input a description of the workload. This description contains two types of information: (i) a trace of storage operations per task/client (i.e., open, read, write, and close calls with timestamp, operation type, size, offset, and client id), and (ii) a task-dependency graph capturing the workflow execution plan for scheduling and data placement purposes (see Figure 3.1). Table 3.2 summarizes the information used to describe and simulate the workload.

Table 3.2: Summary of the workload description's variables.

  A = (us_a, uf_a, T)                              Application A
    T                                              the set of tasks in application A
    us_a                                           application start time
    uf_a                                           application finish time
  t_k = (vi_k, vcli_k, ve_k, vc_k, O_k, D_k)       t_k ∈ T
    vi_k                                           task start time
    vcli_k                                         machine that executes t_k
    ve_k                                           task's processing time
    vc_k                                           workflow runtime overhead for t_k
    O_k                                            task's storage operations (client trace)
    D_k                                            t_k's dependencies
  o_jk = (type, timestamp, file, size, offset)     o_jk ∈ O_k

Formally, let A = (us_a, uf_a, T), where T is the set of tasks in application A, and us_a is the start time of application A and of the simulation.
uf_a is the finish time of A; it is not known at instant us_a, and estimating it is the main goal of the prediction mechanism.

Let t_k ∈ T and t_k = (vi_k, vcli_k, ve_k, vc_k, O_k, D_k), where vi_k is the start time of the task, vcli_k is the machine that executes t_k, ve_k is the execution time of the task (not including the storage operations), vc_k is the workflow runtime overhead related to t_k, such as scheduling operations and task execution launch (e.g., a Secure Shell (SSH) invocation), O_k is the set of storage operations performed by task k, and D_k is the set of tasks on which t_k depends.

Let o_jk ∈ O_k and o_jk = (type, timestamp, file, size, offset), where o_jk describes a storage operation of a given type, size, and offset on file at a specific timestamp after vi_k. vi_k and vcli_k are not known at instant us_a, and depend on how the simulation progresses.

ve_k and O_k are part of the workload description that serves as input for the predictor. Note that O_k may also contain operations that specify a file-specific configuration (e.g., setting an extended attribute in MosaStore, as described in Section 2.2.4). The union of all task dependencies (⋃ D_k, for k = 0 .. |T|) can also be represented as a DAG and is used for scheduling in the simulation.

The client traces (O_k) are obtained by running the application and logging its storage operations. The execution plan can be provided by the workflow description (such as the one used by Swift [166]) or by an expert user, derived from the workflow runtime and the storage system information [162], or extracted from log files (D_k is directly related to the input files of t_k, since the tasks use files to communicate).

Currently, the workflow execution plan is obtained from the pyFlow scheduler [60], and the client traces from MosaStore storage logs, which require no further modification for this work.
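For concreteness, the formal workload description above could be encoded with structures like the following. The field names mirror the symbols in Table 3.2, but this encoding is illustrative rather than the predictor's actual input format:

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class StorageOp:
    """o_jk = (type, timestamp, file, size, offset)."""
    type: str        # "open" | "read" | "write" | "close"
    timestamp: float # seconds after the task start vi_k
    file: str
    size: int
    offset: int

@dataclass
class Task:
    """t_k: only ve_k, vc_k, O_k, and D_k are inputs to the predictor;
    vi_k and vcli_k are assigned as the simulation progresses."""
    task_id: str
    exec_time: float          # ve_k: compute time, excluding storage operations
    runtime_overhead: float   # vc_k: scheduling / launch (e.g., SSH) overhead
    ops: List[StorageOp] = field(default_factory=list)  # O_k (client trace)
    deps: Set[str] = field(default_factory=set)         # D_k

# A one-task example: 12.5 s of compute, then a 2 MB output write.
t = Task("stage1", exec_time=12.5, runtime_overhead=0.3,
         ops=[StorageOp("write", 12.5, "out.dat", 2 * 1024 * 1024, 0)])
```

A full application A would then be a start time plus a collection of such tasks, with the DAG recoverable from the deps sets.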
A FUSE wrapper was also developed at NetSysLab to log the storage operations, in case the storage system does not provide the needed information.

Preprocessing Storage Operations

The predictor preprocesses the logs to infer the storage operations' elapsed times, the workflow runtime overhead (vc_k), each task's execution time (ve_k), and the inter-arrival times.

In a subsequent step, the predictor preprocesses the description of the storage operations to potentially reduce the number of events to be simulated. Specifically, this preprocessing phase aggregates some of the storage operations issued by the client, and analyses the impact of the cache size on the amount of data that each operation would transfer. In this way, the simulations focus on predicting the behaviour of the system when bringing the data from the storage modules to the client module, which can be an operation local to the machine.

Regarding the aggregation, the predictor relies on the fact that workflow applications have largely distinct phases of reads, processing, and writes, each focusing on a small set of files. Relying on this characteristic, the predictor aggregates all the immediately subsequent operations of the same type on a given file. For example, suppose that an application performs 10 write operations of 256KB each; instead of passing those 10 write operations to the simulator, the preprocessing phase aggregates them and passes just one write operation of 2.5MB to be simulated.

The amount of data to be transferred, from storage modules to client modules or vice versa, depends on the cache size and on the order of the operations, but not on the network nor on the data distribution - which still affect whether the transfers are local or remote. In this way, the preprocessing phase verifies the cache-hit ratio of the operations based on the cache size and the order of the operations, and infers the actual amount of data to be transferred.
It then aggregates the result of this analysis in one or a few storage operations. Consequently, the simulator has to process fewer storage operations and fewer network events.

3.2.6 Model Implementation: The Simulator

The above model is implemented as a discrete-event simulator in Java. The simulator receives as input: (i) a summarized description of the application workload (Section 3.2.5), (ii) the system configuration (currently, it supports the parameters described in Section 2.1.3), (iii) the deployment parameters (number of storage nodes and clients, and whether they are collocated on the same hosts – colloc), and (iv) a performance characterization of system components: service times for network, client, storage, and manager (Section 3.2.4).

Once the simulator instantiates the storage system, it starts the application driver that emulates processing the application workload (Section 3.2.5). The driver functionalities can be divided into two main types: (i) functionalities related to the workflow runtime, and (ii) functionalities related to the operations of the actual task execution.

The workflow runtime functionalities focus on scheduling and currently support two heuristics: (i) workqueue [62], and (ii) data-location aware [60, 142]. In the workqueue alternative, whenever the simulated environment has a client machine available (i.e., not running any task), the driver schedules a task to that machine. In the location-aware alternative, the driver gives priority to the machines that already have the input files, in an approach similar to the one described by Santos-Neto et al. [142] and evaluated in the context of many-task applications by my collaborators and me [60].

Once the task is scheduled (v_cli^k is chosen), the driver creates the events corresponding to the workflow runtime overhead, task processing, and storage operations (i.e., v_c^k, v_e^k, and O_k) and places them in the client service queue.
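The two scheduling heuristics can be sketched as a single selection function. The data structures here (a task dictionary with an `inputs` list, a file-to-machine map) are hypothetical simplifications of the driver's internals, not its actual interface:

```python
def schedule(task, idle_machines, file_locations):
    """Pick a machine for a ready task.
    - workqueue: any idle machine.
    - data-location aware: prefer an idle machine that already stores
      one of the task's input files.
    `task` is a dict with an 'inputs' list of file names;
    `file_locations` maps file name -> machine. Illustration only."""
    if not idle_machines:
        return None  # no idle client machine; the task stays queued
    # location-aware preference: a machine already holding an input file
    for f in task.get("inputs", []):
        m = file_locations.get(f)
        if m in idle_machines:
            return m
    # fall back to plain workqueue behaviour: any idle machine
    return sorted(idle_machines)[0]
```

Plain workqueue scheduling is the special case in which the location preference never fires.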
File-specific configuration (as proposed by Ursa Minor [10] and MosaStore [159]) is described as part of the operations in the workload description.

As in a real system, the manager component maintains the metadata of the system (i.e., it implements data placement policies, and keeps track of the file-to-chunk mapping and chunk placement).

The processing of these initial events may result in the creation of other events. To make the process clearer, consider the following example for a file write operation:

1. A client contacts the manager asking for free space; the manager replies specifying a set of storage services with free chunks.
2. The client requests each storage service to store chunks in a round-robin fashion based on the set received from the manager.
3. After processing a request to store a chunk, each storage service replies to the client acknowledging the operation success.
4. After sending all the chunks, the client sends the chunk map to the manager.
5. Once the manager acknowledges, the client returns a notice of success to the application driver.

A write operation generates two requests to the manager and one request per chunk to the storage nodes. Table 3.3 summarizes the main operations modeled for this study and their key interactions.

Table 3.3: Summary of the main operations modeled in the simulator; these operations also have a callback associated with each one.

allocateChunks – A client may request the manager to allocate chunks on the storage systems when the previously allocated chunks are used. The manager performs such allocation according to the data-placement policy in use. It is not available to the application.
read – The application issues a read operation on a client, which results in several readChunk requests from the client to the storage modules storing the requested chunks. A previous call to open may be necessary.
readChunk – A client requests a specific chunk from a storage module. It is the result of a read operation. It is not available to the application.
open – A client requests the chunk map and other metadata information (e.g., file tags) from the manager.
write – The application issues a write operation on a client, which results in several writeChunk requests from the client to storage modules according to the chunks received from a previous call to allocateChunks. At the end of this operation, the client sends an updated version of the chunk map to the manager. A previous call to open or to allocateChunks may be necessary.
writeChunk – A client sends a chunk to be stored by a storage module. If replication is enabled, the storage module may issue other writeChunk requests to other storage modules.

The manager implements a number of data placement policies. The default policy selects, for a write operation, a "stripe-width" of storage services. To model per-file optimizations, the client can overwrite system-wide configurations by requesting the manager to use a specific data placement policy. For example, the client may require that a file is stored locally, that is, on a storage service that is located on the same host. In this case, the manager attempts to allocate space on that specific storage service for that write operation. The file-specific data placement policy request is part of the workload description.

3.2.7 Modeling Methodology Remarks

During the modeling phase, the requirements were partially conflicting: to provide adequate accuracy (R1) while keeping the model and its seeding procedure simple enough to be fast and scalable (R2), and easy to use (R3). The focus was to be as simple as possible while satisfying R1 for the target domain, not to be a comprehensive solution for distributed storage systems nor for all possible domains of applications.
Thus, the initial phase of modeling targeted only the key interactions between system components. Modeling all system subcomponents and all their interactions in detail would be too complex. Such complexity could improve prediction accuracy, but would have significant drawbacks: a significantly more complex model – as complex as the actual storage system and the underlying environment (e.g., network protocols, operating system buffers, scheduling) – a complex seeding process, lower scalability, and the loss of the model's generality. Furthermore, the improvement in accuracy may not add much value (e.g., when the prediction mechanism is used to decide among system configurations).

From the initial phase, the modeling exercise followed a top-down approach: starting from the simplest model and adding more components or interaction details until the accuracy of all the predictions was within 10% of actual performance (and the median error was within 5%) for a set of microbenchmarks that covered approximately 30 different scenarios.

3.2.8 Summary

The predictor takes as input a description of the workload in the form of (i) a per-client I/O operations trace, and (ii) the task-dependency graph for scheduling (see Figure 3.1).

Section 3.3 evaluates requirements R1 (Accuracy) and R2 (Response Time and Scalability) and discusses how they are met by this proposed prediction mechanism. Requirement R4 is a functional requirement and, besides the supported range of different scales and configurations described earlier, Section 3.3 also assesses R4 in the context of R1 and R2.

Requirement R3 (Usability and Generality) concerns mainly how one can provide the input necessary to the predictor with little human effort. For R3, two main points were discussed in the beginning of Section 3.2 and in Section 3.2.4: (i) how to avoid relying on a particular initial system design or a specialized monitoring system, and (ii) how the process of gathering the input for the performance predictor can be completely automated.
Additionally, meeting R3 is also a consequence of the simplicity aspects discussed in Section 3.2.7. Section 3.7 summarizes how some decisions made during the modeling exercise affect the requirements, and what the limitations of this approach are. Section 6.2 discusses the main limitations further and proposes how to address them in future work.

3.3 Evaluation

This section presents the evaluation of the mechanism's prediction accuracy and, more importantly, it demonstrates the mechanism's ability to support correctly identifying a quasi-optimal configuration for a specific application (Requirement R1 from Section 3.2.1). To this end, this evaluation covers a set of synthetic benchmarks and real applications. The synthetic benchmarks are designed to mimic the access patterns [159] of real workflow applications.

To understand how the prediction mechanism can be used in a real set-up, two real workflow applications are used: BLAST [19] and Montage [106]. The goal is to evaluate the mechanism's ability to predict time-to-solution to support the user's decisions about storage configuration and allocation.

Storage system. The deployment uses MosaStore as an intermediate storage system. The storage nodes use RAMDisks, which are frequently used to support workflow applications as intermediate storage (refer to Section 2.2.1 for a description of intermediate storage) since they have higher performance and are the only option in most supercomputers that do not have spinning disks (e.g., IBM BG/P machines).

Testbeds. To demonstrate the ability of the prediction mechanism to perform well independently of hardware deployment and scale, this section focuses on two testbeds. The first testbed (TB20), with 20 machines, is part of the NetSysLab cluster.
Each machine has an Intel Xeon E5345 4-core, 2.33GHz CPU, 4GB RAM, and a 1Gbps Network Interface Controller (NIC), and runs the Fedora 14 Linux OS. The second testbed (TB101), used for larger scale experiments, includes 101 nodes of the Grid5000 'Nancy' cluster [41]. Each machine has an Intel Xeon X3440 4-core, 2.53GHz CPU, 16GB RAM, a 1Gbps NIC, and a 320GB SATA II disk.

Evaluation Metric. The evaluation focuses on prediction accuracy by comparing the predicted execution time and the actual execution time. Formally, this section reports the prediction error as defined by |1 − Time_pred / Time_actual|, where Time_pred = u_f^A (Section 3.2.5) and Time_actual is the actual runtime of the benchmark.

Plots report the average of 10 trials. For actual performance, the figures show the average execution (turnaround) time and standard deviation (in error bars). The number of trials used to calculate the average varies and is presented in each scenario below. The sample size used is enough to guarantee a 95% confidence level according to the procedure described by Jain [98].

Deployment details. In all the experiments, one node runs the metadata manager and the workflow coordination scripts, while the other nodes run the storage nodes, the client SAI, and the application processes. The network was shared with other applications outside our control (5 other machines for TB20, and 43 machines for TB101). The simulator is seeded according to the procedure described in Section 3.2.4.

3.3.1 Synthetic Benchmarks: Workflow Patterns

This section evaluates the accuracy of the prediction mechanism in capturing the system behaviour with multiple clients, multiple applications, and different data-placement policies designed to support workflow applications [159]. The synthetic benchmarks mimic the common data access patterns of workflow applications: pipeline, reduce, and broadcast (Figure 3.3). These are the most popular patterns uncovered by studying over 20 scientific workflow applications by Wozniak and Wilde [169], Shibata et al.
[147], and Bharathi et al. [33]. Additionally, this past work shows that workflow applications are typically a combination of these studied patterns, with one pattern per stage.

The synthetic benchmarks are designed to explore the limitations of the predictor, as they are composed exclusively of I/O operations, which generates high network and storage contention in the system.

Summary of results. The predictor has good accuracy: our approach leads to prediction errors of 5% on average, lower than 8% in 80% of the studied scenarios, and within 14% in the worst case. More importantly, the mechanism correctly differentiates between the different configurations and can support choosing the best configuration for each evaluated scenario.

Experimental setup. The label DSS is used for experiments with the Default Storage System configuration: client and storage modules run on all machines, clients stripe data over all 19 machines of the TB20 testbed, and no optimizations are enabled for any data-access pattern. The label WOSS refers to the system configuration optimized for a specific access pattern (including data placement, stripe width, or replication) [159]. All WOSS experiments assume data location-aware scheduling: for a given compute task, if the input file chunks exist on a single storage node, the task is scheduled on that node to increase access locality. For actual performance, the figures presented in this section show the average execution (turnaround) time and the standard deviation for 15 trials. Section 2.2.3 presents a more detailed description of these patterns and their optimizations.

The goal of showing results for two different configuration choices is two-fold: (i) to demonstrate the accuracy of the predictions for two different scenarios, and (ii), more importantly, to show that the predictions correctly indicate which configuration is the most desired.
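For reference, the prediction-error metric used throughout this evaluation, |1 − Time_pred / Time_actual|, is straightforward to compute:

```python
def prediction_error(t_pred, t_actual):
    """Prediction error |1 - Time_pred / Time_actual| as defined in the
    Evaluation Metric paragraph; returns a fraction (0.05 means 5%)."""
    return abs(1.0 - t_pred / t_actual)
```

Note that the metric penalizes over- and under-prediction symmetrically around the actual runtime.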
To understand the impact of data size, the experiments use three workloads for each benchmark: a medium workload (data sizes are indicated in Figure 3.3), a small workload (10x smaller than medium), and, where possible (i.e., where the workload fits in the RAMDisks), a 10x larger, large workload. These sizes cover a range of different applications, such as the ones presented in Sections 3.3.3 and 3.3.4 and studied by Wozniak and Wilde [169], Shibata et al. [147], and Bharathi et al. [33]. Results for the small workload are omitted since the different configurations exhibit similar performance and the predictions are inside the confidence interval. Thus, while the predictions are accurate and potentially useful, analyzing them does not make a useful scenario for this evaluation.

Pipeline benchmark. A set of compute tasks are chained in a sequence such that the output of one task is the input of the next task in the chain (Figure 3.3). A pipeline-optimized storage system stores the intermediate pipeline files on the storage node co-located with the application. Later, the workflow scheduler places the task that consumes a file on the same node, increasing data access locality. Here, 19 application pipelines run in parallel and go through three processing stages that read input from the intermediate storage and write the output to the intermediate storage. The large workload produces too much data to fit in an in-memory intermediate storage system on the TB20 testbed. Section 3.7 discusses initial results for spinning disks.

Evaluation of the results. For the optimized configuration (WOSS), the predictor has almost perfect accuracy (Figure 3.4). For the default data placement policy (DSS), however, predicted times are 9% shorter than the actual results. In this case, all clients stripe (write) data to all machines in the system; similarly, all machines read from all others. This creates complex

Figure 3.3: Pipeline, Reduce, and Broadcast benchmarks.
Nodes represent workflow stages and arrows represent data transfers through files. The file sizes represent the medium workload. The part of the flow that is repeated runs on 19 machines in this evaluation.

Figure 3.4: Actual and predicted average execution time for the pipeline benchmark with a medium workload.

interactions among all components in the system, leading to contention and chunk transfer retries due to connection initiation timeouts caused by network congestion, which should be the main source of prediction inaccuracies.

Reduce benchmark. In this benchmark, a single compute task uses input files produced by multiple tasks. 19 processes run in parallel on different nodes; each consumes an input file and produces an intermediate file. In the
next stage of the workflow, a single process reads all intermediate files and produces the final output file. A possible data placement optimization is to use collocation – placing all input files on one node and exposing their location, which is later used by the scheduler to place the reduce task on that machine. For the WOSS configuration, the collocation optimization is enabled for the files used in the reduce stage; for the remaining files, the locality optimization is enabled.

Evaluation of the results. Similar to the pipeline benchmark, predictions for the reduce benchmark are close to the actual performance. In fact, they are within 9% of the actual average for a medium workload, and within 13% of the actual performance for a large workload (Figure 3.5). More importantly, they capture the relative improvements that the pattern-specific data placement policies bring. Note that Figure 3.5b captures the behaviour of a heterogeneous scenario: to accommodate the amount of data produced, a faster machine with a larger RAMDisk runs the reduce stage. Despite this heterogeneity, the predictor captures the system performance with accuracy similar to a homogeneous system.

When the collocation and locality optimizations are not enabled, the challenge of capturing the system behaviour exactly is similar to the pipeline case: capturing the complex interactions among all machines in the system. When the specific data placement is enabled, however, the challenge is different: there is high contention created by having several clients write to the same storage node (the one that performs the reduce phase). Figure 3.5c shows the results for the two stages of the large workload separately to show how the predictor captures these cases.

Figure 3.5: Actual and predicted average execution time for the reduce benchmark for the medium and large workloads, and per stage for the large workload.

Broadcast benchmark. Nineteen processes running in parallel on different machines consume a file that is created in an earlier stage by one task. A possible optimization for this pattern is to create replicas of the file that will be consumed by several tasks.

Evaluation of the results. Figure 3.6 shows the results for the broadcast pattern with a medium workload using a WOSS system configured with 1, 2, or 4 replicas (the large workload shows a similar trend and is omitted here). For this benchmark, all predictions matched the actual results: predictions are inside the interval of average ± standard deviation, with just a 1-3% difference from the average.

This experiment highlights an interesting case for the predictor. According to the structure of the pattern and the results reported by Vairavanathan et al. [159], creating replicas could improve the performance of the broadcast pattern.
The results, however, show that creating replicas does not really improve performance in this case, since data striping already avoids the contention on a single node. Creating replicas reduces the pressure on a given machine (as Vairavanathan et al. [159] show to be the case for this benchmark), but this benefit is cancelled by the overhead of creating the replicas. The predictor captures the impact of these different configurations.

Figure 3.6: Actual and predicted performance for the broadcast benchmark with a medium workload. The experiment uses the WOSS system while varying the replication level (1, 2, and 4 replicas).

3.3.2 The Pipeline Benchmark at Scale

This section expands the analysis of the synthetic benchmarks to answer the following questions:

• How accurate are the predictor's estimates on a different platform?
• How does the predictor capture the behaviour of larger scale systems for the synthetic benchmarks?

To answer these questions, this section presents the results for the pipeline benchmark at a larger scale on a Grid5000 testbed with 101 machines (TB101). The pipeline benchmark is used because it is the one with the largest gap between the predicted and the actual execution, and it is the most I/O intensive – stressing the network and the metadata manager, a component well known for being a potential bottleneck for this type of cluster-based storage system. This section shows the results for this benchmark in a weak scaling set-up using three different scales (25, 50, and
100 nodes), the medium workload, and the two configurations (DSS and WOSS). For this case, the DSS option uses a stripe width of 20 instead of 19 as for TB20.

Figure 3.7: Actual and predicted average execution time for the pipeline benchmark with a medium workload for 25, 50, and 100 nodes on testbed TB101. Note the different scales for the x-axis.

Figure 3.7 presents the results. The predictor produces estimates that differ on average by 15% from the actual time, are within 22% of the actual results for all cases, and are close to the interval delimited by the standard deviation. More importantly, the predictions indicate which configurations lead to higher performance.

3.3.3 Supporting Decisions for a Real Application

Section 3.3.1 presented an evaluation of the predictor's ability to accurately estimate the turnaround time of synthetic benchmarks, and the impact of different data placement optimizations. This section targets more complex scenarios where the user has to deal with a real application, allocation decisions, as well as the choice of the storage system configuration. Further, while Section 3.3.1 evaluates prediction accuracy when the application and storage system are co-deployed, this section evaluates accuracy when they are deployed on separate nodes.7

Specifically, this section demonstrates the predictor's ability to properly guide the user, or a search algorithm, to the desired configuration, focusing on two provisioning scenarios:

Scenario I assumes that the user has full access to a fixed-size cluster – a common set-up in several university research labs.
The problem: How should the system be partitioned between the application and the intermediate storage, and what should the intermediate storage system configuration be for the best overall performance?

Scenario II explores the provisioning problem with cost constraints (e.g., in HPC centers with limited user budgets, or cloud environments). The problem: For a fixed workload, what is the cost vs. turnaround time trade-off space among the deployment options?

Workload. The experiments explore these two scenarios with a real workflow application: BLAST [19], a DNA search tool for finding similarities between DNA sequences. In the BLAST workflow, each node receives a set of DNA queries as input (a file for each node with 200 search queries) and all nodes search the same DNA database file stored on the intermediate storage. Each machine produces one output file, and the files are combined at the end of the application execution. The input files are transferred to the intermediate storage system prior to application execution. The traces are collected based on one execution of the application, as described in Section 3.2.5.

Deployment scenario. Among the 20 machines in the testbed TB20, one node coordinates the BLAST tasks' execution and runs the storage system manager. The remaining nodes either execute tasks from the workload or act as storage nodes.

Experimental methodology. The plots report the average of at least 20 runs, leading to 95% confidence intervals for all experiments. Since

7 Several factors may influence the decision to collocate (or not) the components (e.g., the limited amount of RAM on BG/P). Past work employed both approaches (e.g., [14, 52, 156, 174]) and, thus, this evaluation explores both.

Figure 3.8: The BLAST application.
The database is used by all nodes, which search in parallel for different queries.

these confidence intervals are small (less than 5% of actual values), they are omitted to reduce clutter.

Scenario I: Configuring a Fixed-size Cluster

This scenario explores the following question: Given a fixed-size cluster, how should the nodes be partitioned between the application and the intermediate storage, and what is the intermediate storage system configuration that yields the highest application performance?

Figure 3.9 shows the application execution time for different partitionings and storage system configurations. For this application, chunk size is the configuration parameter with the highest impact on performance; thus, to limit the points plotted in the figure, this evaluation focuses on it only. The predictor correctly captures the lack of impact of the other parameters through additional runs exploring these parameters. Additionally, by evaluating chunk size rather than the collocation of storage with client nodes, this section covers results for a configuration parameter not covered in Section 3.3.1.

Figure 3.9 highlights several important points. First, the performance difference between the different configurations is significant: up to a difference of 10x between the best and the worst configuration, even for the same chunk size. Second, the results show that the system achieves the fastest processing time with a partitioning of 14 application nodes and 5 storage nodes, and a chunk size of 256KB (4x smaller than the default size) – a configuration that was not obvious beforehand. Third, the experiment shows that the predictor accurately captures the system performance under different partitioning strategies and storage system configurations. Indeed, the overall error of the predictions is small: up to 10% of the average, and always within the standard deviation. The errors are also smaller than for the synthetic benchmarks, since there is less stress on the storage system.
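A search over the partitioning and configuration space like the one shown in Figure 3.9 can be automated once a predictor is available. The sketch below assumes a hypothetical predict(app_nodes, storage_nodes, chunk_kb) interface that returns a predicted runtime; it is an illustration, not the predictor's code:

```python
def best_partition(total_nodes, chunk_sizes, predict):
    """Exhaustively search partitionings of a fixed-size cluster between
    application and storage nodes, combined with a chunk size, keeping
    the configuration with the lowest predicted runtime.
    `predict` stands in for the performance predictor (hypothetical)."""
    best = None
    for app_nodes in range(1, total_nodes):
        storage_nodes = total_nodes - app_nodes
        for chunk_kb in chunk_sizes:
            t = predict(app_nodes, storage_nodes, chunk_kb)
            if best is None or t < best[0]:
                best = (t, app_nodes, storage_nodes, chunk_kb)
    return best  # (predicted_time, app_nodes, storage_nodes, chunk_kb)
```

Because each prediction is cheap relative to an actual run, such an exhaustive sweep over a few dozen configurations is practical.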
Finally, the most important point is that the predictor can, in fact, correctly lead the user or a search algorithm to the desired configuration.

Figure 3.9: Application runtime (log scale) for a fixed-size cluster of 20 nodes. The x-axis represents the number of nodes allocated to the application (storage nodes = 20 − application nodes). The three plots represent the runtime for different configurations (chunk sizes of 256KB, 512KB, and 1024KB).

Figure 3.10: Allocation cost (total CPU-hours on the left y-axis, log scale) and application turnaround time (right y-axis, log scale, different scale among plots) for fixed-size clusters of 11, 17, and 20 nodes while varying the chunk size. The x-axis represents the number of nodes allocated to the application (storage nodes = total nodes − application nodes). The figures show the actual (lines) and the predicted (arrows) cost/performance.

Scenario II: Provisioning in an Elastic and Metered Environment

This scenario assumes an environment where users are charged in proportion to the cumulative CPU-hours used and have a more complex trade-off between cost and time-to-solution; for example, they aim for the best application turnaround within a certain dollar budget. This scenario aims
to explore the user's provisioning decisions by revealing the details of this trade-off. Specifically, this scenario helps the user answer the following question: What is the allocation size, and how should it be partitioned and configured to best fit the constraints and optimization criterion?

Figure 3.10 shows the application execution time and allocation cost (measured as number of nodes × allocation time) for different cluster sizes, different partitioning configurations, and different chunk sizes. Similar to Scenario I, Figure 3.10 shows that the predictor captures the system performance with significant accuracy.

Figure 3.10 also shows that an allocation of 11 nodes, with a partitioning of 8 application nodes and 2 storage nodes, and a chunk size of 256KB offers the lowest cost. More importantly, this figure points out an interesting case for the analysis of cost vs. time-to-solution: the user can analyze the plot to verify that an option with an allocation of 20 nodes actually offers almost 2x higher performance at a marginal 2% higher cost.

3.3.4 Increasing Workflow Complexity and Scale: Montage on TB101

This section aims to answer the following question: "Can the predictor support user decisions for more complex scenarios?" Specifically, the goal is to evaluate a workflow composed of more stages, tasks, and patterns, executing a more data-intensive workload at a larger scale. To answer this question, this section focuses on evaluating how accurate the estimates are for Montage [106], a complex astronomy workflow composed of 10 different stages (Figure 3.11), with varying characteristics in terms of the number of tasks, volume of data, and data access patterns. The workflow uses the 'reduce' pattern in two stages and the 'pipeline' pattern in 4 stages (as indicated by
the labels on the arrows in Figure 3.11). Note that Montage's diversity offers a rich use case for the predictor; in fact, each stage could be seen as a different application. Zhang et al. [175] present a detailed characterization of the Montage workflow.

Figure 3.11: Montage workflow. Labels on the arrows represent the tags used to indicate the data usage patterns.

To verify that the predictor can support configuration decisions, Montage is executed in different deployment sizes on TB101. For this application, clients are collocated with storage nodes and MosaStore is deployed on spinning disks, verifying yet another configuration parameter. Chunk size variations are omitted since chunk size does not impact actual Montage performance (which is well captured by the predictor).

Workflow characteristics. The I/O communication intensity between workflow stages in Montage is highly variable (presented in Table 3.4 for the workload used). Overall, the workflow includes over 7,500 tasks and

Stage        Data    #Files  #Tasks  File Size        I/O Time
stageIn      1.9GB   957     1       1.7MB – 2.1MB    –
mProject     8GB     1910    955     3.3MB – 4.2MB    10%
mImgTbl      17KB    1       1       17KB             5%
mOverlaps    336KB   1       1       336KB            0.3%
mDiff        2.6GB   5640    2833    100KB – 3MB      48%
mFitPlane    5MB     1420    2833    4KB              5%
mConcatFit   150KB   1       1       150KB            0.5%
mBgModel     20KB    1       1       20KB             0.1%
mBackground  8GB     1913    955     3.3MB – 4.2MB    46%
mAdd         5.9GB   3       1       165MB – 3GB      49%
mJPEG        47MB    1       1       47MB             18%
stageOut     3GB     2       1       47MB – 3GB       –

Table 3.4: Characteristics of Montage workflow stages. Percentage of time spent on
The percentage of time spent on I/O operations is based on an execution on the testbed TB20, using all nodes.

generates over 10,000 files, with sizes ranging from 2KB to 3GB. In total, about 30GB of data is read from or written to storage.

Evaluation of the results. Figure 3.12 shows the actual and predicted workflow execution time for Montage on allocations of different sizes. Plots report the average over 10 runs, for which the standard deviation is approximately 3% and omitted to reduce clutter. Figure 3.12 shows that, overall, the predictor captures the application performance well. Despite the complexity of the workflow and the scale, the predictor is accurate: the average prediction error is 3%, the smallest is less than 1%, and the maximum prediction error is 7%.

[Figure 3.12: Montage time to solution (with the y-axis in log scale) for varying numbers of nodes deployments on TB101.]

Prediction accuracy per stage. Since each stage can be very different from the others, the evaluation also considers how the predictor performs per stage for different numbers of nodes. The average prediction error per stage was 5%, with the 5 stages (mProject, mDiff, mFitPlane, mAdd, and mJPEG) that account for over 70% of the time having a maximum of 4% error. The highest error was 25% for mBackground on 5 nodes, which is due to the short execution time of each task (less than 2 seconds), making it sensitive to any variation in the platform.

Time-to-solution vs. CPU cost. Similar to the BLAST evaluation (Section 3.3.3), this section analyzes the decision of cost vs. time-to-solution.
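Analyses like this cost vs. time-to-solution comparison can be automated over the predictor's outputs. The sketch below is illustrative only: the candidate allocations and predicted times are made-up numbers chosen to mirror the shape of the trade-off discussed in the text (an 11-node allocation that is cheapest, and a 20-node allocation that is much faster at marginally higher cost), and `analyze_allocations` is a hypothetical helper, not part of the predictor.

```python
# Hypothetical post-processing of predictor outputs: each candidate allocation
# is (total_nodes, predicted_time_s); cost is nodes x time, as in the text.

def analyze_allocations(predictions, cost_slack=0.05):
    options = [(nodes, time, nodes * time) for nodes, time in predictions]
    cheapest = min(options, key=lambda o: o[2])
    fastest = min(options, key=lambda o: o[1])
    # Options within `cost_slack` of the lowest cost, fastest first.
    near_cheapest = sorted(
        (o for o in options if o[2] <= (1 + cost_slack) * cheapest[2]),
        key=lambda o: o[1],
    )
    return cheapest, fastest, near_cheapest

# Made-up predictions mirroring the trade-off above: 11 nodes is cheapest,
# while 20 nodes is ~1.8x faster at ~2% higher cost.
preds = [(11, 400.0), (15, 300.0), (20, 224.0)]
cheapest, fastest, ranked = analyze_allocations(preds)
```

With such a helper, the user can either take the cheapest option outright or pick the fastest option whose cost stays within her tolerance.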
Figure 3.13 shows the workflow execution time (on the y-axis) and the total cost in CPU hours (on the x-axis), actual and predicted, for Montage using different cluster sizes (numbers of nodes indicated over the line points). Note that adding more nodes increases the cost almost linearly, but the performance improvement is small after 15 nodes and there is virtually no improvement after 75 nodes. This is a result of the Montage workflow structure, where some stages are not able to exploit the parallelism offered by these allocations.

This scenario highlights the non-linear relation in the cost vs. time-to-solution trade-off, and how the predictor can accurately capture this relationship. The predictor can show the user the allocation options and each of their impacts on application performance and cost. When the user is interested in optimizing for just one metric, she can opt for the allocations in the top-most or right-most points in Figure 3.13. If the user picks the largest number of nodes available, she indeed obtains the fastest time-to-solution. However, Figure 3.13 also highlights a set of options (5 to 25 nodes) that are not so slow or expensive in comparison to the others. By analyzing the plot, she can opt for an allocation that costs 4 times less than the fastest allocation, yet is just 20% slower.

[Figure 3.13: Montage running cost in CPU allocated time (x-axis) vs. time-to-solution (y-axis) for varying fixed-size deployments (number of nodes shown in the text beside the data points) on TB101. Note that the axes do not start at 0.]

Results on TB20. To explore a different execution platform, an additional evaluation considers TB20 with MosaStore deployed on RAMDisks [57]. Figure 3.14 summarizes the results.
The workload is smaller; the workflow generates over 650 files, with sizes ranging from 1KB to over 100MB, and about 3GB of data are read from or written to the storage system. Overall, the predictions obtain "good-enough" accuracy to support the user's decisions about provisioning. The average prediction error is 9%, the smallest is less than 1%, and the maximum prediction error is less than 15%.

[Figure 3.14: Montage time to solution for fixed-size deployments from 1 to 19 nodes on TB20 – an additional node runs the storage manager.]

3.3.5 Predictor Response Time and Scalability

Sections 3.3.1, 3.3.3 and 3.3.4 describe the accuracy of the predictor and its ability to guide storage system configuration in the context of synthetic benchmarks and real applications (Requirement R1 from Section 3.2.1). They also demonstrate how the accuracy of the predictor behaves in a number of different scenarios (Requirement R4 from Section 3.2.1) – e.g., as the storage configuration, or the number of nodes in the system, varies in total, for clients, and for storage nodes.

The goal of this section is to demonstrate that the predictor satisfies requirement R2 listed in Section 3.2.1: to be useful, the predictor should use much less resources (machines × time) than the run of an actual application, even for large systems and I/O-intensive workloads. Specifically, the goal is to answer the following questions:

• How much allocated computing power (in terms of number of machines × time) does the prediction use, in comparison to the actual execution?

• How does the predictor's response time increase as the complexity of the simulated system increases?

Prediction effort.
Overall, the predictor uses orders of magnitude less resources than the corresponding workflow execution: 200x–500x for BLAST, 300x–600x for Montage, and 2000x less CPU time for the reduce benchmark (all for TB20 when using all 20 nodes). For the reduce benchmark scenario, the savings can be estimated to be as high as 2 × 10³ less resources.

Scalability. The remainder of this section focuses on the reduce benchmark since, among the synthetic benchmarks, it exhibits a more complex pattern than pipeline, and one similar to broadcast. Using no storage system optimizations makes this scenario more complex, since it generates more events to be simulated.

In this section, each point in the plots is the average of 10 rounds, with error bars representing the standard deviation. The predictor is executed on one node of the testbed TB20 described in Section 3.3.

Figure 3.15 shows the results for increasing the amount of data while keeping the number of nodes constant at 20, as is done in the actual experiments presented in the accuracy analysis in Section 3.3 for the storage system under the default configuration. It shows evidence of how much faster the predictor can be in comparison to the actual execution presented earlier in Figure 3.5. Consider the case of a medium workload (100× the small workload): the predictor takes an average of 41 milliseconds to predict the behaviour of a system whose actual execution takes 4.5 seconds on 20 nodes, resulting in almost 2000× less resources for the prediction than for the actual execution (100× less time and 20× fewer machines). If the time to set up the storage system and launch the tasks were taken into account, the 4.5 seconds would become almost two minutes, increasing the difference even further, to 6 × 10⁷ times less resources. The results in the case of the large workload are even better: the predictor
takes 300 milliseconds, as opposed to 39 seconds on 20 nodes for the actual benchmark, leading to a difference of 2,600× less resources.

[Figure 3.15: Predictor response time for 20 nodes with an increasing amount of data (varying the workload) in the system (log scale for both axes).]

Figure 3.16 presents a weak scaling experiment in which the data size scales up with the number of nodes, up to 1000 nodes. The idea here is to show how the response time increases as the complexity of the predicted scenarios increases along both axes: the number of nodes and the amount of data. This experiment is based on the reduce benchmark for large workload sizes. Thus, there are three stages, each one writing and reading files. The stage-in and intermediate stages have files of 100MB per node, and the stage-out has N× the intermediate size. In this experiment, the point for 20 nodes is exactly the prediction for the actual benchmark as shown in Section 3.3.1, taking 300 milliseconds in the simulation as opposed to 39 seconds in the actual benchmark.

Note that the response time increases linearly; the rightmost point in this plot shows the time to predict the behaviour of a system with 1000 nodes writing a total of 300,000MB (3 stages × 1000 nodes × 100MB) and reading the same amount of data. That is, it moves almost 600GB around. Moreover, the simulation for the 1000-node and 600GB scenario takes 37 seconds on average for the default configuration (DSS line), which is less than the amount of time needed to run the actual benchmark for the much simpler scenario of 20 nodes and approximately 12GB.

[Figure 3.16: Response time in a weak scaling scenario. The data size scales up with the number of nodes up to 1000 nodes, in increments of 25 nodes. Each point represents the average of 10 rounds. The standard deviation is small (below 3%) when the number of nodes is larger than 50, and it is omitted to reduce clutter.]

The WOSS line shows the results for the system with data-placement optimizations. The scenario for 1000 nodes takes almost 25 seconds. It takes less time to predict because it generates fewer events to be simulated: the data-placement optimizations reduce the pressure on the network during the simulation as well. Therefore, the more local the data manipulation, the fewer the events and the faster the prediction time.

Implementation Details

The main simulation loop has a complexity of O(N log M), where N is the total number of events to be simulated, and M is the number of events currently in a priority queue. The system tends to exhibit N ≫ M as the complexity of the simulated storage system grows. Therefore, the results show linear scalability.

[Figure 3.17: Prediction time for 20 nodes and increasing amount of data in the system. Note the log-log scale.]

This logarithmic behaviour for an increasing number of nodes while keeping the amount of data constant can also be verified for systems with a small number of nodes (see Figure 3.17). This happens because the fewer nodes the system has, the more loopback data transfers it has and, therefore, the fewer events it requires to simulate the network behaviour. The increase in the number of events, however, is not linear, but rather logarithmic.
This logarithmic growth is confirmed by tracking the number of simulated events. Roughly, the simulator processes between 1,250 and 1,820 events per millisecond on one machine of the testbed TB20 described in Section 3.3.

3.4 Predicting Energy Consumption

Section 3.2 presents the design of the prediction mechanism, including a queue-based model and a system identification procedure to seed the model, which is able to estimate traditional performance metrics, such as the turn-around time or cost of workflow applications. This section focuses on the following question: Which extensions does the performance predictor need in order to capture energy consumption behaviour in addition to traditional performance metrics?

Specifically, this section presents an extension of the initial model presented in Section 3.2.3 to take energy predictions into consideration – a metric that has grown in importance due to its impact on cost, or even on the feasibility of building large computing infrastructures [29].

This extension relies on a coarse-grained energy consumption model associated with the energy characteristics of the underlying computing platform. My collaborators and I [54] have used this approach to estimate the energy consumption of data deduplication, as described in detail in Chapter 5, and it has also been used by Ibtesham et al. [96]. For workflow applications, Hao Yang was the main researcher responsible for applying and evaluating this approach in targeting the energy consumption metric use-case, which is summarized in this section. Yang et al. [171] present a detailed evaluation of this approach⁸.

Overall, the energy predictions obtained an average accuracy of more than 85% and a median of 90% across different scenarios, while having response times and resource usage similar to those of the predictions presented in Section 3.3.5.

The rest of this section is organized as follows.
Section 3.4.1 describes a simple analytical energy consumption model that satisfies the requirements described in Section 3.2.1. Section 3.4.2 presents an evaluation of energy consumption predictions using synthetic benchmarks and real workflow applications.

3.4.1 Energy Model Extension

As discussed in Sections 2.2.3 and 3.2.5, a typical task in a workflow application progresses as follows: (i) the machine running the task reads the input data from the intermediate storage (likely through multiple I/O operations); (ii) the task processes the data; and (iii) the machine writes the output to the intermediate storage. Although these phases are not completely separated, they have little overlap.

⁸ Energy Prediction for I/O Intensive Workflows [171]. Hao Yang, Lauro Beltrão Costa, and Matei Ripeanu. Proceedings of the 7th Workshop on Many-Task Computing on Clouds, Grids, and Supercomputers, MTAGS '14, pages 1–6. ACM, November 2014.

The extension of the model of Section 3.2.3 described herein captures the energy consumption of a task by assuming that these phases are non-overlapping and by associating each phase with a different power profile: (i) idle profile – the power to keep the machine on ($P^{idle}$); (ii) application processing profile – the additional power drawn to run a task on the CPU when not performing storage operations, beyond the amount already drawn in the idle state ($P^{App}$); (iii) local storage profile – the extra power to perform read and write operations on the storage node ($P^{sm}$); and (iv) network profile – the extra power to perform network data transfers ($P^{net}$).

With this mindset, the total energy spent during a workflow application execution can be expressed as the sum of the energy consumption of the individual nodes: $E_A = \sum_{n=1}^{N} E^{tot}_n$, where $N$ is the total number of nodes ($N_{cli} \cup N_{sm}$), and $E^{tot}_n$ is the total energy consumption of node $n$.
Let

$$E^{tot}_n = E^{idle}_n + E^{App}_n + E^{sm}_n + E^{net}_n \quad (3.4)$$

where $E^{idle}_n$ is the energy consumed to keep node $n$ on, given by

$$E^{idle}_n = P^{idle}_n \times T^{idle}_n \quad (3.5)$$

where $T^{idle}$ is the application execution time. Similarly, the energy spent by a node in each profile is simply calculated by $E^{profile}_n = P^{profile}_n \times T^{profile}_n$.

The rest of this section describes the extension of the seeding process used to collect the power measurements, and explains the integration of the initial performance predictor and the energy model.

Energy Model Seeding: System Identification Extension

To use the energy model, one has to provide $P^{idle}_n$, $P^{App}_n$, $P^{sm}_n$, and $P^{net}_n$ in addition to the process already described in Section 3.2. The seeding script was extended to collect power profiles. Specifically, the script samples the power when the node is idle over a period of time ($P^{idle}_n$). For $P^{sm}_n$ and $P^{net}_n$, the procedure is similar to the one described to estimate $\mu_{sm}$ and $\mu_{net}$ – the only difference is that the scripts also measure the power. Finally, the stress [3] utility tool is used to estimate $P^{App}_n$ by imposing a specific load on the CPU while the scripts measure power. Table 3.5 lists the parameters of the energy model and the seeding values used in the evaluation presented in Section 3.4.2.

Description                                Symbol    Seeding Value
Idle node power consumption                P^idle    91.6W
Additional power when stressing CPU only   P^App     125.2W − P^idle
Additional power for storage operations    P^sm      129.1W − P^idle
Additional power for network transfers     P^net     127.7W − P^idle
Peak power (not used for seeding)          –         225.3W

Table 3.5: Energy parameters and values describing one node (two 2.3GHz Intel Xeon E5-2630 CPUs, 32GB RAM, and a 10 Gbps NIC) from TB11, presented in Section 3.4.2. Additional power means the power drawn in each of these states in addition to the power already drawn when the machine is idle.

Implementation Extension

Figure 3.18 presents the integration of the initial performance predictor and the energy-related additions.
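Equations 3.4 and 3.5, seeded with the Table 3.5 values, can be evaluated directly. The sketch below is illustrative only: the `node_energy` helper and the per-phase time breakdown are hypothetical, while in the actual predictor the times come from the simulator, as described next.

```python
# Power profiles (W) seeded from Table 3.5; the "additional" terms are
# already net of idle power, matching the P_idle/P_App/P_sm/P_net profiles.
P_IDLE = 91.6
P_APP = 125.2 - P_IDLE  # extra power while stressing the CPU only
P_SM = 129.1 - P_IDLE   # extra power for storage operations
P_NET = 127.7 - P_IDLE  # extra power for network transfers

def node_energy(t_total, t_app, t_sm, t_net):
    """Equation 3.4, E_tot = E_idle + E_app + E_sm + E_net, where each term
    follows Equation 3.5 (E = P x T); t_total is the application run time (s)."""
    return P_IDLE * t_total + P_APP * t_app + P_SM * t_sm + P_NET * t_net

# Hypothetical per-node breakdown: a 100 s run with 60 s of CPU work,
# 25 s of storage operations, and 15 s of network transfers.
energy_joules = node_energy(100.0, 60.0, 25.0, 15.0)
```

Summing `node_energy` over all nodes yields the application-level estimate $E_A$ defined above.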
The simulator tracks the time each node spends on the different phases of a workflow task: executing the compute-intensive part of the application, writing to and reading from storage, and receiving/sending network data. The energy module added to the performance predictor receives these times from the simulator and the power seeding values from the scripts, and estimates the energy consumption.

[Figure 3.18: Integration of the performance predictor and the energy model, as described by Yang et al. [171]. The predictor receives the same input described in Section 3.2, the information described in Table 3.5, and a breakdown of the time spent in each power profile. The energy module passes this input to the energy model to estimate the energy consumption.]

3.4.2 Energy Evaluation

This section presents the evaluation of the prediction mechanism's accuracy when it comes to energy consumption. Similarly to Section 3.3, this analysis uses synthetic benchmarks and real workflow applications with different storage configurations. Additionally, this section briefly analyzes the prediction mechanism's accuracy in predicting the impact of power-tuning techniques. Yang et al. [171] present a more detailed evaluation of using this prediction approach to estimate the energy consumption of workflow applications.

The experimental setup is the same as that described in Section 3.3. MosaStore is the storage system used. The label DSS refers to experiments using the Default Storage System configuration, and the label WOSS refers to experiments where the system configuration is optimized for a specific workflow pattern.

Testbed.
A lack of infrastructure to measure power prevents this evaluation from using the same testbeds described earlier (TB20 and TB101). Thus, a different testbed (TB11) is used for the energy evaluation: the Grid5000 'Lyon' cluster [41], where each node has two 2.3GHz Intel Xeon E5-2630 CPUs, 32GB RAM, and a 10 Gbps NIC. One node runs the metadata manager and the workflow coordination scripts, while the other nodes run the storage modules, the client SAI, and the application processes. This platform limits the scale of the experiments to 11 nodes.

Power measurement. Each node is connected to a SME OmegaWatt power-meter⁹, which provides 0.1W power resolution at a 1Hz sampling rate. To measure the total energy consumption, the scripts that run the experiments aggregate the power consumption over the duration of the benchmark execution. The evaluation does not take into account the energy consumed by the node that runs the metadata service and workflow scheduler, as these are not subject to the configuration changes that the prediction mechanism targets.

Evaluation Metrics. Similar to the time comparison, the evaluation focuses on prediction accuracy by comparing the predicted and actual energy consumption, and reports the prediction errors as $|1 - E_{pred}/E_{actual}|$.

Synthetic Benchmarks: Workflow Patterns

Figures 3.19a, 3.19b, and 3.19c present the predicted and actual energy consumption for the pipeline, reduce, and broadcast synthetic benchmarks, respectively. Overall, the energy consumption predictions have an average error of 12.5% and are typically within one standard-deviation interval.
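For reference, the error metric above is straightforward to compute; the values in the sketch below are hypothetical, chosen only to illustrate the calculation:

```python
def prediction_error(predicted, actual):
    """Relative prediction error |1 - E_pred / E_actual|, as a fraction."""
    return abs(1.0 - predicted / actual)

# Hypothetical example: a 38 kJ prediction against a 40 kJ measurement.
err = prediction_error(38.0, 40.0)  # 0.05, i.e., a 5% error
```

The same formula applies unchanged to the time-prediction errors reported in Section 3.3.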
For DSS, the pipeline benchmark has the best accuracy, with a 5.2% error; reduce is at 16.4%, and broadcast is at 15.9% (just one replica).

For WOSS, the predictions have an error of 13.4% for pipeline, 12.2% for reduce, and 16% for broadcast (using 4 replicas in the WOSS configuration). Despite the fact that the prediction mechanism is more accurate for execution time than for energy, the predictions capture the energy consumption for both the DSS and WOSS configurations with adequate accuracy: they accurately predict the energy savings brought by WOSS, and can support users in making decisions about storage configurations when energy consumption is used as the optimization criterion.

[Figure 3.19: Actual and predicted average energy consumption for the synthetic benchmarks: (a) Pipeline, (b) Reduce, (c) Broadcast.]

Predicting the Energy Consumption of Real Applications

As shown above, the predictor is able to estimate the energy consumption of synthetic benchmarks, as well as the impact of different data-placement optimizations on this metric. To understand how accurately the predictor can estimate the energy consumption of real applications, Figures 3.20a and 3.20b report the energy consumption for BLAST and Montage.

The workloads for both applications are scaled down because of the smaller scale of the TB11 testbed in comparison to TB20. For BLAST, each node receives 8 DNA sequence queries as input to search on the same database used in Section 3.3.3.

⁹ Details can be found at
For Montage, the workload has approximately 2,000 tasks and is presented in Table 3.6.

Similar to the time predictions, overall, the energy predictions are more accurate for the real applications than for the synthetic benchmarks: the synthetic benchmarks are designed to impose a high stress on the storage system, which results in contention and higher variance, and these are harder to capture when modeling the storage system. The BLAST predictions (see Figure 3.20a) exhibit an average error of 5.2%; Montage (see Figure 3.20b) exhibits a 15.9% average error.

[Figure 3.20: Actual and predicted average energy consumption for the real applications: (a) BLAST, (b) Montage.]

Stage        Data    #Files  File Size
stageIn      320MB   163     1.7MB – 2.1MB
mProject     1.3GB   324     3.3MB – 4.2MB
mImgTbl      50KB    1       50KB
mOverlaps    54KB    1       54KB
mDiff        409MB   895     100KB – 3MB
mFitPlane    1.8MB   449     4KB
mConcatFit   21KB    1       21KB
mBgModel     8.3KB   1       8.3KB
mBackground  1.3GB   325     3.3MB – 4.2MB
mAdd         1.3GB   2       503MB
mJPEG        15MB    1       15MB
stageOut     518MB   2       15MB – 503MB

Table 3.6: Characteristics of the Montage workflow stages for the energy evaluation.

[Figure 3.21: Actual and predicted average energy consumption and execution time for BLAST for various CPU frequencies: (a) Energy, (b) Time.]

Predicting the Energy Consumption Impact of DVFS Techniques

Dynamic Voltage and Frequency Scaling (DVFS), also known as CPU throttling, is an important technique that limits the maximum frequency of processors in order to reduce power consumption, with the drawback of limiting the number of instructions a processor can issue in a given amount of time.
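The tension DVFS introduces between power and runtime can be illustrated with a stylized energy model. All powers and times below are hypothetical; the sketch assumes a CPU-bound stage's runtime scales inversely with frequency, while an I/O-bound stage's runtime stays roughly constant:

```python
def energy(power_watts, runtime_s):
    return power_watts * runtime_s  # Joules

# Hypothetical node powers at the two frequency extremes.
P_HIGH, P_LOW = 220.0, 150.0            # e.g., at 2300 MHz vs. 1200 MHz
T_CPU_HIGH = 100.0                      # CPU-bound stage at full frequency
T_CPU_LOW = T_CPU_HIGH * (2300 / 1200)  # assumed: runtime ~ 1/frequency
T_IO = 100.0                            # I/O-bound stage: runtime ~ constant

cpu_ratio = energy(P_LOW, T_CPU_LOW) / energy(P_HIGH, T_CPU_HIGH)
io_ratio = energy(P_LOW, T_IO) / energy(P_HIGH, T_IO)
# cpu_ratio > 1: throttling *costs* energy for the CPU-bound stage;
# io_ratio < 1: throttling saves energy for the I/O-bound stage.
```

This is the qualitative pattern the evaluation below confirms for BLAST (CPU intensive) and the I/O-only pipeline benchmark.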
Thus, although this technique can reduce power consumption, it is not clear whether (or when) frequency scaling can reduce the energy consumption of different applications, including workflow applications.

To evaluate the predictor's ability to predict the impact of CPU frequency scaling on energy consumption, Figures 3.21 and 3.22 present the energy consumption and execution time for two benchmarks, as presented by Yang et al. [171]: (i) the BLAST application (Figure 3.21), and (ii) the pipeline synthetic benchmark (Figure 3.22), which performs only I/O operations. DVFS is used to set the processors at frequencies of 1200MHz, 1800MHz, and 2300MHz. The seeding procedure is performed independently for each frequency.

BLAST is more CPU intensive, so reducing the frequency increases the runtime and the energy consumed, leading to 85.5% more energy consumption between the minimum and maximum frequencies evaluated. The pipeline benchmark is not strongly affected by CPU performance and, thus, using the minimum frequency does not increase the benchmark execution time significantly.

[Figure 3.22: Actual and predicted average energy consumption and execution time for the pipeline benchmark for various CPU frequencies: (a) Energy, (b) Time. Note that, although the average time for 2300MHz is slightly longer than for 1800MHz, the average times are inside the 95% confidence interval obtained for the average times and, thus, they are considered equivalent for this evaluation.]

More importantly, in the context of this dissertation, the predictions are useful for supporting configuration decisions in these cases: the predictor estimates 11% vs. 17% (actual) potential energy savings for the pipeline, and an increase of 96.5% vs.
an actual 85.5% increase in energy consumption between the minimum and maximum frequencies for BLAST.

These results emphasize that it is not always clear whether DVFS techniques can save energy, since the computational and I/O characteristics of the workflow application have an impact on the actual runtime. More importantly, the prediction mechanism was effective in predicting the energy consumption with adequate accuracy, and in supporting decisions as to when DVFS techniques should be used.

3.4.3 Energy Extension Summary

As is the case for traditional performance metrics (evaluated in Section 3.3), overall, the performance prediction mechanism is able to estimate energy consumption close to the actual consumption and support configuration decisions (Requirement R1). More importantly, the extensions made to the model and seeding procedure were minimal and do not require changes to the storage system (Requirement R2). Additionally, the predictions for the energy cases evaluated use less resources (by one to two orders of magnitude) than the actual runs, as in the scenarios presented in Section 3.3.1 (Requirement R3).

Finally, the evaluation presented in Section 3.4.2 covers power-tuning techniques that target multiple scenarios and success metrics, which is key for providing more evidence that the proposed solution works properly in a number of scenarios (Requirement R4). In fact, Yang [170] has evaluated the performance prediction mechanism proposed in this dissertation, with emphasis on the energy extension, in different scenarios, including yet another testbed; Yang's results are similar to the ones presented in this chapter.

3.5 Related Work

This section describes previous work on different approaches to predicting storage system performance and tuning its configuration parameters.

Model-based analysis. A number of projects use a model-based approach to estimate storage system performance with a given configuration or workload.
Ergastulum [22] targets centralized storage solutions based on a single enclosure to recommend an initial system configuration, and Hippodrome [21] relies on Ergastulum to improve the configuration based on online monitoring of the workload. By considering a distributed system, the predictor proposed in this dissertation handles more complex interactions among system components and more configuration options – which do not exist in centralized solutions.

Simulation-based systems. IMPIOUS (Imprecisely Modeling Parallel I/O is Usually Successful) [122] is a trace-driven simulator that uses an abstract storage system model designed to capture the main mechanisms of parallel file systems. The simulator is simplified to be able to simulate thousands of client and storage nodes, which has the drawback of reduced accuracy, producing estimates that under- or over-estimate the performance by up to 60%. Another trace-driven simulator, PFSsim, is designed specifically for evaluating I/O scheduling algorithms in parallel file systems. PFSsim simulates the storage system at a low component level, simulating the network using OMNeT++ [160] and the disks using DiskSim [1]. In other work, Liu et al. [111, 112] build a framework for simulating the storage system of supercomputing machines. The framework simulates all hardware components, including the compute and I/O nodes, the storage subsystem, and the supercomputing interconnect. Finally, FileSim [70] is a parallel file system simulation framework that simulates file system operations at low granularity, including disk operations and packet-level simulation of network operations. It simulates specific storage system operations, such as data placement or locking algorithms, and can be used to validate new algorithms or metadata services.

Similar to the work presented in this chapter, Thereska et al. [156] proposed a prediction mechanism for a distributed storage system with a detailed model.
To provide such information, they propose using Stardust [158], a detailed monitoring information system that requires changes to the storage system and kernel modules to add monitoring points. This approach enabled their predictor to achieve predictions within 19% of the actual performance, depending on the workload. The predictor proposed herein has achieved similar accuracy on the application workloads evaluated, with the advantages of a lightweight approach to seed the model and of not requiring changes to the system design or kernel modules.

Unlike some of the above-mentioned efforts, this predictor targets simulating a generic distributed storage system architecture on a cluster infrastructure (not special supercomputer machines) without simulating particular storage system operations. It avoids detailed low-level simulation (e.g., disk- or packet-level simulation) without significantly compromising accuracy, enabling the prediction mechanism to efficiently simulate large-scale deployments.

An important difference from past work on storage system simulation is the focus on a whole workflow application and the potential interactions among the workflow's phases – rather than on the average performance for a batch of storage operations (e.g., Hippodrome [21] and Ergastulum [22]) – and on evaluating the performance of the system at larger scale – in contrast to Thereska et al. [156], for example. Additionally, this chapter targets the partitioning problem of splitting the nodes between the application and the intermediate storage.

Monitoring and/or Machine Learning based tuning targeting overall application performance. Behzad et al. [27] present an auto-tuning framework for the Hierarchical Data Format, version 5 (HDF5) I/O library [154]; their solution intercepts the HDF5 I/O calls and injects optimized parameters into parallel I/O calls.
Further, their framework monitors I/O performance and explores the tuning parameter space using a genetic algorithm via actual application runs. Differently from Behzad et al. [27], ACIC's [110] machine-learning approach uses CART to build the model used to guide the optimization of parallel applications. Finally, Zhang et al. [175] have recently proposed an approach to determine the storage bottleneck for workflow applications using a set of benchmarks and actual runs of the target application.

This dissertation's approach enables an exploration of the system at a lower cost. The predictor is able to estimate the performance of a scenario that adds or removes resources and changes the configuration without requiring new runs (or a larger training set) of the application, new executions of benchmarks (as in, e.g., Zhang et al. [175]), or new generations of the genetic algorithm (as in, e.g., Behzad et al. [27]). Further, unlike [27, 110], this work targets workflow applications and predicting the performance of a POSIX-based storage system, including the impact of scheduling decisions in combination with data placement. In fact, the proposed solution can be used, through an adaptation of these machine-learning techniques or other optimization approaches to the target context, to perform auto-tuning.

Finally, Elastisizer [91] and Starfish [92] target a similar problem: automating allocation choices for an entire application. Their work, however, does not address the aspects of workflow applications or storage system configuration, since it focuses on a different class of application: MapReduce jobs.

Predicting Energy Consumption. Past work has targeted modeling the power consumption of a complete machine based on utilization metrics or performance counters. For example, Economou et al. [69] present a method to capture the power consumption of a system based on the idle power, CPU utilization, off-chip memory access count, and the I/O rates of the network and hard disk.
Gurumurthi et al. [84] propose analytical models coupled with events relative to the system's architecture in order to simulate power consumption. In the context of parallel applications, Feng et al. [72] focus on the characteristics of multi-threaded applications. Additionally, Pakin and Lang [129] focus on evaluating the energy savings when DVFS is enabled.

Although these approaches have achieved success, they typically need deeper prior knowledge of the architecture, which leads to a longer prediction time. Moreover, this chapter focuses on the distributed storage layer of workflow applications, which have various patterns and are I/O intensive.

Towards a distributed system and a more I/O intensive workload, Samak et al. [141] present an approach to analyze the power consumption of pipelines to infer the consumption of larger infrastructures, not covering the exploration of different patterns, workloads, and configuration options. Finally, some approaches, like EEffSim [133], present a simulator to evaluate the energy consumption of multi-server storage systems, but lack a validation on real testbeds.

3.6 Predictor Development

The prototype of the performance prediction mechanism is written mainly in Java, with parts in C. Some of the scripts used to run experiments are in Bash and Python. The source code – including the scripts used to run the experiments, aggregate data, and plot the analysis – and the measurements collected are available in an SVN repository10. The execution logs collected during the experiments are available at the NetSysLab cluster, since they are too large to be uploaded to the code repository.

Although, to this date, I have been the main contributor to the code repository, I have also received help from other developers, including Abmar Barros, Hao Yang, and Marcus Carvalho. We have also used automated tests and code reviews, which are also available online11.
For the code reviews, in addition to the developers cited above, I have received help from Elizeu Santos-Neto, Samer Al-Kiswany, and Abdullah Gharaibeh.

3.7 Summary and Discussion

This chapter makes the case for a prediction mechanism to support provisioning and configuration choices of storage for workflow applications. It focuses on predicting the performance of workflow applications when running on top of an intermediate object-based storage system, and presents a solution based on a queue-based model with a number of attractive properties: (i) a generic and uniform system model, (ii) a simple system identification process that requires neither specialized probes nor system changes to perform the initial benchmarking, (iii) low runtime to obtain predictions, and (iv) adequate accuracy for the studied cases.

The user can also use more complex utility functions, based on the predicted performance metrics, to reach her specific goal, as shown by Strunk et al. [152] and Wilkes [167]. For these cases, the user, or an automated tool, can apply different optimization solvers to search the configuration space and propose the most desirable storage configuration and provisioning choices.

The discussion below clarifies some of the limitations of this work and the lessons learned during this design exercise.

10Predictor implementation, simulation results, and its input data are available at [URL]. Actual measurements, configuration deployment, and storage system code can be found at [URL].

What are the main sources of inaccuracies?

Currently, there are sources of inaccuracies at multiple levels. First, the model does not capture all the details of the storage system. For example, support services like garbage collection or storage node heartbeats, as well as the control paths, are simplified to match a generic object-based storage system (MosaStore uses a FUSE-based implementation that would need a more complex control path), and the model assumes that all control messages are of the same size.
Second, the system identification mechanism is constrained to simplify the model even further, at the cost of additional accuracy loss. Third, the model does not capture the infrastructure in detail (e.g., contention at the network fabric level or OS scheduling). Finally, so far, the application driver uses an idealized image of the workflow application; for example, all pipelines are launched in the simulation exactly at the same time, while, in the experiments on real hardware, coordination overheads make them slightly staggered.

What are the sources of inaccuracies particular to the energy consumption predictions?

In addition to the reasons summarized in Table 3.7, which include the simplicity of the model and its seeding mechanism, the error in the predictions for energy arises from some inaccuracies in the input. First, the power meters used provide 1Hz sampling rates, which excludes the energy consumed at sub-second granularity from the actual measurements. Second, the time predictor should provide an accurate breakdown of the time spent in each power profile, which is hard to validate without a more intrusive approach. Additionally, the additive energy model is a simplification, since it does not capture the lack of energy proportionality of current platforms.

What are the trade-offs of optimizing for time, versus optimizing for cost and energy?

To optimize for time, one can use an optimized storage configuration or add more resources (e.g., add more nodes). The former approach usually
exploits data locality and location-aware scheduling, which, in addition to time, generally reduces the amount of data transfers and, thus, reduces the energy costs as well. The idle power, however, remains a large portion of the total power consumption because, though they are more energy proportional than their predecessors, state-of-the-art platforms are not perfectly energy proportional (testbed TB11's idle power is 40% of the peak power). Section 5.4 extends this discussion for two generations of machines with different power proportionality in the context of the data deduplication storage technique.

Increasing the allocation size of an application may improve performance at the cost of money and more energy. Section 3.4.2 demonstrates that, for a subclass of applications, it is additionally possible to optimize for energy only by using power-tuning techniques such as CPU throttling. These techniques, however, need to be carefully considered, as they can bring energy savings or lead to additional costs, depending on the specific application patterns.

  Source                  Examples
  Storage system          Fine granularity for the activity inside each
                          component, detailed execution path, or maintenance
                          services such as failure detection and garbage
                          collection.
  Infrastructure          Contention at the network fabric level, complex
                          network topology, or detailed scheduling overhead.
  Application             Tasks launched at the same time, absence of faults
                          by crash, or machines with degraded performance.
  System identification   Assumptions about client and storage service times.

Table 3.7: Summary of the limitations and main sources of inaccuracies for the prediction mechanism.
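To make the additive energy model and the energy-proportionality caveat concrete, the following is a minimal, hedged sketch. All power values, function names, and workload numbers are illustrative assumptions, not the dissertation's implementation or measurements; the only figure taken from the text is an idle power at 40% of peak.

```python
# Hedged sketch of an additive energy model: total energy is approximated
# as the sum, over nodes, of power draw multiplied by the time spent in
# each state. The idle power acts as a floor (assumed here at 40% of an
# assumed 200 W peak, mirroring the 40% figure cited for testbed TB11).

IDLE_W = 80.0   # assumed idle power per node (W)
PEAK_W = 200.0  # assumed peak power per node (W)

def node_energy(busy_s, idle_s, load=1.0):
    """Energy (J) for one node: idle floor plus a load-scaled dynamic part."""
    dynamic_w = (PEAK_W - IDLE_W) * load
    return (IDLE_W + dynamic_w) * busy_s + IDLE_W * idle_s

def allocation_energy(n_nodes, busy_s, makespan_s, load=1.0):
    """Total energy for an allocation; nodes idle for makespan - busy time."""
    return n_nodes * node_energy(busy_s, makespan_s - busy_s, load)

# Doubling the allocation may shorten the runtime, but if the speedup is
# not perfect, the idle floor of the extra nodes offsets the savings
# (the lack of energy proportionality discussed above):
e_small = allocation_energy(10, busy_s=100.0, makespan_s=100.0)  # 200000.0 J
e_large = allocation_energy(20, busy_s=50.0, makespan_s=60.0)    # 216000.0 J
```

Under these assumed numbers, the larger allocation finishes 40% faster yet consumes more energy, illustrating why the prediction mechanism is useful to support such trade-off decisions.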
The prediction mechanism proposed in this chapter is particularly useful to support these types of decisions.

What is the impact of failures during the application execution?

Currently, the simulator does not account for failures in the task execution, either by crash or by another cause that may degrade performance and make the execution time of some tasks longer. A common approach to address this problem is adding replication for failed tasks, or for tasks that are taking long to finish execution [66], which could easily be incorporated into the workqueue [62] or the data-location-aware workqueue [142] already implemented in the simulator. Additionally, the simulator would benefit from receiving a model that captures faults for the target environment, to be considered during the prediction.

How accurate is the prediction mechanism when the intermediate storage is deployed on spinning disks?

So far, the focus has been on predicting performance when the intermediate storage is deployed over RAMDisks, since this is a common setup on large systems that rely on intermediate storage, as it improves performance; moreover, some platforms do not even have spinning disks (see Section 2.2.1). The storage service model does not capture history-dependent behaviour, thus it is expected to achieve lower accuracy predictions when the system is deployed over spinning disks. This issue can be addressed by using a more sophisticated model or simulator of the storage device, such as DiskSim [1], or a machine-learning approach as described by Crume et al. [61].

A preliminary evaluation, however, shows how the current (unchanged) model performs when using spinning disks, on testbed TB20, for the synthetic benchmarks described in Section 3.3.1. Figure 3.23 shows the results for the reduce pattern when using the medium and large workloads.
The key observation here is that, although prediction accuracy is lower, predictions are good enough to make the correct choice between DSS and WOSS – that is, the choice of whether to use the data co-placement optimization for each of the workloads (note that this optimization is beneficial in one case, and detrimental in the other). The pipeline benchmark on spinning disks shows results close to those using RAMDisks (Figure 3.4).

Figure 3.23: Actual and predicted performance for the reduce benchmark on spinning disks: (a) medium workload; (b) large workload.

The results when using TB101, which has newer machines than TB20, with MosaStore deployed on spinning disks are similar to those of the RAMDisks as well. Moreover, Figure 3.12 shows results for Montage using MosaStore deployed on spinning disks. The results for TB101 show the impact that newer disks, with higher performance and larger buffers, have on the predictions for these applications.

Additionally, Figure 3.24 presents the predicted run time for the pipeline benchmark running on top of the Ceph storage system deployed on spinning disks using TB101.

Hao Yang has been conducting an evaluation of spinning disks in the context of energy predictions.

How general is the proposed prediction mechanism? That is, can the prediction mechanism be used for other storage systems?

The prediction mechanism's goal is to model a generic object-based storage system and to have a system identification process that works entirely at the application level, to be easily portable across deployment platforms. While, so far, the predictor has been evaluated in depth for the MosaStore
system in its DSS and WOSS configurations, a preliminary experience gave encouraging results when using it to predict the relative application performance of the pipeline benchmarks on DSS for two other storage solutions, Ceph [164] and GlusterFS [65], for a limited number of scenarios. The experiments consisted of the pipeline benchmark running on top of these two systems. The traces were the same already used for the MosaStore predictions; no new traces were collected. The only input not shared among the systems was the seeding (see Section 3.2.4).

Figure 3.24 shows the actual and the predicted performance for the pipeline benchmark using Ceph deployed on spinning disks. This experiment also varies the number of nodes on TB101. Predictions for the average performance are within 15% of the actual performance. Additionally, Ceph has a high standard deviation, which places the predictions within the intervals defined by the standard deviations at all scales. For GlusterFS, the predictor obtained an error of around 30% on top of 100 nodes.

More importantly, this accuracy is "good enough" to compare the performance among those systems and MosaStore, capturing which one should provide better performance. Note that Al-Kiswany et al. [18] evaluate the performance of these systems and compare them to MosaStore in the context of workflow applications.

What should one do to use this solution?

The performance prediction mechanism needs to receive a description of both (i) the workload and (ii) the platform. The information related to the platform is coupled with the storage system and the network, requiring a new execution of the seeding procedure for a different platform – either a new network or a new set of machines.

The input related to the workload should be in the form of (i) a per-client I/O operations trace and (ii) the task-dependency graph for scheduling (see Figure 3.1).
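For illustration, the two workload inputs just described – a per-client I/O operations trace and a task-dependency graph – could be sketched as below. The field names and layout are assumptions made for the sake of the example; they are not the simulator's actual input format.

```python
import json

# Hypothetical sketch of the predictor's two workload inputs; all field
# names here are illustrative assumptions, not the simulator's real format.

# (i) Per-client I/O operations trace: an ordered list of operations.
trace = {
    "client-1": [
        {"op": "open",  "file": "stage1.in"},
        {"op": "read",  "file": "stage1.in",  "bytes": 1 << 20},
        {"op": "write", "file": "stage1.out", "bytes": 1 << 19},
        {"op": "close", "file": "stage1.out"},
    ],
}

# (ii) Task-dependency graph for scheduling: tasks plus producer/consumer
# edges; here task t2 consumes the file that task t1 produces.
dag = {
    "tasks": {
        "t1": {"client": "client-1", "outputs": ["stage1.out"]},
        "t2": {"client": "client-2", "inputs":  ["stage1.out"]},
    },
    "edges": [("t1", "t2")],
}

# Nothing above names a storage system or a deployment, so the same
# description can drive simulations of different configurations.
workload = json.dumps({"trace": trace, "dag": dag})
```

This mirrors the reuse described in the discussion items: the same trace and graph fed the MosaStore, Ceph, and GlusterFS predictions, with only the per-system seeding differing.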
Note that this input is not dependent on the storage system or on a specific deployment; it simply has to be in the format that the simulator expects. Thus, the same input can be used across different deployments or even systems. In fact, the experiments for Ceph and GlusterFS, in the discussion item above, used the same input, which had already been used for the MosaStore predictions.

Figure 3.24: Actual and predicted performance for the pipeline benchmark while varying the number of nodes on TB101 using Ceph as the storage system on spinning disks. The plot shows the average benchmark time for 20 rounds. Error bars represent standard deviation.

Moreover, the platform description for Ceph and GlusterFS, used as input to the performance predictor, was also based on a deployment of just three machines, as described in Section 3.2.4.

Can the same workload description be used for different versions of the same application?

The workload description is based on a trace of I/O operations. If the behaviour of the I/O operations changes, it may affect the performance of the application in a given configuration of the storage system. For example, if a new version of the application takes more advantage of data locality, the cache-hit ratio may increase, which reduces the amount of data transfers among the nodes. This reduction decreases the I/O demand on the storage nodes and can reduce the number of storage nodes needed to deliver the same performance. Additionally, the new version may also change the processing time of the application, which is used to predict the overall performance of the application.

In complex applications, however, it is likely that just a portion of the binaries changes, which affects only a subset of the stages.
In this case, the performance predictor would only need the trace from the stages that use the new binaries, to combine with the stages from a previously collected trace. Note that this assumes that the output files from these stages do not change. If the new binaries produce files with characteristics different from the previous version, then the subsequent stages should be re-executed to produce new traces.

Does changing the data sizes require a re-execution of the entire application? Or can one use the same traces to produce the workload description for different data sizes?

The results from this study show that the data size impacts the overall performance (e.g., total execution time), but it does not have a high impact on the overall relation among the different data-placement options. For example, the results from the reduce benchmark for the medium and large workloads have different execution times, but the gains of using WOSS over DSS are proportionally similar.

For other configuration options, however, the actual best configuration choice may be different even while the overall behaviour of the performance curve looks similar, just shifted. As an example of this scenario, consider the results for the Montage large workload obtained from TB101 and those obtained on TB20, in Figures 3.12 and 3.14. In both cases, the overall shape of the graph shows decreasing performance gains as the scale of the system grows. The actual values for the number of nodes offering the best performance, or the actual application time, however, differ significantly.

Chapter 4

Using a Performance Predictor to Support Storage System Development

This chapter presents an experience of using the prediction mechanism beyond its original design goal: using the performance prediction mechanism described in Chapter 3 to better understand system behaviour and debug MosaStore (Section 2.1).
Specifically, the developers of MosaStore compare the predictions to the actual performance and decide whether they are close enough to their goal1. When they are not close enough, the developers proceed to debug the system, and follow this process iteratively as part of its development. The predictor is also used to evaluate the potential gains of new system features.

Overall, the predictor is useful for setting goals for the system's performance, despite the difficulties of properly seeding the model and mimicking the benchmarks used to evaluate its performance. Specifically, this approach gives confidence in the implementation of MosaStore for some scenarios and points out situations that needed further improvement. The latter makes possible both an improvement of the system's performance by up to 30% and a decrease of 10x in the response-time variance.

1Experience with Applying Performance Prediction during Development: a Distributed Storage System Tale [59]. Lauro Beltrão Costa, João Brunet, Lile Hattori, and Matei Ripeanu. In Proceedings of the 2nd International Workshop on Software Engineering for High Performance Computing in Computational Science and Engineering, SE-HPCCSE '14. Pages 13–19. IEEE, November 2014.

In presenting this experience, this chapter sheds light on two higher-level questions: (i) What are the potential benefits that the proposed performance predictor can bring to the software development process of a storage system? and (ii) What are the limitations and challenges of using a performance predictor as part of the development process in the presented use-case?

The rest of this chapter is organized as follows: Section 4.1 presents the motivation for having a baseline for the expected efficiency of distributed systems.
Section 4.2 advocates for the use of a performance predictor as an approach to setting the efficiency baseline, in order to support the development of distributed systems, and presents the methodology of the use-case described in this chapter. Section 4.3 shows how the performance predictor can support development by describing its use in the development of MosaStore. Section 4.4 summarizes the main problems faced by the performance predictor approach during this study. Section 4.5 discusses the experience of applying performance prediction and highlights the benefits and limitations. Section 4.6 presents related work specifically in the context of predicting performance to support software development. Finally, Section 4.7 presents concluding remarks regarding this study.

4.1 Motivation

Efficiency, in terms of the resource usage required to deliver 'good', or simply close to optimal, performance is an important criterion for the success of a system. In particular, for large-scale computer systems, which are likely to be used by several clients and to handle a variety of workloads, this non-functional requirement is crucial to achieve success and encourage adoption. High performance must be reached while keeping the cost low: using the least possible amount of resources while delivering the fastest time-to-solution. Because of its importance, efficiency should be addressed from the early stages of software development [24], especially given that maintenance and debugging costs increase over time [165].

The current state of the practice in addressing performance consists of analyzing a system by employing profilers to monitor its behaviour [24]. During the analysis, these profilers should pinpoint the regions of the code where execution takes longer and that, therefore, should receive more attention and be optimized.
Although profiling-based optimization is undoubtedly a key part of the development of complex systems, deciding when the efficiency of a system has reached a "good enough" level, where extra effort is unlikely to render extra benefits, is still a challenge.

In this context, where optimal performance is crucial for the functioning of the system, developers should be able to rely on tools that provide an estimate of the expected performance (i.e., a baseline) and compare it to the actual performance while developing the system. This process is complex, and even more so for distributed systems, since there are more components and different interactions compared to systems hosted on a single node (e.g., failure and retry operations to tolerate network failures). This process would be conceptually similar to the use of automated tests to check functional requirements, but with the tool being a system's performance predictor targeting a specific efficiency level. Ideally, the tool should also provide information about where the system is losing performance, similar to when automated tests identify a part of the code that is failing for a specific functional requirement.

Note that the use of a performance predictor in this scenario is different from its common use. Typically, performance predictors rely on monitoring information from an already deployed system to support configuration decisions (see Section 2.3), capacity planning, or online provisioning adaptation during a deployment-planning or post-deployment phase. The scenario presented in this chapter advocates for the usage of a performance predictor as a tool incorporated into the software development phase in order to provide a performance baseline that guides the effort to improve performance.
4.2 Case Study

This chapter discusses the importance of using performance predictors to develop complex software systems by narrating an experience in applying a specific performance predictor in the development of a complex and error-prone distributed storage system. It shows how the performance predictor helped developers to properly handle the interaction of a potentially large number of distributed components in a storage system, while avoiding compromising efficiency due to the complex interactions of these components.

In addition to presenting the gains of applying the performance predictor in this use-case, this chapter also discusses the main problems faced during this study. To this end, we2 proposed and followed the use of the performance prediction approach during the development of a distributed storage system.

The rest of this section presents a brief summary of both MosaStore, the system used in this study (Section 4.2.1, detailed in Section 2.1), and the design of the performance predictor (Section 4.2.2, detailed in Chapter 3), while highlighting the aspects relevant to the rationale of applying the performance predictor during the development process.

4.2.1 Object System

This study focuses on MosaStore [14], a distributed storage system that can be configured to enable workload-specific optimizations. It is mainly implemented in C, and has approximately 10,000 lines of code. To this date, more than fifteen developers were involved in different versions of its implementation. Section 2.1 presents MosaStore in detail.

2When I use we in this chapter, I refer to the people involved in the development of MosaStore as well as two software engineering researchers, João Brunet and Lile Hattori, with whom I have discussed the goal of this study and who helped to set the goals. During this case study, none of us were actively involved in the coding of MosaStore.
Specifically, MosaStore's developers were Emalayan Vairavanathan, Samer Al-Kiswany, and Hao Yang.

MosaStore Architecture

MosaStore employs a widely-adopted object-based storage system architecture (such as that adopted by GoogleFS [82], PVFS [85], and UrsaMinor [10]). This architecture includes three main components: a centralized metadata manager, storage nodes, and a client-side SAI, which uses a VFS via the FUSE [2] kernel module to implement a user-level file system that provides a POSIX API in MosaStore. Section 2.1.1 and Figure 2.2 present MosaStore's architecture in detail.

MosaStore Configuration Versatility

More relevant to this study, MosaStore can be configured according to a multitude of options (Section 2.1.2). These possible configurations can result in a number of different deployments of the system, where storage nodes and SAIs interact in different ways, with different overheads, leading to different levels of performance. In this context, the developers of MosaStore initially employed profiling analysis during its development, but in an ad hoc manner; there was no baseline for the best possible performance, and no indicator of when they should have started or stopped the profiling process. This scenario brought a challenge to MosaStore's developers, one present in many other systems: for a given configuration, what performance should an implementation deliver to be considered efficient?

4.2.2 Performance Predictor

To support MosaStore's development in the performance-improvement phase, we decided to employ the performance predictor as part of the development process. This section describes the general reasoning behind using the performance predictor as a mechanism to provide a performance goal for the system in the context of development.

Local Deployment: A Simple Scenario

In its simplest configuration, MosaStore uses a single machine. By deploying all three components locally, the system should not face efficiency loss
related to the coordination of distributed components, nor due to contention on the network substrate.

The expected ideal performance for such a scenario can be set by external tools that serve as benchmarks for local file systems, or even by simple analytical models. MosaStore developers already used this approach, which served as a simple "sanity" check for the system's performance. For example, a scenario where a single SAI uses a manager and a storage node deployed on the same machine should provide a sequential write throughput similar to the throughput of a write to a local file system. For MosaStore, the performance baseline to be achieved was set based on the performance of a different local file system – the Third extended filesystem (EXT3), in this case.

Distributing the System Components

As the number of components in the system grows, predicting the performance of a distributed system accurately and in a scalable manner becomes challenging. Analytical models that can capture different applications' workloads – for the contention that the distributed environment imposes on the storage nodes and on the manager, or for the network performance – become increasingly complex. Moreover, there are no tools that can be used as a baseline, and a similar system would play the role of a competitor, which would not necessarily indicate the best performance that the system can deliver in a given setup. One could indeed use a competitor system to set the goal for the performance of a system, but this approach still misses the point of informing the developers that there is more to be extracted from the system.

The predictor's approach for the distributed configurations builds on the single-machine scenario. The predictor models the part of the overhead of the distributed system that is inherent to the distributed environment (e.g., data transfers or network sharing), while another part is directly related to the implementation and deployment environment (e.g., resource locking or network packet loss). To capture the inherent interactions of the distributed environment, the predictor uses a queue-based storage system model (Section 3.2.3), which is not affected by performance anomalies
To capture the inherent interactions of thedistributed environment, the predictor uses a queue-based storage systemmodel (Section 3.2.3), which is not affected by performance anomalies1104.2. CASE STUDYcaused by the implementation.The system’s performance in the single machine scenario seeds thedifferent components of the model (SAI, storage module, and manager),capturing the performance of the system components when there is noefficiency loss caused by the distributed environment. Finally, a utility toolseeds the performance of the network part of the model.Predictor Input/OutputThe predictor requires three inputs: (i) the performance characteristics ofthe system’s components, (ii) the storage system configuration, and (iii)a workload description. The performance characteristics of the system’scomponents used to seed the model are based on the single-machinescenario. Not present in the single-machine case is the network substrate,which is characterized via a utility tool to measure network throughput andlatency (e.g., iperf [128]). Section 3.2.4 details the procedure for seedingthe predictor’s model. Chapter 3 presents a complete discussion of thepredictor design, its input, the model seeding process, and an evaluation forits response time, scalability, and accuracy.Intended Usage ScenarioThe predictor instantiates the storage system model with the specific com-ponent characteristics and configuration, and simulates the application runas described by the workload description. When the simulation concludes,it provides an estimate of the performance of the system for a given setup.The developers of the system actually deploy and run the system, inthe same setup used to obtain the prediction, and measure the actualperformance. They then compare the estimate for the simulator to the actualperformance and decide whether they are close enough to their goal. 
If they are not close enough, they pursue more optimizations by debugging the system, and follow this process iteratively as one phase in the development cycle (see Figure 4.1).

Figure 4.1: The use of the performance predictor as part of the development cycle. The developers have a new phase in which they run a benchmark (i.e., a performance test) and compare it to the performance predictor's results to verify whether or not the performance is acceptable.

Note that, with the predictor, the developers can obtain the expected performance at the granularity of a module (i.e., client, storage, or manager) and main operations (e.g., read and write), which is the granularity captured by the model in the predictor (see Section 3.2.3). The developers can then track the mismatch per module, as summarized in the cases described in Section 4.3.

4.3 Experience Using the Predictor during Development

As part of the development cycle of MosaStore, synthetic benchmarks that mimic real workflow applications' access patterns [159], as well as real applications, run on top of the system in order to evaluate its performance. These synthetic benchmarks represent worst-case scenarios in terms of the storage system's performance, as they are composed exclusively of I/O operations, which are intended to create contention and expose the overhead of the storage system.

The predictor estimates performance metrics for the same environment used to evaluate the performance of the storage system. As explained in the intended usage scenario (Section 4.2.2) and shown in Figure 4.1, the performance predictor was included in the development process.
That is, the developers execute a benchmark and instantiate the predictor to mimic exactly the same set-up in terms of workload, scale, and configuration as that benchmark. After executing both, the developers compare the metrics of the predicted and actual performance.

In a number of cases, the predicted and actual performance were close, providing confidence that the implementation was efficient. There were cases, however, in which the actual and predicted performance differed significantly; these cases highlighted complex performance-related anomalies, leading to a debugging effort. This section presents three of these cases of performance anomalies by describing how they affected the system. Specifically, it describes how the predictor identified the performance anomalies, how it was useful to address them, and how the fixes impacted the system performance3.

The cases presented here focus on two benchmarks used in Chapter 3 and summarized here:

Pipeline benchmark. A set of compute tasks are chained in a sequence such that the output of one task is the input of the next task. 19 application pipelines, one for each machine, run in parallel and go through three processing tasks.

Reduce benchmark. A single compute task uses input files produced by multiple tasks. In the benchmark, 19 processes run in parallel on different nodes; each consumes an input file and produces an intermediate file. The next stage consists of a single task reading all intermediate files and producing the reduce-file.

Experimental Setup. The benchmarks run on a testbed of 20 machines, each with an Intel Xeon E5345 2.33GHz CPU, 4GB RAM, and a 1Gbps NIC.

3The accuracy of the predictor is not the focus of this chapter, which rather focuses on how the tool can be useful for supporting the development process. Chapter 3 presents a deep evaluation of the accuracy of the predictor using synthetic benchmarks and real applications, including a detailed presentation of each benchmark.
MosaStore's manager runs on one machine, while the other 19 machines each run both a storage node and a client access module. The plots in Figure 4.2 show the average turnaround time and standard deviation for at least 15 trials, which guarantee a 95% confidence interval with a relative error of 5%, according to the method described by Jain [98].

To make the impact of these improvements clear, Figure 4.2 shows the average time for the two benchmarks and different versions of the system. Figure 4.2a has four bars: one representing the time obtained from the initial version of the storage system, two showing the versions of the system after fixing the performance anomalies described in this section, and the last showing the predicted time. Figure 4.2b includes only one of the performance anomalies, since the other one does not impact this benchmark. Overall, the use of a performance predictor allowed improvements of up to 30% in the execution time of the benchmarks, and decreased variance by almost 10x in some scenarios, by directing the developers in a debugging effort.

4.3.1 Case 1: Lack of Randomness

Context. Whenever a client creates a file, it contacts the manager to obtain a list of storage nodes that will be used during the file write operation. The number of storage nodes is determined by the stripe-width configuration parameter. The order in which the storage nodes will be used to write is determined by the manager, and should be shuffled according to a random seed.

Problem. The manager used a constant seed to shuffle the list of storage nodes returned. Therefore, when the system was set to use the maximum stripe-width, all clients obtained the list in the same order, issuing write operations to the storage nodes in the same order. This created temporal hot-spots where the first storage node received connections from all clients, while the other storage nodes were idle.

Detection.
When developers are interested in obtaining the response time for a given component of the system, they can simply obtain this by turning on a log option that measures the time from the reception of a request until its response at the component acting as a server for that request.

Figure 4.2: Impact of fixing performance issues on (a) the pipeline benchmark and (b) the reduce benchmark. The ‘Rand_Lock’ bar shows results for fixing two performance anomalies together (Sections 4.3.1 and 4.3.2). Section 4.3.3 presents the case plotted as ‘Timeout’.

In fact, this functionality helped in this case and was key in detecting the problem for the next two cases. By studying the behaviour of the system from the logs, the developers stated a hypothesis for the problem, fixed it, and executed the benchmarks again.

Fix. The fix for this problem consisted of changing the algorithm that shuffles the list of storage nodes, so that it uses a different seed every time it is invoked, providing the necessary randomness for node allocation.

Impact. The performance improvement can be seen in the Rand_Lock bar in Figure 4.2a, which combines this and the next case.

4.3.2 Case 2: Lock Overhead

Context. Whenever a client reads or updates metadata information (e.g., opens or closes a file), it contacts the manager, which needs to lock multiple updates over the metadata in order to avoid race conditions.

Problem. The developers of the initial version of the system chose a conservative approach, opting to lock the large code blocks that are called during the client invocation, rather than locking only the critical regions.

Detection. The predictor showed a mismatch between the actual and predicted time for the pipeline benchmark.
Based on the manager's actual service times, the developers could verify that a few requests to the manager took considerably longer than the average. Once they spotted the problem in the manager, they started a debugging process to locate the parts of the code that took longer, and detected large and unnecessary lock scopes.

Fix. Reduce the lock scope.

Impact. The performance improvement can be seen, combined with the lack-of-randomness case, in Figure 4.2a. The overall gain was around 2.5 seconds on average. Note that the variance also decreased.

4.3.3 Case 3: Connection Timeout

Context. Similar to the way in which the predictor helped us revisit assumptions about the implementation of the system, it also pointed out issues related to the middleware stack that the storage system relies upon. If a client contacts a server to establish a TCP connection, it waits for a specific timeout before trying again, if the original attempt fails. Note that this concerns establishing the connection, not the TCP window management that occurs once the connection is established.

Problem. In cases where too many clients tried to send data to a storage node at the same time, the storage node dropped some SYN packets (the packets used to establish a TCP connection) because there were too many packets in its queue. The client then waited for a timeout defined by the TCP implementation, which is 3 seconds, consequently taking the average time to write the data in the benchmark to a much higher value than expected.

Detection. The developers also used the system logging to verify the service time of each component in this case. The predictor provides similar information, which was compared against the system's logs in order to identify when there was a mismatch between the time predicted for the storage nodes to finish a request and the actual time.
Indeed, the system logs showed an actual longer waiting time.

In this case, however, obtaining service times for each request did not help much, because the problem actually prevented the request from reaching the storage component in the first place. To find the gap between predicted and actual time, the developers added new functionality that logged service time at the client side. By collecting these times and analyzing them, the developers could see that most of the requests were processed around the time predicted, and the other requests were processed three seconds later. By debugging the code, the developers found that this gap happened in the network connection phase of the request, and discovered how the TCP implementation could lead to this case.

Fix. The developers decided to define their own timeout instead of relying on the system's default. To do so, they needed to change the implementation from using blocking to non-blocking sockets during the connection phase.

Impact. The performance improvement can be seen in Figure 4.2a and Figure 4.2b. In this case, the gain affected the Reduce benchmark (Figure 4.2b) more heavily due to its data flow, where several machines write to just one.

4.4 Problems Faced

The limitations of using a separate model to capture the actual implementation behaviour are well-known, and well-captured in a sentence by G. E. P. Box: "Essentially, all models are wrong, but some are useful". Indeed, properly capturing the system's behaviour to provide useful performance estimates, or to correctly define the deployment to be simulated, can be challenging.

In some cases, the system analysis was affected by simplifications in the model, shortcomings of the seeding process, or incorrect assumptions about the deployment platform. The following list describes the four main problems faced:

Workload description mismatch. The predictor receives a collection of I/O operations based on the log of a benchmark's execution.
These benchmarks launch processes on specific nodes via ssh (secure shell). The actual time to launch one round of processes in the cluster varies from 0.1 to 0.3 seconds. This variance is not related to the storage system, but it affected the comparison between the actual and the predicted time. After discovering this variance, we added it to the workload description to be simulated.

Platform Mismatch. During the execution of the benchmarks, we realized that one of the machines used in the experiments was considerably faster than the others. In this case, therefore, we had specified a deployment environment that was not the one actually used. Running the seeding process on the fast machine, and specifying a more accurate deployment environment for it, properly fixed the problem.

Modeling Inaccuracy: lack of local priority for reads. In cases where there are several replicas of a chunk in the system, the simulated client selects one randomly. The actual storage system, however, gives priority to chunk replicas located on the same machine. The simulator did not capture this priority, and it led to mismatched predictions in cases with higher replication levels. This happened in the initial stage of the simulator implementation, but unit tests captured the problem.

Modeling Inaccuracy: connection timeout. The connection timeout problem described in Section 4.3.3 shows a case where the implementation did not handle establishing connections properly. The implementation was changed to avoid long waits. Nevertheless, a remote machine receiving more requests to open connections than it can handle would still drop some of the requests, which may increase response time. This situation describes a scenario where the model does not capture the actual behaviour of the deployed system. We consider this problem to be a result of the environment.
Therefore, we did not extend the model to include this behaviour and adjust the prediction, since the predictor should provide accurate performance in the absence of implementation-related or environmental issues.

4.5 Discussion

This section discusses the use of a performance predictor as a tool to facilitate the development of complex systems, with the goal of calling the community's attention to the value of producing such a tool as the one presented in Section 4.2.

Can a performance predictor bring benefits to the software development process?

The case study presented in this chapter shows the potential usefulness of having a performance prediction tool that can set a performance target for a system in deployments with different configurations and scale. The predictor brings confidence in the results obtained in several scenarios, is successful in pointing out scenarios that needed improvement, and can support the improvement effort. In fact, the performance of the system improved by up to 30%, and the response time variance decreased by almost 10x in some scenarios (e.g., the case in Section 4.3.3), as a result of applying this approach.

We believe that applying a performance predictor in the development of other complex systems would also be beneficial. Similar to back-of-the-envelope calculations, the predictor indicates the bounds of expected performance for a given system. A predictor can, however, take back-of-the-envelope calculations a step further, because the model it uses provides building blocks to guarantee its usefulness in complex scenarios where back-of-the-envelope estimates are intractable or inaccurate. Not only can developers use a predictor to obtain a baseline to detect performance anomalies, but also to evaluate the potential gains of implementing new
complex optimizations, or to study the impact of a faster network and nodes on a system.

What are the limitations and challenges of using a performance predictor as part of the development process?

Overall, the problems that we faced using this approach can be split into two classes: (i) having accurate performance predictions, and (ii) its use during the development phase.

The first class covers problems related to performance predictors in general [24], such as having an accurate model and proper seeding. Section 4.4 describes the problems encountered during the case study. Additionally, the approach we apply has one main challenge: to define the baseline that sets the performance goal.

Since we were interested in a distributed system, we targeted this challenge in two phases, as described in Section 4.2.2: First, we use a scenario focusing on a local deployment of all components, which has its performance goal set by comparing the performance to a similar competitive system. Second, we use a distributed scenario where the performance goal is given by a queue-based model. This approach of handling the complex distribution case by building the distributed scenario on top of a single-machine scenario was key to the success of the predictor's use.

Other approaches, such as microbenchmarks and analytical models, can be used to define these baselines. The proper choice, however, is system-dependent, as each case and approach have their own trade-offs.

The second class of problems is related to how developers use such a tool. In our experience, after the initial month of use, developers started skipping the prediction tool as part of the cycle.
During the initial phase of applying the tool to the development process, the developers could discover several problems, and this led to several improvements of the system. After this initial phase, the gains of frequently running the tool and the actual benchmarks decreased, since most of the problems were addressed.

This problem is similar to having a suite of automated tests that takes too long to execute. In these situations, the test suite tends to be split into different suites, one to be used often by the developers and another (or several) to be used as pre-commit tests or during nightly builds. Hence, we advocate that the predictor tool should be used as part of a large suite of automated tests. In fact, we believe that developers should also specify a percentage of tolerated overhead over the predicted performance in order to define acceptance tests as part of this large suite of automated tests.

An additional challenge shared among all the performance prediction approaches is to define how much overhead is acceptable on top of the baseline performance. In all cases, the developers set 10% as the minimal threshold to stop the debugging process, but they pursued a performance mismatch in all cases (even for cases that were already within the 10% threshold) and defined when to stop in an ad hoc manner after reaching at least a 10% mismatch. The proper choice remains an open question. In fact, we believe it is case-dependent, since it depends on an analysis of the trade-off between the estimated future development effort to fix the mismatch and the potential performance gain brought by a fix. The prediction tool, however, still serves to guide this decision.

4.6 Related Work

This section relates the approach described in this chapter to other attempts to integrate performance prediction/analysis into the software development process.
The goal of this section is to summarize how different techniques are used to improve efficiency during software development. Past work (e.g., the "Performance by Design" book by Menasce et al. [116]) discusses techniques with similar goals for the design phase, before software development, which is out of the scope of this chapter.

Balsamo et al. [24] conducted a survey of model-based performance prediction at software development time. According to them, the first attempt to integrate performance analysis into the software development process was conducted by Smith [149] and was named the Software Performance Engineering (SPE) methodology. SPE is a generic methodology that relies on software and system execution models to specify resource requirements and conduct performance analysis, respectively. Our main goal, in contrast, is to have a predictor to set a goal for performance, although it can also assist in performance analysis and debugging.

Over the years, some approaches were proposed based on this SPE methodology [51, 150, 168]. Most of them analyze Unified Modeling Language (UML) diagrams (e.g., class, sequence, and deployment diagrams) to build software execution models to achieve performance prediction. For example, Williams and Smith [168] employed Class, Sequence, and Deployment diagrams enriched with a Message Sequence Chart to evaluate the performance of an interactive system designed to support computer-aided design activities. They described their experience in using these architectural models to verify performance requirements during software design in order to support architecture trade-off decisions. Bruseke et al.
[40] use a Palladio-based component description that contains a contract for the performance. They use this contract to estimate the performance of a chain of components, and to perform blame analysis when the composition violates the contract. Cortellessa and Mirandola [51] also used UML diagrams to generate a queueing-network-based performance model. However, they prioritized State transition and Interaction diagrams as a source of information.

There are two main differences between these approaches and the one described in this chapter: First, our main goal is to have a predictor to set a goal for performance, instead of only understanding what the performance of a given implementation will be. Second, these approaches require UML diagram analysis to build Queuing Network-Based (QN) models, while we build them from scratch with a coarser granularity (main system components).

On the one hand, the use of UML allows the developers to automate some steps of model specification, such as the specification of execution paths once the UML diagrams are done. On the other hand, system documentation, such as UML diagrams, may not be accurate and tends to be neglected over time [107, 132]. By focusing on the main components, the approach presented in this chapter avoids the effort of keeping the UML model updated and accurate.

Past work has proposed solutions for improving efficiency during the development cycle by detecting when the introduction of new code to add features or fix bugs negatively impacted performance. For instance, Heger et al. [89] proposed an approach to: (i) detect performance regressions during software development and (ii) isolate the root cause of such a regression. The former is achieved by employing unit testing over the history of the software under analysis and comparing it to the performance results to uncover possible performance regression introductions.
The latter is achieved by applying systematic performance measurements based on call-tree information extracted from the unit tests' execution. Besides controlled experiments, the authors also present their successful experience investigating a performance regression introduced by the developers of the Apache Commons Math library.4

Similar to this work, Heger et al. [89] proposed an approach that advocates for the introduction of performance evaluation into the software development cycle at a code-commit granularity. Our work, however, focuses on performance prediction, rather than detecting possible bottlenecks introduced by code changes. In summary, we are concerned with how the system will perform given a specific workload scenario before the feature development, while Heger et al. try to detect a performance regression as soon as it is introduced into the code.

4.7 Concluding Remarks

Every tool used during software development reflects a deliberate decision based on the trade-off between the cost and the benefits of employing such a tool. Hence, stakeholders need to gather information to support their choices. In this context, this chapter advocates for the use of performance predictors to develop complex software systems by narrating the experience of applying a specific performance predictor in the development of a complex and error-prone distributed storage system.

Overall, the approach of using a predictor was useful for setting goals for the system's performance, despite the difficulties of properly seeding the model and mimicking the benchmarks used to evaluate its performance. This chapter shows how the performance predictor helped developers to provide an efficient implementation of a distributed storage system. Specifically, this approach pointed out situations that needed further improvement, which rendered up to a 30% performance improvement and a 10x decrease in the response time variance for some specific scenarios.
It also increased confidence in the implementation when the actual performance matched the predicted one (Chapter 3).

Based on the experience described in this chapter, we recommend the use of performance predictors during software development to help developers deal with this non-functional requirement, similar to how automated tests verify functional requirements. In particular, when developing a large-scale and high-performance system, in which performance is a key concern, we believe that a predictor is useful to detect potential performance problems early and, consequently, to reduce the effort required to remove performance bottlenecks.

Additionally, I suggest that the results described in this chapter can encourage researchers in the software engineering community to verify the generalization of the benefits observed in this use-case by studying a broader set of systems.

Chapter 5
Automatically Enabling Data Deduplication

To better understand the challenges entailed in automating storage system configuration at runtime, this chapter focuses on one optimization, namely enabling online data compression through similarity detection in the context of checkpointing applications. Specifically, this chapter presents a study that uses a control loop to automatically enable or disable data deduplication, with two key metrics forming the optimization criteria: (i) to improve writing time1 and/or (ii) to reduce energy consumption2.

Overall, the proposed solution correctly configures the storage system to enable or disable data deduplication with small overhead (negligible extra memory and 3% extra time) and small error for the target metrics; the cost of a misprediction is as low as 5%.

The rest of this chapter is organized as follows: Section 5.1 presents a motivation for targeting data deduplication in the context of checkpointing applications. Section 5.2 presents a high-level view of the architecture

1 Towards Automating the Configuration of a Distributed Storage System [52].
Lauro Beltrão Costa, Matei Ripeanu. 11th ACM/IEEE International Conference on Grid Computing, October 2010.
2 Assessing Data Deduplication Trade-offs from an Energy Perspective [54]. Lauro Beltrão Costa, Samer Al-Kiswany, Raquel Vigolvino Lopes, Matei Ripeanu. Workshop on Energy Consumption and Reliability of Storage Systems, in conjunction with the International Green Computing Conference, July 2011.

used, and Section 5.3 describes an instantiation of this architecture as a proposed solution. Section 5.4 describes one of this work's important contributions; it focuses on energy consumption by first studying the impact of data deduplication on energy consumption and then extending the proposed solution to include this metric. Section 5.5 briefly compares the work described in this chapter with past work. Finally, Section 5.6 presents a summary and a discussion of the results presented in this chapter.

5.1 Motivation

Checkpointing means persistently storing snapshots of an application's state. These snapshots (or checkpoint images) may be used to restore the application's state in case of a failure, or as an aid to debugging.

Depending on the checkpointing technique used, the application's characteristics, and the time interval between checkpoint operations, checkpointing may result in large amounts of data to be written to the storage system in bursts. These bursts provide an opportunity for the intermediate storage system approach (also known as a burst buffer [112]) to avoid waiting for the backend storage system during these writes. Additionally, and more important in the context of automation, successive checkpoint images may have a high degree of similarity that data deduplication techniques can leverage to improve performance.

Data deduplication [11, 76, 109, 125, 138, 172] is a data compression storage technique that eliminates redundant data.
This technique provides a trade-off between computing power and the amount of data: it consumes additional CPU cycles to detect data similarity and, in return, reduces the amount of data stored.

When the system performs inline data deduplication, i.e., the storage system deduplicates the data while the application is writing it to the storage system, it can reduce the number of bytes sent over the network and the number of I/O operations, as well as the amount of data to be written to the storage system. However, this does not necessarily imply faster write operations or lower energy consumption, since the cost of chunking the data and creating the chunks' identifiers is expensive and, therefore, it is not clear that the performance cost is paid off (see Section 2.2.4).

Al-Kiswany et al. [12] demonstrate that similarity can be effectively detected at the storage system level (i.e., without application support) for checkpointing applications using data deduplication, and that this storage technique offers a performance improvement in some cases. This past work, however, leaves a gap: it relies on the system administrator or user to configure the data deduplication technique based on her knowledge about the checkpointing technique used by the application, the application characteristics, and the checkpointing interval.

Even when the user has all the necessary information, manual, static tuning can be undesirable.
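To make the mechanics of inline deduplication described above concrete, the sketch below shows a minimal write path: the data is split into chunks, each chunk is identified by its hash, and only chunks not seen before are stored (and would be sent over the network). This is an illustration only, not MosaStore's implementation; the fixed-size chunking, the use of SHA-1 as the identifier, and the in-memory dictionary standing in for the storage nodes are all assumptions made for the example.

```python
import hashlib

def dedup_write(data, chunk_size, store):
    """Inline-deduplication sketch: split `data` into fixed-size chunks,
    identify each chunk by its SHA-1 hash, and store only unseen chunks.
    Returns the number of bytes actually written (the unique data)."""
    written = 0
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        key = hashlib.sha1(chunk).hexdigest()  # chunk identifier: the CPU cost of deduplication
        if key not in store:                   # a similar chunk was not stored before
            store[key] = chunk                 # only unique data is stored (and transferred)
            written += len(chunk)
    return written
```

For two successive, highly similar checkpoint images, the second call writes only the chunks that changed; the CPU time spent hashing every chunk is the price paid for the reduction in bytes stored and sent, which is exactly the trade-off the automated configuration must weigh.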
In a scenario where the checkpoint frequency varies (e.g., as used for debugging), the proposed solution focuses on dynamically (i.e., at runtime) enabling similarity detection only when the application produces checkpoints with high similarity, and disabling similarity detection otherwise.

Similar to the workflow applications and the solution proposed in Chapter 3, this chapter presents a solution that relies on a repetitive pattern and on a prediction mechanism.

Unlike the solution proposed for workflow applications (Chapter 3), this chapter focuses on a solution predicated on one key assumption: the existence of repetitive operations throughout the application lifetime of a checkpointing application [12], and a small variance in the performance, or data similarity, of these operations. This characteristic gives the opportunity to compare the effects of various configuration options to estimate an optimal, or simply good enough, configuration during runtime. In this case, it satisfies the requirements described in Section 1.2 by using a control loop that relies on a simple, yet effective, performance model.

Additionally, past work on data deduplication does not analyze its impact on energy consumption, which has increasing importance in the context of high-performance computing systems (see Section 2.2.2). This study, to the best of my knowledge, is the first to study the impact of data deduplication on storage system energy consumption [54].

Goal.
The high-level question that this chapter addresses is the following: "What are the Challenges and Efficacy of an Automated Solution for the Online Configuration of an Intermediate Storage System?"

To better understand the challenges entailed by the online automation of storage system configuration, this chapter presents a preliminary study focusing on data deduplication, in the context of checkpointing applications, using writing time, energy consumption, and storage space as optimization criteria.

The progress in the direction of enabling data deduplication not only sheds light on the challenges faced in automated tuning, but also has direct applications of practical value. Current operational practice when running checkpointing applications requires a wealth of information about the checkpointing characteristics (e.g., checkpointing technique, or frequency) in order to configure the storage system. Automating the choice of enabling or disabling data compression via deduplication allows decoupling the concerns of the application's scientist or developer from those of the storage system operator.

The measurements from this study demonstrate that the developed prototype correctly configures the storage system to enable or disable data deduplication, with minimal overhead and minimal error for the target metrics. Additionally, it shows that, in the context of data deduplication, optimizing for energy consumption and for writing time become conflicting goals as hardware becomes more power-proportional (i.e., as hardware draws power proportionally to its usage load).

5.2 Architecture

A solution that automatically enables or disables inline data deduplication should meet the requirements listed in Section 1.2. In this context, they are:

• to reduce or eliminate the need for human effort to configure the storage system,
• to provide performance that is close to the user's intention,
• to keep the overhead cost of automating the configuration low.

Figure 5.1: Control loop based on the monitor-control-actuate architectural pattern.

The existence of repetitive operations throughout the application lifetime of a checkpointing application [12] enables the use of a monitor-control-actuate loop. This approach continuously adjusts the configuration of the storage system to optimize its performance according to a user-specified optimization goal (Figure 5.1).

The closed loop consists of a monitor, which constantly monitors the behavior of the storage system. The actual performance metrics monitored depend on the target optimization goal and on the active configuration at the moment. For the target storage system, the monitored metrics can include: the amount of data sent through the network, the amount of data actually stored, the response time for an operation, the compute or memory overhead at the client machine, and the network latency.

The monitor passes these measurements to the controller, which analyzes them, infers the impact of its previous decisions, and may dispatch new actions to change the system configuration to the actuator. To perform its tasks, the controller needs: (i) a utility function that captures the user's intention, and (ii) a prediction mechanism that estimates the impact of its future actions on the user's goal.

The utility function is an equation that receives different performance metrics and produces an output according to the user's intention. The function's goal is to reduce the multidimensional space of the monitored metrics to just one dimension that guides the optimization effort.
That is, the goal of the controller is to maximize the utility.

Initially, this study assumes the user focuses on a single metric to optimize upon (e.g., minimize the write time for I/O operations), which can be given by a simple function such as U(Tw) = -Tw, where Tw is the writing time. More complex definitions, however, are possible [152]: the user can, for example, define that she tolerates the write time using data deduplication being up to 20% longer than when using the default configuration, if the amount of data stored is reduced by more than 30%.

The prediction mechanism estimates the impact of a specific change in configuration on the target performance metrics, given the current state of the system. For example, for a storage system that can save space by compressing data, the predictor receives estimates of the achievable compression rate, and predicts the impact of compression on the time required for write operations and on the space saved.

Finally, the controller decides whether configuration changes are needed and communicates its decision to the actuator. This decision consists of estimating which configuration provides the best utility, based on the currently delivered utility and state, the accumulated past history of changes, and the estimates of the utility to be delivered with new configurations.

The actuator performs the configuration changes as instructed by the controller. While this architecture and a number of its components (such as the controller, the means to express utility, and the performance models) are generic, some of the other required components are specific to each storage system supported (e.g., the actuator, the monitor) or to each application context, as is the case for the specific utility function and the predictor.
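The two utility definitions discussed in this section can be made concrete. The sketch below is a hypothetical encoding, not code from the dissertation: it expresses both the single-metric function U(Tw) = -Tw and the richer intent of tolerating writes up to 20% slower than the default configuration when deduplication reduces the stored data by more than 30%.

```python
NEG_INF = float('-inf')

def utility_simple(t_write):
    # U(Tw) = -Tw: maximizing the utility minimizes the write time.
    return -t_write

def utility_tradeoff(t_write, stored, t_write_default, stored_default):
    """Richer intent: a configuration is acceptable only if it is at most
    20% slower than the default AND stores at least 30% less data;
    among acceptable configurations, prefer larger space savings."""
    if t_write <= 1.2 * t_write_default and stored <= 0.7 * stored_default:
        return stored_default - stored  # utility grows with the space saved
    return NEG_INF                      # the configuration violates the user's intent
```

The controller then simply picks the configuration whose predicted metrics maximize the chosen utility function, collapsing the multi-metric comparison into a single number as described above.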
5.3 Control Loop for Configuring Similarity Detection

The goal of this exploration was to automate the choice between two configuration options: data deduplication on or off.

The control loop needs the following to be instantiated: the metrics to be collected, the utility function the user defines to guide the choice of a configuration, and the performance model used for prediction. Initially, two metrics are exposed to the user in order to define the utility function: (i) the time consumed writing a checkpoint image, and (ii) the amount of data stored. In the scenarios studied in this section, the utility function that drives the optimization is a linear combination of these two metrics. For example, the user may specify that, regardless of time, the storage footprint should be minimized, or conversely, that reducing the checkpointing time is the primary optimization criterion. Section 5.4 presents an extension of the solution that includes energy consumption as a third metric.

For each write operation, the monitor needs to collect the following information: (i) the operation timestamp, (ii) the total duration of the operation (note that this includes the time spent calculating the hash values and sending data over the network), (iii) the total number of chunks received by the write function, and (iv) the number of chunks similar to older versions.
From (iii) and (iv) the controller can extract the percentage of similarity, estimate the best configuration, and instruct the actuator to change the configuration.

Similar to the approach presented in Chapter 3, the controller is built on top of a performance model that estimates the impact of configuration changes on the system's performance, described in Section 5.3.1. Section 5.3.2 describes the implementation of each component of the loop. Section 5.3.3 evaluates the proposed solution.

5.3.1 Prediction Model for the Controller

This modeling exercise captures the behavior of write operations under inline deduplication and identifies the situations in which data deduplication is desirable. The performance model needs to predict the two metrics that are exposed to the utility function: (i) total time to write (Tw) and (ii) storage space used (Sw). Additionally, there are two possible configurations: similarity detection enabled or disabled.

Symbol   Description
Tw       Time to complete a write operation
Tio      Time to perform the I/O phase
Tsim     Time to detect data similarity
Twon     Tw when similarity detection is on
Twoff    Tw when similarity detection is off
tio(x)   Time to perform an I/O operation for x units of storage
d        Size of the data to be written (received by the write operation)
c        Size of a chunk
Sw       Amount of data to be stored
Swon     Sw when similarity detection is on
Swoff    Sw when similarity detection is off
Z        Data similarity detected
Th       Time to calculate hashes for the data chunks
th(x)    Time to calculate hashes for x units of storage
U        Utility function

Table 5.1: Terms of the data deduplication prediction model.

Given the terms of Table 5.1, the time to perform a write operation is given by:

    Tw = Tsim + Tio        (5.1)

Let d be the data size, and Z be the data similarity detected.
The time to perform the actual I/O phase is:

    Tio = tio(d × (1 − Z))        (5.2)

When similarity detection is disabled, predicting the space for a write operation is simple: the amount of data received by the write operation is the amount of data persisted. Formally,

    Swoff = d × (1 − Z) = d × (1 − 0) = d        (5.3)

The time depends on the I/O operations, which typically present higher variation and depend on several factors, including the operating system's operations, buffers, and other concurrent I/O operations. The model predicts the write time by analyzing the history of measurements obtained: it calculates a moving average of the time required to write a chunk, based on the last N operations, where N can be tuned to improve accuracy. Once the time to write a chunk is estimated, the total time for the write operation is obtained by multiplying the time to write a chunk by the number of chunks to write. That is,

    Twoff = Tsim + Tio = 0 + tio(Swoff) = tio(d) = tio(c) × d/c        (5.4)

When similarity detection is enabled, the total time is the sum of the time required to calculate the hash values for every chunk and the time to write the new chunks. The model ignores the time to compare hash values, as this is usually orders of magnitude smaller than the other two. The model estimates the number of similar chunks based on the history of the last measurements. The controller then knows how much space is needed and can estimate the cost of the I/O operations using the same approach as in Equation 5.4, when similarity detection is disabled. Formally,

    Swon = d × (1 − Z)        (5.5)

    Twon = Tsim + Tio = th(d) + tio(Swon) = [th(c) × d/c] + [tio(c) × Swon/c]        (5.6)

The system can obtain a reasonable approximation of the hashing overhead by calculating the hash for just one chunk and multiplying this value by the total number of chunks received for the write operation. This can be done by writing two versions of a file of chunk size with the same contents.
The operations related to the second file then consist of just hashing the data. Note that the controller can also keep a history of the time consumed calculating hashes, and estimate the time cost of hashing one chunk from that history.

Finally, the controller needs to choose the desired configuration. Currently, it uses the prediction model described in Equations 5.3, 5.4, 5.5, and 5.6 to estimate the metrics for the two possible configurations: similarity detection on or off. Then, the controller applies the utility function to the estimated metrics, verifies which configuration provides the best utility, and may request the actuator to change the current configuration if it is not the one with the best utility. Assuming a simple utility function that relies only on writing time (U(Tw) = −Tw), it is worth enabling deduplication if:

    U(Twon) > U(Twoff) ⇐⇒ Twon < Twoff        (5.7)

More complex utility functions may involve different metrics and need to normalize them. For example,

    U(Tw, Sw) = [γ1 × Tw / max(Twon, Twoff)] + [γ2 × Sw / max(Swon, Swoff)]        (5.8)

Equation 5.8 normalizes the value of each metric by dividing it by the maximum value among the different configuration options (e.g., Tw / max(Twon, Twoff)), and gives weights (γ1 and γ2) to the different factors. If the user specifies the weights as γ1 = γ2 = −0.5, the relative savings for both metrics have the same importance in the optimization criteria.

5.3.2 Implementation

This section describes the implementation of each of the components of the control loop described in Section 5.2, as well as the effort to integrate the control loop that enables and disables similarity detection into the MosaStore prototype.

To create a file, the client application requests a new file identifier from the manager and asks the manager to reserve the necessary space. The manager returns a list of storage nodes that have space.
The client then starts writing the chunks of the file to those storage nodes and assembles a chunk map that maps the chunks to their storage nodes for future retrieval [12]. Note that the client sends to the storage nodes only those chunks not found in the chunk maps of previous versions of that file, and reuses the chunks already stored by simply referencing their identifiers in the chunk map. Finally, when the client finishes writing, the chunk map is sent to the manager to be persistently stored. To detect similarity between the content of successive write operations, a content-addressable storage scheme is used, as described in Section 2.2.4. Other deduplication systems (e.g., Foundation [139] and Venti [135]) use similar techniques.

For each component of the control loop, a brief summary of its implementation in MosaStore follows:

Monitor. To collect measurements, MosaStore's default write operation flow was changed to a new one that captures the metrics previously described in this chapter, based on the design described in Section 2.1.2. The monitor also captures whether similarity detection was used, by saving the tag used to describe the file.

Controller. The controller receives measurements from the monitor and archives them. Once the archived history reaches a specific size (five measurements by default in the initial version), the controller starts to predict the amount of storage space used and the time consumed by write operations, according to the performance model described in Section 5.3.1. The utility function that drives the optimization is specified by the user through a configuration file. Finally, the controller uses the utility function and, based on the estimated metrics, determines which configuration provides the best utility, then informs the actuator if a configuration change is needed.
Section 5.3.3 shows that this online monitoring approach is effective in seeding the simple model regardless of the current configuration.

Actuator. Originally, MosaStore did not support configuration changes at runtime. A new feature that supports enabling/disabling similarity detection at runtime was added, following the approach described in Section 2.1. Note that to use this feature another option, file versioning, must also be enabled, as described by Al-Kiswany et al. [12].

The initial configuration has similarity detection activated, since the controller needs to know the level of similarity between writes. If the controller deactivates similarity detection, the system can no longer provide information about the similarity level. This problem is addressed by reactivating similarity detection after a given number of write operations (100 by default in the initial prototype). The overhead of this approach and alternative solutions are described in Section 5.3.3.

5.3.3 Evaluation

This section presents an evaluation of this prototype according to the success criteria listed in Section 5.2: (i) the effort to configure, (ii) the performance delivered by the automated solution, and (iii) the configuration overhead.

Testbed: MosaStore is configured to use 10 storage nodes. The metadata manager runs on a different machine. Each machine has an Intel Xeon E5345 4-core 2.33GHz CPU, 4GB RAM, and a 1Gbps NIC, as in testbed TB20 described in Section 3.3.

Workload: This study uses synthetic workloads that mimic checkpointing workloads, based on a previous NetSysLab study that collected and analyzed checkpoint images [12]; the workload generator produces files at regular time intervals and controls the similarity ratio between consecutive file versions, from 0 to 100% similarity.

Statistics: Each point in the plots is the average over all write operations performed during the application's execution.
It guarantees 95% confidence intervals with ±5% accuracy, according to the procedure described by Jain [98].

Effort to Configure

The proposed solution requires minimal human effort to configure: the user only needs to specify her intention. In the prototype, this is expressed by specifying the weights of runtime and storage footprint size in the utility function. Note that the default configuration of the utility function assumes that the user's intention is to minimize writing time.

System Performance

To analyze how satisfactory the automated configuration is, this section compares the performance of the system using automated tuning with the performance of the two manual configurations available: similarity detection always on or always off.

The first target scenario uses a synthetic workload based on past NetSysLab experience in analyzing checkpointing workloads with short checkpointing intervals, as used for debugging [12]. The synthetic application writes 100 files of approximately 256MB each, with 70% data similarity between writes, in order to analyze the case where the user is interested in minimizing writing time across the three different configurations (i.e., the lower the writing time, the higher the utility, as in Equation 5.7).

The proposed automated solution is able to detect that the similarity in this case is high enough to offset the computational overhead generated by hash-based similarity detection. Indeed, the automated solution performs as well as the configuration in which similarity detection is always activated, which is the best configuration for this case.

To analyze the system's performance under different scenarios, the same synthetic application varies the level of similarity present in the data. This is equivalent, for example, to varying the frequency of the checkpointing operations.
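A workload generator of this kind (a new file per interval, with a controlled fraction of chunks repeated from the previous version) can be sketched as follows; the function name, sizes, and chunk granularity are illustrative, not those of the actual NetSysLab generator:

```python
import os
import random

def make_checkpoint(prev, size, chunk, similarity):
    """Synthetic checkpoint image: a `similarity` fraction of chunks is
    copied from the previous version; the rest is fresh random data."""
    n = size // chunk
    parts = []
    for i in range(n):
        if prev is not None and random.random() < similarity:
            parts.append(prev[i * chunk:(i + 1) * chunk])  # repeated chunk
        else:
            parts.append(os.urandom(chunk))                # new content
    return b"".join(parts)

# Write a small series of versions at a fixed similarity level.
prev = None
for _ in range(3):
    prev = make_checkpoint(prev, size=1 << 20, chunk=1 << 16, similarity=0.7)
```

Sweeping the `similarity` parameter from 0.0 to 1.0 reproduces the experimental dimension varied in this section.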
For up to 50% similarity, similarity detection should be deactivated on this platform, as the hashing overhead is not paid off by savings in I/O operations. Figure 5.2 demonstrates that automated tuning chooses the correct configuration.

The slightly longer writing time when using automated tuning at low similarity levels is the result of the need to estimate the similarity level in the data stream before making a decision. The end of this evaluation section presents a brief discussion of the cost of automated tuning.

[Figure 5.2: Average time to write a snapshot of 256MB. Confidence intervals were small (less than 5%) and are omitted to reduce clutter.]

Experiments with writes of different sizes (32MB, 64MB, 128MB, and 512MB), while varying the similarity, exhibit similar behavior. Although the times required to complete the write operations differ from the experiments using 256MB, the automated configuration solution always chooses the desired configuration.

To analyze a case where the user is interested in more than one metric, the controller uses a utility function that considers storage space and writing time as equally important, giving 50% weight to each of the two metrics, as in Equation 5.8.

Figure 5.3 shows the average relative utility when writing checkpointing images of 256MB, while varying the similarity rate in the data stream. The results are normalized by the value of the lowest utility among the different configurations (detection always off, in this case) to provide a clear comparison.

Up to 20% similarity, the best configuration for the execution platform is having data deduplication deactivated, as the time needed to hash
[Figure 5.3: Average relative utility, when the two metrics are equally weighted, to write a snapshot of 256MB.]

the data is not compensated by the storage space saved and the savings in I/O operations. As the similarity increases, the similarity detection cost is paid off by the time saved during the I/O operations and the amount of storage space saved.

Similar to the case in which the system focuses on minimizing writing time, in this case the automated solution is able to detect that the similarity is high enough for data deduplication to provide a higher utility. Indeed, the automated solution provides the same utility as the configuration in which data deduplication is always activated (around 66% higher than the worst configuration at 70% similarity), which is the best configuration for this case. Overall, as Figure 5.3 shows, the automated configuration choice always matches the optimal manual configuration.

Cost of Automated Tuning

To evaluate the configuration cost, this section analyzes the amount of extra resources used for the automated configuration: namely, CPU time and memory.

Memory overhead is minimal: the controller only keeps a history of past measurements. This history is short, since the controller uses a moving average over the last N measurements. In the experiments, N = 5, which results in less than a kilobyte of extra memory.
Older data can be discarded, or persistently stored on the backend for future analysis.

As the experiments presented earlier suggest (see Figure 5.2), the performance of the configuration chosen by the automated solution, including the automation overhead, is within 5% of the performance of the best configuration.

The computing overhead generated depends on the state of the system. If data deduplication is activated, there is no additional overhead to estimate the similarity level, since similarity detection is already part of the data deduplication process. However, if data deduplication is deactivated, the system still needs to estimate the similarity level. It therefore starts hashing the data, which slows down the write operation by up to a factor of two in cases where there is no similarity among writes. Note that this additional cost exists only until the automated solution detects that hashing the data is not worthwhile. From that moment on, the cost is amortized over the following write operations. In the current implementation, whenever similarity detection is deactivated, the system activates it again after 100 writes.

Depending on the workload, however, keeping the computing overhead low can be challenging for an automated solution. This can result from a combination of two factors: (i) similarity detection is an intrusive storage technique, requiring some processing over all data contents, and (ii) the similarity may vary often, requiring detection to be activated more often as well.

To reduce this cost, two options are available: (i) using sampling to reduce the volume of data analyzed when detecting the level of similarity, and (ii) moving similarity estimation offline, out of the critical performance path, and using an alternative approach to estimate data similarity [153]. Section 6.2 discusses these alternatives.
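As a summary of Section 5.3.1, the decision rule of Equations 5.4-5.7, seeded with moving averages of the monitored per-chunk times and the observed similarity, can be condensed into a short sketch; the class and method names are illustrative and not part of the MosaStore code:

```python
from collections import deque

class DedupController:
    """Sketch of the controller's decision rule (Equations 5.4-5.7).
    Histories are moving averages over the last N observations."""

    def __init__(self, n=5):
        self.t_io = deque(maxlen=n)   # seconds to write one chunk
        self.t_h = deque(maxlen=n)    # seconds to hash one chunk
        self.z = deque(maxlen=n)      # observed similarity ratio

    def observe(self, t_io_chunk, t_hash_chunk, similarity):
        self.t_io.append(t_io_chunk)
        self.t_h.append(t_hash_chunk)
        self.z.append(similarity)

    def should_dedup(self, data_size, chunk_size):
        if len(self.z) < self.z.maxlen:   # not enough history yet:
            return True                   # keep detection on to learn Z
        avg = lambda h: sum(h) / len(h)
        n_chunks = data_size / chunk_size
        z = avg(self.z)
        t_w_off = avg(self.t_io) * n_chunks                     # Eq. 5.4
        t_w_on = (avg(self.t_h) * n_chunks
                  + avg(self.t_io) * n_chunks * (1 - z))        # Eq. 5.6
        return t_w_on < t_w_off                                 # Eq. 5.7
```

With per-chunk I/O at 10ms and hashing at 2ms, this rule enables deduplication above roughly 20% similarity, mirroring the break-even behavior observed in the evaluation above.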
5.4 Data Deduplication Trade-offs from an Energy Perspective

While the impact of data deduplication on traditional metrics (e.g., data throughput, storage footprint) is well understood [12, 135, 139], previous studies leave an important gap: energy consumption analysis. Specifically, they overlook two important issues. First, while data deduplication increases the CPU load, it may reduce the load on the network and storage devices. As a result, it is unclear under what scenarios, if any, it leads to energy savings. Second, the performance impact of energy-centric tuning of the storage system remains unexplored.

This section targets the case of fixed-chunk data deduplication from an energy perspective. First, to reduce the aforementioned gap in the analysis of the energy consumption of data deduplication, Section 5.4.1 describes the methodology for an empirical evaluation of the energy consumption of data deduplication in the target case. Section 5.4.2 assesses energy consumption on two hardware platforms with different power-proportionality characteristics and identifies deduplication's break-even points, the thresholds that delineate when it is worthwhile to deduplicate data from an energy or time perspective, demonstrating that the break-even points for performance (in terms of writing time) and for energy efficiency are different.

Finally, Section 5.4.3 presents a simple energy consumption model that makes it possible to reason about the benefits of deduplication and offers an approximation of the energy break-even point. This model is used to extend the solution presented in Section 5.3 to consider energy consumption as a third optimization metric.

This study is related to a rapidly growing body of work on deploying energy-efficient systems. It joins others whose focus is to understand the energy consumption of compression techniques in different scenarios [48, 104].
To the best of my knowledge, this work is the first to study the impact of data deduplication on storage system energy consumption. Moreover, it also considers different generations of machines, to demonstrate the impact of new power-proportional hardware (i.e., hardware whose power consumption is proportional to the utilization level) [25] on energy consumption.

The empirical evaluation and energy model suggest that, as storage systems and their components become increasingly energy proportional, the energy and time (or throughput) break-even points will shift further apart. This trend has an important consequence: optimizations for energy efficiency and for performance will likely conflict. As a result, storage system designers and users will have to make conscious and informed decisions about which metric to optimize for.

5.4.1 Assessing Performance and Energy Consumption

To investigate the impact of deduplication on energy consumption, an empirical approach was chosen: monitoring MosaStore, as a representative distributed storage system that adopts deduplication, while subjecting it to a checkpointing-like workload.

Workload. This section analyzes a synthetic workload that mimics checkpointing applications, like the one described in Section 5.3.3. The workload generator produces files at regular time intervals and controls the similarity ratio between consecutive file versions (from 0 to 100% similarity).

Assessment Testbed

The performance and energy consumption evaluation was based on two classes of machines, labeled 'new' and 'old' to make clear that they are from different generations:

'new' machines (Dell PowerEdge 610) are equipped with an Intel Xeon E5540 (Nehalem) @ 2.53GHz CPU, launched in Q1'09 with a maximum Thermal Design Power (TDP) of 80W, 48GB RAM, a 1Gbps NIC, and two 500GB 7200rpm SATA disks. Nehalem is an Intel architecture that exhibits major improvements in power efficiency.
Indeed, a machine with Nehalem in this testbed consumes 86W in idle mode and 290W at peak utilization.

'old' machines (Dell PowerEdge 1950) are equipped with an Intel Xeon E5395 (Clovertown) @ 2.66GHz CPU, launched in Q4'06 with a maximum TDP of 120W, 8GB RAM, a 1Gbps NIC, and two 300GB 7200rpm SATA disks. One of these machines consumes 188W in idle mode and 252W at peak. (Note that these are the same machines as in TB20, presented in Chapter 3.)

All machines run Fedora 14 Linux. MosaStore uses the same configuration in all experiments. As in Section 3.4.2, the lack of infrastructure to measure power limits the scale of the testbed for these experiments. Thus, only two machines from each testbed are used: the storage module runs on one machine; the manager, the SAI, and the workload generator run on a second machine. The machines are connected by a Dell PowerConnect 6248 10Gbps switch, whose energy consumption is not reported in this study.

The machines are equipped with two WattsUP Pro [4] power meters that measure energy consumption at the wall power socket, capturing the energy consumption of the entire system. A third machine collects the measurements from the meters via a USB interface. The meters provide 1W power resolution, a 1Hz sampling rate, and ±1.5% accuracy.

Measuring Energy Consumption

Since the meters offer only a 1Hz maximum sampling rate, the experiments collect measurements every second during an experimental batch, from the beginning of the first write until the completion of the last write. The power is given in watts (joules per second) for the last measurement interval, providing an estimate of the energy consumed during that interval. For each machine, the sum of these energy estimates gives the total amount of energy consumed during the experiment batch.
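The accumulation just described is, in effect, a discrete integral of the 1Hz power samples. A minimal sketch, with hypothetical sample values:

```python
def energy_joules(power_samples_w, interval_s=1.0):
    """Total energy from periodic power readings: each 1 Hz sample
    (watts = joules/second) approximates the energy of its interval."""
    return sum(p * interval_s for p in power_samples_w)

def avg_energy_per_write(power_samples_w, n_writes):
    """Average energy per checkpoint write over an experimental batch."""
    return energy_joules(power_samples_w) / n_writes

# Hypothetical batch: 10 one-second samples around 200 W, 2 writes.
samples = [198, 203, 201, 199, 200, 202, 197, 200, 201, 199]
per_write = avg_energy_per_write(samples, n_writes=2)  # 2000 J total / 2
```

The 1W resolution and ±1.5% accuracy of the meters bound the error of each term in the sum.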
This section reports the average energy consumed per write, obtained by dividing the total energy consumption by the number of checkpoint images written.

The evaluation considers the energy consumed by all storage system components: manager, storage node, and client. On the client node, however, the workload generator runs together with the storage system component. Since it is not possible to isolate the consumption of the storage system path alone, the evaluation conservatively reports the energy consumed by the whole system, including the workload generator.3

In the plots, each point presents the average value for a file write operation (for energy or time), calculated over a batch of 50 writes at a fixed similarity level. The experiments consider different data sizes (32, 64, 128, 256, and 512MB) while varying the similarity level (0%-100%, in increments of 10%). Although the time and energy required for each operation vary, the overall relation between energy consumption, time, and similarity is the same regardless of the data size. Thus, the plots show results only for 256MB.

5.4.2 Evaluation of the Energy Consumption Results

The goal of this evaluation is to analyze the energy consumption and the writing-time performance when writing a checkpoint image, as well as their break-even points in terms of similarity level, for platforms with different energy-proportionality properties. Figures 5.4a and 5.4b present the energy consumption and the average write time per checkpoint on the 'new' testbed, respectively. The main point to note is that the break-even points for energy and performance are different: for similarity lower than 18%, the hashing overhead is not compensated by the energy savings in I/O operations; thus, enabling deduplication brings benefits only when the workload has a similarity rate higher than 18%.
From a time performance perspective, on the other hand, enabling deduplication makes sense only for even higher similarity levels (higher than 40%).

A second point to note is that deduplication achieves different relative gains for energy and time. For energy, the highest consumption level (at 0% similarity) is 2.9x larger than the lowest (at 100% similarity). For the time to write a checkpoint, this ratio is 3.6x. Although deduplication enables energy savings starting at a lower similarity level than it does for saving time (i.e., energy consumption has an earlier break-even point), at 95% similarity deduplication saves at almost the same rate for both energy and time (around 50%).

3 Indeed, additional experiments show the workload generator is lightweight compared to the client SAI.

[Figure 5.4: (a) Average energy consumption; (b) average time to write. Average energy consumed and time to write a 256MB file, with data deduplication on and off, for different similarity levels in the 'new' testbed. Note: Y-axis does not start at 0 in Figure 5.4a.]

Figures 5.5a and 5.5b present the energy consumption and the average time per checkpoint write for the 'old' testbed, respectively. Compared to the 'new' testbed, the break-even point for energy (at 10% similarity) is closer to the one for performance (at 16% similarity). For this testbed, there are similar differences in the relative gains that deduplication enables.

One important fact to note is that, although the two testbeds have almost the same performance profile, as evidenced by the checkpoint write performance, they have different energy profiles.
The energy consumption per write with deduplication turned off is about 45-50% higher in the 'old' testbed, even though the writing time is only 10% higher. With deduplication turned on and a high similarity rate, the differences are even more striking: about 2x higher energy consumption in the 'old' testbed for about the same write time (Figure 5.4a). The reason for these results is that the newer generation machines with Nehalem CPUs are more power proportional and save energy by matching the level of resources enabled (e.g., switching cores on and off) to the offered load.

[Figure 5.5: (a) Average energy consumption; (b) average time to write. Average energy consumed and time to write a 256MB file for different similarity levels in the 'old' testbed. Note: Y-axis does not start at 0 in Figure 5.5a.]

Summary

The main lesson from these experiments is the following: while for non-energy-proportional machines performance- and energy-centric optimizations have break-even points that are relatively close, for the newer generation of energy-proportional machines the break-even points are significantly different. An important consequence of this difference is that, with newer systems, there are higher energy inefficiencies when the system is optimized for performance. The experiments presented above quantify these inefficiencies: on the 'old' testbed, optimizing for performance leads to up to a 5% energy inefficiency (in the 10-16% similarity rate interval); on the 'new' testbed, it leads to up to a 20% energy inefficiency (in the 18-40% similarity rate interval).
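These inefficiency figures follow from comparing, at each similarity level, the energy of the configuration a performance-driven tuner would choose against the energy-optimal one. A sketch with hypothetical per-write measurements:

```python
def perf_optimized_energy_overhead(time_on, time_off, energy_on, energy_off):
    """Energy penalty (as a fraction) of choosing the faster configuration
    instead of the more energy-efficient one, at one similarity level."""
    chosen = energy_on if time_on < time_off else energy_off
    best = min(energy_on, energy_off)
    return chosen / best - 1.0

# Hypothetical point in the 18-40% similarity band on the 'new' testbed:
# deduplication already saves energy (earlier break-even) but is still slower,
# so a performance-driven tuner keeps it off and pays an energy penalty.
overhead = perf_optimized_energy_overhead(
    time_on=4.2, time_off=3.9,            # seconds per write
    energy_on=1500.0, energy_off=1800.0)  # joules per write
```

Outside the band between the two break-even points, the time-optimal and energy-optimal choices coincide and the overhead is zero.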
5.4.3 Modelling Data Deduplication Trade-offs for Energy Consumption

Section 5.4.2 shows that deduplication can bring energy and/or performance savings if enough similarity exists in the workload. It also shows that the break-even points for energy and performance are different and depend on the characteristics of the deployment environment. Under these conditions, system users need a mechanism to support their configuration decisions related to deduplication.

This section proposes a simple model to guide the storage system configuration. The model can be used in two ways. First, it can identify whether deduplication will lead to energy savings for a given level of similarity. Second, it can estimate the energy impact of upgrading the system; for example, by adding SSDs to improve energy efficiency and time performance, or by adding energy-efficient accelerators (e.g., GPUs) to support deduplication [11, 76].

Two guidelines direct the design of the model. First, the model should be simple to use and not require extensive machine benchmarking or the use of power meters; instead, it should be possible to seed it with the power information available in the technical data sheets of the various components. Second, the model should be simple and intuitive, even at the cost of lower accuracy, since such models have higher chances of being adopted and used to guide decisions in complex settings.

Consider the following variables:

• B is the total number of chunks to be written (received by a write operation), and c is the chunk size.

• Pidle, Ppeak, and Pio characterize different power consumption profiles: the power consumed by the machine in the idle state, at peak CPU load, and at peak I/O (disk and network) load, respectively.

• Eh(c) is the extra energy consumed by the machine to compute the hash value for one data chunk. It can be roughly approximated by
Eh(c) = (Ppeak − Pidle) × Th(c), where Th(c) is the time for hashing a single chunk.

• Eio is the energy consumed in the transfer of one chunk on the storage path, including all system calls, sending the chunk from the client machine, receiving it at the storage node, and storing it: Eio = 2 × (Pio − Pidle) × Tio(c). The factor 2 appears because one client and one storage node are both involved in the storage path, and Tio(c) is the time for sending, receiving, and storing a chunk.

• Z is the detected similarity ratio of the data.

For each write operation, the extra energy needed to compute the hash values is Eh(c) × B. The energy saved by reducing the stress on the storage path is Eio × B × Z. Whenever the energy savings are higher than the additional energy spent computing hash values, it is worth turning deduplication on. Note that this choice is independent of the data volume B. That is,

    Eio × B × Z > Eh(c) × B ⇐⇒ Z > Eh(c) / Eio = [(Ppeak − Pidle) × Th(c)] / [2 × (Pio − Pidle) × Tio(c)]        (5.9)

The above modeling exercise highlights the main theme: for older, non-energy-proportional systems, optimizing for energy and for performance are similar. Thus, the decision to optimize for energy depends only on the relative runtimes to hash or store a data chunk, and on the similarity level present in the workload.

Power proportionality brings new factors into this equation: the relative positions of the power consumed when idle, under maximum I/O load, and under maximum compute load. Once a system is power proportional (that is, if Pidle is significantly lower than Ppeak and/or Pio), and it draws different power levels at peak CPU and at peak I/O load (Ppeak ≠ Pio), a richer trade-off space emerges.

Seeding. The user can easily use this formula (Equation 5.9) to guide her deduplication-related decisions. The parameters needed can be easily
The parameters needed can be easily collected by benchmarks (Th(c) and Tio(c)), or can be provided by system assemblers in technical sheets (Pidle, Ppeak, and Pio), and estimated or extracted from the workload history (Z).

Evaluation of the Energy Model Accuracy

To evaluate the accuracy of this simple model, the model’s prediction of the energy break-even point is compared with actual measurements. In this case, the actual measurements come from the two testbeds of Section 5.4.2. To benchmark the testbeds and estimate Pidle, Ppeak, and Pio, benchmark scripts run an idle workload as well as CPU-, disk-, and network-intensive workloads, and measure the consumed power for each workload separately. Benchmarking the testbeds and plugging their characteristics into the model indicates that the energy break-even points are at 21.4% similarity for the ‘new’ testbed and at 18.1% similarity for the ‘old’ testbed. This is close to a perfect oracle, since the actual measurements indicate 18% and 10%, respectively, for the energy break-even points.

Summary. Despite the model’s simplicity, it estimates the break-even point with relatively good accuracy. The model only fails to predict the proper configuration when the similarity ratio is between 18% and 21.4% in the new testbed, or between 10% and 18.1% in the old one. Additionally, the cost of such a misprediction is low; when the similarity is in these ranges, deduplication configured using the model consumes less than 10% extra energy compared to the optimal configuration (Figures 5.4a and 5.5a).

Note that designing an accurate fine-granularity model for storage system energy consumption is a complex task: the main reason is that it is hard to decouple the energy consumption of the different components in the system, as well as the energy consumed by the application running on the system.

5.5 Related Work

This work directly relates to efforts in the design and evaluation of data deduplication solutions and energy-efficient systems.
5.5.1 Data Deduplication

A number of research and commercial systems employ various forms of deduplication, targeting two main goals: (i) reducing the storage footprint – such as Venti [135] and Foundation [139], optimized for archiving, Mirage [138], optimized for storing virtual machine images, and DEBAR [172], optimized for enterprise-scale backup services; and (ii) improving time performance by reducing the pressure on the persistent storage, or the volume of data transferred over the network, including low-bandwidth file systems [125], web acceleration [97], content-based caching [103], and the high-performance storage system StoreGPU [11, 76].

While previous work focuses on the storage footprint and write-time-related benefits in systems of different scales, the impact of deduplication on energy consumption has not been clear, and the possibility of automating the configuration has not been explored. On the one hand, detecting similarity introduces computational overhead to compute the hashes of data chunks. On the other hand, if there is detectable similarity in the workload, this computational overhead can be offset by reduced storage or network effort. This chapter explores this trade-off from an energy standpoint and proposes an initial solution to automate the decision at runtime, depending on utility functions that rely on the energy consumption, time, and storage footprint of write operations.

5.5.2 Energy Optimized Systems

With non-power-proportional hardware (i.e., hardware that draws the same power regardless of the utilization level), energy efficiency [25] is tightly coupled with high resource utilization. Consequently, to increase energy efficiency, previous work recommends increasing system utilization through mechanisms such as server consolidation [87], or through runtime optimizations to reduce per-task energy consumption [146].

Hardware has, however, become increasingly power-proportional.
This trend opens new opportunities for energy efficiency in the software stack; it enables shifting work from the most energy-efficient (most power-proportional) component of a computer system to the less energy-efficient component (e.g., from disk to CPU) to reduce the total volume of energy consumed for a specific task [87].

Recently, Chen et al. [48] and Kothiyal et al. [104] have investigated the trade-off of compressing data, or not, in the context of MapReduce applications and data centers, respectively. Similar to this work, they concluded that compressing data is not always the best choice in terms of energy consumption, as it depends on the workload. Ma et al. [114] investigate deduplication performance using a commodity low-power co-processor for hash calculations. Unlike this work, they do not explore the trade-off of exchanging I/O operations for extra CPU load, and do not quantify deduplication energy savings. To the best of my knowledge, this work is the first to study the energy impact of deduplication and to put in perspective the impact of new generations of computing systems that have different energy-proportionality profiles.

5.6 Summary and Discussion

This study focuses on an automated configuration solution for data compression, through similarity detection, in the context of checkpointing applications. The evaluation presented in this chapter demonstrates that the developed prototype correctly configures the storage system to optimally enable or disable deduplication with minimal error. Additionally, this study exposes the cost of delivering an automated solution and how challenging energy and similarity estimation can be.

The progress in this direction meets the main goal of shedding light on the challenges and accuracy of automated tuning in the context of data deduplication.
It also has direct practical value, as automating the choice of enabling or disabling data deduplication allows one to decouple the concerns of the application’s scientist and developer from those of a storage system operator.

This chapter also analyzes energy consumption in the target scenario. Energy efficiency has become a pervasive theme in designing, configuring, and operating computer systems nowadays. While for non-energy-proportional computer systems energy- and performance-centric optimizations do not conflict, the recent trend towards increasingly energy-proportional systems opens new trade-offs that make the design space significantly more complex, as optimizations for these two criteria may conflict.

The energy analysis focuses on the energy consumption and response time of write operations using data deduplication in storage systems on two different generations of machines. This evaluation supports and quantifies the above intuition: the more power-proportional a system, the greater the opportunities to trade among different resources, and the larger the gap between performance- and energy-centric optimizations.

Additionally, this chapter proposes an energy consumption model that highlights the same issues and, in spite of its simplicity, can be used to reason about the energy and performance break-even points when configuring a storage system.

The discussion below aims to clarify some of the limitations and lessons of this preliminary study.

What is the impact of using other similarity detection mechanisms (e.g., content-based chunking) instead of fixed-size chunking?

As Section 2.2.4 presents, fixed-size chunking creates a new chunk after every fixed number of bytes. It has the advantage of being computationally cheaper than content-based chunking, since the chunking scheme does not need to analyze the data contents. It is, however, not robust to data insertions or deletions that are not a multiple of the chunk size.
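To make this weakness concrete, the sketch below (illustrative only; the tiny chunk size and the use of SHA-1 are arbitrary assumptions) shows how a single inserted byte shifts every subsequent fixed-size boundary, so none of the previously stored chunk hashes match:

```python
import hashlib

def fixed_size_chunks(data: bytes, chunk_size: int = 4):
    """Split data at fixed offsets and hash each chunk."""
    return [hashlib.sha1(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]

original = b"AAAABBBBCCCCDDDD"
edited = b"X" + original  # a one-byte insertion at the front

before, after = fixed_size_chunks(original), fixed_size_chunks(edited)
print(len(set(before) & set(after)), "of", len(after),
      "chunk hashes reused")  # → 0 of 5 chunk hashes reused
```

Because every boundary after the insertion moves, the write sends all chunks again even though almost all of the data is unchanged.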
Content-based chunk boundaries (content-based chunking) [125], on the other hand, are more robust to data insertions and deletions, since the chunk boundary is determined by specific data patterns rather than by a specific size. The trade-off is the high computational cost of analyzing the data content versus the higher degree of data similarity detected [12].

After verifying that content-based chunking had a high cost, and that the extra similarity detected would not pay off the overhead cost of chunking the data (i.e., the time cost of chunking the data would always be higher than sending it over the network), the initial solution described in this chapter focused on fixed-size chunking. Al-Kiswany et al. [12] describe a similar conclusion.

What is the impact of hardware accelerators on speeding up chunking operations and the proposed approaches?

Past work has proposed the use of GPUs to offload hash computations (for instance, my collaborators and I [53]), as well as systems that use these offloaded hash computations to speed up data deduplication operations (e.g., StoreGPU [11, 76] and ShedderGPU [34]). These systems leverage the extra computational power of GPUs and make the fixed-size chunking overhead almost negligible when compared to the I/O cost. As a consequence, in terms of writing time, enabling data deduplication is almost always the desired approach.

Additionally, these systems overcome the problem introduced by the computationally-intensive operations required for content-based chunking, in the sense that, depending on the degree of similarity in the data, content-based chunking can actually speed up write operations for inline data deduplication – which, according to Al-Kiswany et al. [15], is not provided by current CPU-based implementations.

In this case, the decision space shifts; using content-based chunking can improve performance, depending on the similarity of the data.
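For contrast with the fixed-size scheme, a toy content-defined chunker is sketched below. It is a deliberate simplification: real systems such as the low-bandwidth file system [125] use Rabin fingerprints over a sliding window, plus minimum and maximum chunk sizes, rather than the byte-sum boundary test assumed here:

```python
import hashlib

def content_defined_chunks(data: bytes, window: int = 4, mask: int = 0x0F):
    """Toy content-defined chunking: cut wherever the byte sum of the last
    `window` bytes is a multiple of 16 (a stand-in for a real fingerprint)."""
    chunks, start = [], 0
    for i in range(window, len(data)):
        if sum(data[i - window:i]) & mask == 0:  # boundary depends on content
            chunks.append(hashlib.sha1(data[start:i]).hexdigest())
            start = i
    chunks.append(hashlib.sha1(data[start:]).hexdigest())
    return chunks

data = bytes([4, 4, 4, 4, 1, 1, 1, 1, 4, 4, 4, 4, 7, 7])
edited = bytes([9]) + data  # one-byte insertion at the front

before, after = content_defined_chunks(data), content_defined_chunks(edited)
print(len(set(before) & set(after)), "of", len(before),
      "chunk hashes reused")  # → 2 of 3 chunk hashes reused
```

Because the boundaries move with the content, the insertion only invalidates the first chunk; the surviving chunk hashes are exactly the robustness property discussed above, bought at the price of scanning every byte.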
This makes the decision of using GPUs and content-based chunking similar to that of using CPUs and fixed-size chunking, as described in this chapter. Thus, the solution proposed in this study can serve as an initial step for this new strategy.

Can this solution be applied to data deduplication in other workloads?

The target scenario of this chapter is inline data deduplication using fixed-size chunking for checkpointing applications, due to their potential data similarity. Other applications can offer similarity as high as checkpointing and, therefore, could use the proposed approach to optimize response time or energy consumption. As examples, Jayaram et al. [99] present an empirical analysis of virtual machine repositories, and Park and Lilja [130] analyze datasets of backup applications; storage space savings can be as high as 60% for a generic archival workload [135], or 95% for a virtual machine image repository [109].

How difficult is it to predict similarity for different data deduplication parameters based on a history of write operations?

Data deduplication may offer several configurable parameters that affect the degree of detected data similarity. Content-based chunking, for instance, is based on a sliding window to detect chunk boundaries and receives the window size, the minimum chunk size, and the number of bits advanced at each step, among other parameters. As Tangwongsan et al. [153] show, estimating how the detectable degree of similarity changes as these parameters change is difficult, if not impossible. Instead, they propose a method to estimate an optimal expected chunk size, which can be obtained as a result of the parameters listed above, for the same data.

This lack of stability (and predictability) in the similarity of the data flow makes it difficult to perform on-the-fly adaptation for inline data deduplication.
Moreover, it increases the cost of estimating similarity, since it requires more probing activations of similarity detection.

Does energy consumption impose extra challenges over time and storage as part of the optimization criteria?

Capturing the behavior of this new metric in the model requires the extension described in Section 5.4. This preliminary evaluation did reveal that considering energy consumption poses a challenge to monitoring. In contrast to what happens for measuring time and storage space, machines are not commonly equipped with a means to monitor power consumption easily via software. Even when the machines have power meters, the granularity is coarse, with sampling rates of 1 Hz or 2 Hz at best. Thus, estimating the energy consumed by storage system operations requires analyzing a batch of operations, which reduces the precision of quantifying the consumption of a single operation. Moreover, accessing the measurements can be troublesome, since it is common for them to require different user permissions, or a specialized API.

Due to the difficulties of monitoring energy, a predictor for energy consumption has one additional requirement: it should not depend on fine-grained measurements from power meters during runtime. Ideally, the predictor should rely on a model that can be seeded only with benchmarks based on coarse-grained measurements. Alternatively, the seeding procedure could use power information available in the technical data sheets of various hardware components.

Additionally, it is not possible to isolate the consumption of the storage system only, or of any software system for that matter; this study’s evaluation conservatively reports the energy consumed by the whole system.
Isolating the energy consumption of just a part of a system, as is possible for time measurements, remains a challenge for future research.

How should the results obtained change for new hardware?

Part of this discussion is related to the use of GPUs, or faster processors in general, as presented earlier in this section. This chapter presents an empirical analysis indicating that, for storage systems that use power-proportional computing components, the energy and throughput break-even points are further apart than for less power-proportional systems. Additionally, the proposed energy model suggests that, as the trend toward more power-proportional systems continues, optimizations for energy efficiency and for performance will likely conflict more often.

A direct consequence of this conflict is that storage system designers, administrators, and users will have to decide, via conscious and informed decisions, which metric to optimize for.

Chapter 6

Concluding Remarks

Clusters of computers, including parallel supercomputers, that are designed to provide a high-performance computing platform for batch applications typically provide a centralized persistent backend storage system. To avoid the potential bottleneck of accessing the platform’s backend storage system, intermediate storage systems aggregate resources of the application’s allocation to provide a shared, temporary storage space dedicated to the application’s execution [14, 31, 60, 112].

Indeed, distributed storage systems have emerged to replace centralized storage solutions for large computing systems, as the distributed solution has appealing benefits over the centralized one, including lower cost, higher efficiency and performance, and incremental scalability. As a drawback, making decisions about resource provisioning and storage system configuration is more costly in distributed systems.
This happens because managing data across several nodes requires more complex coordination than in a centralized solution, and because distributed storage techniques expose trade-offs that rarely exist in centralized solutions and are strongly dependent on the applications.

Instead of offering a “one size fits all” solution via fixed parameters for these decisions, some systems offer versatile solutions. These versatile systems target improved application performance by supporting the ability to ‘morph’ the storage system to best match the application’s demands. To this end, versatile storage systems significantly extend the flexibility to change the storage system at deployment- or run-time. This flexibility, however, introduces a new problem: a much larger, and potentially dynamic, configuration space to be explored.

In this scenario, system provisioning, resource allocation, and configuration decisions for I/O-intensive batch applications are complex even for expert users. Users face choices at multiple levels: choosing the amount of resources, allocating these resources to individual sub-systems (e.g., the application layer, or the storage layer), and configuring each of these optimally (e.g., replication level, chunk size, and caching policies in the case of storage) – all having a large impact on the overall performance of the application.

Consequently, using a versatile storage system entails searching a complex multi-dimensional configuration space to determine the user’s ideal cost vs. performance balance point, making manual configuration an undesirable task.

The research presented in this dissertation addresses these problems in the context of intermediate storage systems.
In particular, this work targets workflow applications – which communicate via files and are the focus of this study – as well as checkpointing applications.

The traditional performance metrics that this research addresses as the optimization criteria are a primary concern for many systems: application turnaround time, allocation cost of resources, and response time and data throughput of storage system operations.

Additionally, this research addresses energy consumption as part of the optimization criteria. First, it presents an analysis of the energy consumption in target scenarios to reduce the gap in past research when it comes to modeling and understanding the trade-offs of optimizing for energy consumption versus other traditional performance metrics. Second, it extends the initial solution proposed for the prediction of an application’s performance according to traditional performance metrics to take energy consumption into account as well.

The end goal of this research is to provide support for provisioning and configuration decisions for intermediate storage systems to deliver the success metrics (e.g., response time, storage footprint, energy consumption) close to user-defined optimization criteria. Specifically, this dissertation proposes performance prediction mechanisms that leverage the target application’s characteristics to accelerate the exploration of the provisioning and configuration space. The mechanisms rely on monitoring information available at the application level, requiring neither changes to the storage systems nor specialized monitoring systems. Additionally, these mechanisms can be used as building blocks for a solution that automates the configuration decisions.

The proposed solutions meet the following requirements: First, they reduce human intervention – they require minimal human intervention to enable or disable various optimization techniques and to choose their configuration parameters.
Second, they produce a satisfactory configuration – they enable configuration choices that bring the performance of the system close to the user’s intention. Third, they have a low exploration cost – the overhead of using the proposed solutions is low when compared to the cost of running the application or I/O operations several times.

6.1 Contributions and Impact

This section presents a summary of the contributions that this dissertation provides, as well as a discussion of their impact.

6.1.1 Performance Prediction Mechanisms: Models and Seeding Procedures

“Essentially, all models are wrong, but some are useful.” – G.E.P. Box (1976)

To provide performance predictors that satisfy the requirements listed above and detailed in Section 1.2, this study focuses on the following questions.

How can one leverage the characteristics of I/O-intensive workflow applications to build a prediction mechanism for traditional performance metrics (e.g., time-to-solution and network usage)?

Chapter 3 shows that the complexity of a prediction mechanism can be reduced by relying on some of the characteristics of workflow applications: relatively large files, almost distinct I/O and processing phases, many-reads-single-write access, and specific data access patterns.
Moreover, as the goal is to support configuration choice for a specific workload, achieving perfect accuracy – at the cost of extra complexity – is less critical, as long as the mechanism can support configuration decisions.

By reducing the complexity of the prediction mechanism, this work is able to provide a system identification procedure to obtain the model’s parameters (i.e., to seed the model) that has two key features: (i) it relies on application-level operations (e.g., read and write) instead of detailed measurements of the execution path of these operations through the storage system, and (ii) it relies on a small deployment of the system, despite the goal of predicting the performance of a larger-scale deployment. These two key features allow the mechanism to be simple, lightweight, and non-intrusive, since it does not require storage system or kernel changes to collect monitoring information, while still being effective at predicting the target performance metrics.

The impact of such an approach is that it enables the selection of a good configuration choice in a reasonable amount of time. Specifically, for the wide range of scenarios explored, the predictor is lightweight – being 200x to 2000x less resource-intensive than running the application itself – and accurate – 80% of the evaluated scenarios have predictions within 10% error and, in the worst-case scenario, the prediction is still within 21%. This accuracy is similar to or higher than that of past work that predicted the behavior of distributed storage systems or distributed applications. For example, Thereska et al. [156] target different workloads in the storage system domain while using a more detailed model and monitoring information [158], Herodotou et al. [91] focus on map-reduce jobs, and Kaviani et al. [100] provide a solution to support the placement of software functions in a
cloud environment.

What are the Challenges and Efficacy of an Automated Solution for the Online Configuration of an Intermediate Storage System?

Chapter 5, “Automatically Enabling Data Deduplication”, focuses on the experience of providing a simple performance predictor for online configuration by exploring data deduplication in the context of repetitive write operations (e.g., checkpointing applications).

Chapter 5 presents a solution that relies on a repetitive pattern and on a prediction mechanism, similar to the solution for workflow applications in Chapter 3. This solution for data deduplication, however, is predicated on the key assumption that repetitive similar operations occur throughout the application lifetime of a checkpointing application [12]. This characteristic provides the opportunity to compare the effects of various configuration options in order to detect a “good enough” configuration during runtime. In this case, it uses a control loop that relies on a simple, yet effective, performance model.

The progress in this direction not only sheds light on the challenges related to automated online configuration, but is also directly applicable in practice, since automating the choice of enabling or disabling deduplication allows the application’s scientist/user to decouple her concerns about the checkpointing characteristics (e.g., checkpointing technology, frequency) from the configuration of the storage system.

Chapter 5 also points out that, depending on the workload, keeping a low computing overhead can be challenging for an automated solution because (i) similarity detection is an intrusive storage technique, and (ii) the similarity may vary frequently, requiring similarity detection to be activated more often as well. Section 6.2.4 discusses this case in more detail.
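As an illustration, the control loop mentioned above can be sketched in a few lines. This is a hypothetical structure, not the dissertation’s implementation; `measure_similarity` stands in for the intrusive similarity probe, and the per-chunk costs `e_hash` and `e_io` come from the model of Chapter 5:

```python
def predicted_benefit(similarity, e_hash, e_io):
    """Per-chunk benefit the model predicts: energy (or time) saved on the
    storage path minus the extra hashing cost."""
    return similarity * e_io - e_hash

def control_loop(writes, measure_similarity, e_hash, e_io, probe_every=10):
    """Probe similarity every `probe_every` writes and keep deduplication
    enabled only while the model predicts a net benefit."""
    decisions, enabled = [], False
    for n, write in enumerate(writes):
        if n % probe_every == 0:          # probing activation: intrusive, so rare
            z = measure_similarity(write)
            enabled = predicted_benefit(z, e_hash, e_io) > 0
        decisions.append(enabled)         # store(write, dedup=enabled) would go here
    return decisions

# Toy run: similarity drops below the break-even ratio (0.5 / 2.0 = 25%)
# halfway through, so the loop turns deduplication off at the next probe.
sims = [0.6] * 10 + [0.1] * 10
decisions = control_loop(range(20), lambda w: sims[w],
                         e_hash=0.5, e_io=2.0, probe_every=10)
print(decisions[0], decisions[10])  # → True False
```

The probing interval is the knob that trades the overhead of similarity detection against how quickly the loop reacts to a change in the workload.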
6.1.2 Energy Trade-offs Assessment and Extension of the Prediction Mechanisms

The energy study presented in this dissertation addresses two main questions:

• Is energy consumption subject to different trade-offs than response time, or are optimizing for energy consumption and for response time coupled goals?

• Which extensions does the performance predictor need in order to capture energy consumption behavior, in addition to traditional performance metrics?

Chapter 3 shows that workflow applications may have different trade-offs between optimizing for time-to-solution and for energy consumption, depending on the characteristics of the application. To incorporate energy consumption as a possible part of the optimization criteria, Chapter 3 presents an extension of the performance predictor to estimate the energy consumption of I/O-intensive workflow applications, based on the idea of associating power profiles, while satisfying the same requirements that the performance predictor should meet. Yang et al. [171] present and evaluate this extension in detail, and demonstrate that the proposed prediction mechanism can estimate energy consumption with accuracy similar to that achieved for traditional performance metrics.

Chapter 5 explores how online monitoring can enable online adaptation, and the initial experience of applying a power-profile-based approach to estimate the energy consumption of a data deduplication technique. Chapter 5 demonstrates the effectiveness of the prediction mechanism extension, in terms of accuracy, for energy consumption. Moreover, it demonstrates that, in the context of data deduplication, hardware platforms with different energy-proportionality characteristics present different trade-offs between optimizing for time-to-solution and for energy consumption.

6.1.3 Supporting Development

The prediction mechanism proposed in this dissertation can also be used to support distributed system development. To assess this case, this study
demonstrates the potential gains of using the prediction mechanism, beyond its original design goal, by applying the performance predictor to better understand and debug a distributed storage system that the NetSysLab group has been developing. For this use case, this dissertation sheds light on the following questions: Which, if any, are the potential benefits that the proposed performance predictor brings to the software development process of a storage system? and What are the limitations and challenges of using a performance predictor as part of the development process?

The developers compare the mechanism’s predictions to the actual performance and decide whether they are close enough to their goal. If they are not, they proceed to debug the system, and follow this process iteratively as part of the development. The predictor was also used to evaluate the potential gains of new system features.

This study provides evidence that the use of performance predictors during software development helps developers deal with the non-functional requirement of meeting a specific efficiency, similar to how automated tests verify functional requirements. Using the predictor helps the developers to set goals and to improve the system’s performance. Specifically, this approach improved confidence in the implementation of MosaStore for some scenarios, and pointed out situations that needed further improvement. The latter made possible an improvement of the system’s performance by up to 30% and a 10x decrease in the response time variance for some of the target benchmarks.

Using the predictor as part of the software development process presents some of the challenges inherent to any performance predictor: difficulties in properly seeding the model and in mimicking the benchmarks used (i.e., the workload) to evaluate the performance of its implementation.
It also entails the difficulties inherent to any additional step in the development process: providing a tool that is easy to use and brings perceptible value to the developers.

An important challenge that this approach brings is defining what is “close enough” to the performance goal. Although the predictor can set an objective ideal target to stop debugging, the developers are still in charge of defining how much variation in the target value is tolerated, since some overhead brought by the distributed system is expected.

6.1.4 Storage System and Prediction Tool Prototypes

Chapters 3, 4, and 5 present evaluations that rely on the prototype implementations of the MosaStore storage system and the proposed prediction mechanisms. These prototypes are available in the code repositories for MosaStore¹ and the predictors². These repositories also contain the scripts and configuration data used to execute the experiments for the evaluation described in this dissertation, as well as the scripts and a summary of the results of the experiments³ used in the data analysis of this study.

In addition to the research results and impact presented above, as is typical in Computer Systems research, part of a research project’s contribution is a working software system. In this case, the MosaStore storage system is the working prototype based on, and/or serving as a research platform for the evaluation of, a number of research ideas [11–15, 17, 75–77, 115, 159, 171] from several institutions, including this dissertation [52, 54, 55, 57, 58, 60].

The implementation of these prototypes follows a development process that includes automated unit and acceptance tests, as well as a code review process⁴.
Having a working prototype available that relies on structured testing and code-review processes, and that is used by various projects, also increases confidence in the evidence that this research provides.

In addition to improving MosaStore’s usability by providing support for its configuration, it is worth pointing out that MosaStore’s performance and stability improved as a direct consequence of the research described in this dissertation (see Chapter 4).

¹ The logs are available at the NetSysLab cluster, since they are too large to be uploaded to these repositories.
⁴ Code reviews are available at

6.2 Limitations and Future Work

In addition to the results described in Section 6.1, the research presented in this dissertation identifies some limitations of the proposed solutions, which are summarized in this section. Additionally, this section describes possible extensions to this research, including alternatives to overcome the identified limitations.

6.2.1 Complete Automation of User Decisions

The proposed prediction mechanisms accelerate the process of evaluating different decisions for the provisioning and configuration of storage systems. One natural next step is to use the prediction mechanisms to provide a completely automated solution that receives a utility function capturing the user’s intention and produces a configuration that maximizes the utility.

In the past, several approaches have been proposed to search for parameters that maximize a given function. For example, in the context of a computer system’s configuration, the Recursive Random Search (RRS) algorithm has been applied to large-scale network parameter optimization [173] and to map-reduce jobs [90, 92].
The RRS algorithm uses random sampling, with an adjusted sample space, to tolerate random noise and irrelevant or trivial parameters, which is relevant to the domain of this research. Alternatively, approaches based on genetic algorithms could also provide complete automation by using the proposed prediction mechanisms. For example, Strunk et al. [152] propose a set of utility functions for storage systems and a tool based on genetic algorithms to find the input that maximizes these functions. Behzad et al. [27] also use genetic algorithms to optimize the runtime of parallel applications by sampling the configuration space via actual application runs. These, or similar, approaches can use the solution described in this dissertation as a building block for a tool that automates the user’s decisions about provisioning and configuration of intermediate storage systems. For example, a genetic algorithm could query a performance predictor using different values for the configuration parameters instead of running the actual application.

Additionally, note that the user may not be interested in a single configuration that optimizes a utility function, or she may not have a utility function. In this case, the user would be interested in seeing a sample of possible configurations with different trade-offs in order to pick one. An example of this is the case of performance vs. allocation cost (Sections 3.3.3 and 3.3.4), where the user is interested in what she believes to be a cost/performance balance point. For these cases, a complete automation solution could use a Pareto sampling approach to provide the user with a subset of the most desirable configuration choices for a given optimization criterion (i.e., the Pareto frontier) [16].

6.2.2 Prediction Mechanism Support for Heterogeneous Environments

Some cluster and cloud environments are not completely homogeneous across nodes.
This heterogeneity breaks one of the assumptions held by the seeding procedure described in Chapter 3, namely that it can rely on a minimal deployment to be simple and low cost.

Chapter 3 argues that the storage system simulation is able to support heterogeneity and, in fact, presents a heterogeneous scenario for the evaluation of the reduce benchmark. In this case, the seeding procedure should be performed on the different types of machines and the results provided to the prediction mechanism as part of the setup to be evaluated. For real applications, however, the computing part of the workflow tasks is hard to incorporate in the prediction mechanism in the way it has been proposed in this study. The main problem is that it would require re-executions of the tasks on different machine types, which may take long, breaking the prediction mechanism's requirement of being low cost.

Two alternatives could overcome this problem. First, a task sampling approach could enable the predictor to execute just part of the tasks per stage and then infer the behavior of the other tasks in the same stage [91, 140]. In this case, at least one complete execution of the application should be performed in order to guarantee that the input files for all tasks will be available for later sampling on heterogeneous machines, as an incomplete execution would prevent the execution of the next stage. In fact, for cases where partial execution of a stage can still produce useful results, this approach can already be applied to a homogeneous environment [91].
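A minimal sketch of this per-stage sampling idea follows. The task-runtime records, the stage data, and the sampling fraction are illustrative assumptions; the actual predictor interface is not reproduced here.

```python
import random

def sample_stage_runtime(task_runtimes, fraction=0.2, seed=0):
    """Estimate a stage's per-task runtime by timing only a sampled
    fraction of its tasks on the target machine type, then extrapolating
    the sample mean to the whole stage."""
    rng = random.Random(seed)
    k = max(1, int(len(task_runtimes) * fraction))
    sampled = rng.sample(task_runtimes, k)
    mean = sum(sampled) / k
    # Inferred total compute time if all tasks behave like the sample.
    return mean, mean * len(task_runtimes)

# Hypothetical stage with 50 tasks whose runtimes cluster around 10 s.
stage = [10.0 + 0.1 * (i % 5) for i in range(50)]
per_task, total = sample_stage_runtime(stage)
print(per_task, total)
```

The inferred per-task mean would then feed the prediction mechanism in place of a full re-execution of the stage on each machine type.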
Based on the Montage and BLAST cases in this study (Section 3.3), a sample as small as 20% of the tasks in a stage can be effective for inferring the behavior of the other tasks, and it would only lead to an additional 2-6% prediction error. Second, a relative performance approach [119, 120] proposes the use of basic operations to benchmark different types of machines, and then describes them in terms of their relative performance. In this case, one complete execution of the task is still needed, but the predictor could then infer the performance of each task on a different type of machine based on the relative performance description. A hybrid approach that combines task sampling and relative performance may also be effective.

6.2.3 Support for Virtual Machines and More Complex Network and Storage Device Interactions

Using dedicated nodes (i.e., a space-shared platform) is the typical set-up for clusters and supercomputers, and such platforms are also available in clouds. Virtual machines, however, have become widely available as an alternative on these platforms. One of the consequences of the use of virtual machines is performance interference within the physical machine. Although performance insulation mechanisms are available and effective for some of the machine components (e.g., the CPU), it is difficult to provide insulation for storage and network devices [83, 161].

This lack of perfect performance insulation can lead to larger errors by the performance prediction mechanism and, hence, ineffective support for storage configuration. In this case, the following extension could be useful: First, an evaluation would provide evidence of whether the performance interference affects the storage to a point that prevents the performance prediction mechanisms from supporting the user in making her decisions about the storage system.
Second, depending on the results of this evaluation, the prediction mechanism could be extended to incorporate more detailed models of the platform components (e.g., the use of NS2 [8] to capture more complex network interactions, and of DiskSim [1] for storage devices). Alternatively, this problem can be addressed through the evolution of performance insulation mechanisms and Quality-of-Service (QoS) agreements [127].

6.2.4 Support for GPU and Content-Based Chunking for Data Deduplication

The work on an automated solution for data deduplication, presented in Chapter 5, was an initial study with the goal of assessing how online monitoring can enable online adaptation, and of reporting the initial experience of applying a power profile-based approach to estimate the energy consumption of a data deduplication technique.

As discussed in Section 5.6, two extensions can enrich the automated solution proposed for data deduplication. First, the use of GPUs to offload hash computations can speed up data deduplication operations. Based on the small variance of the results per chunk, as evaluated in the StoreGPU project [11, 13, 15, 76, 77], the approach proposed in Chapter 5 should be effective in predicting response time. An extended evaluation is needed to assess the impact of this approach on predicting energy consumption.

The second extension is related to the configuration of the several parameters that can affect the degree of detected data similarity. Tangwongsan et al. [153] propose an approach that can be an initial step in this direction; their method estimates an "optimal" expected chunk size, which can be obtained as a result of the parameters to be configured over the same data. Tangwongsan et al. [153], however, also discuss that estimating how the detectable degree of similarity changes is difficult, even for the same data (e.g., the same virtual machine image).
Such estimation should be harder for cases over different datasets (e.g., different checkpoint images), which are the target of this dissertation.

6.2.5 Study on the Use of Performance Predictors to Support Development

The use case discussed in this dissertation should be extended to evaluate a larger set of distributed systems and developers, to verify the generalization of the observations presented in Chapter 4, and to understand what information would provide more benefits to the developers (e.g., profiling aid).

In addition to understanding what extra information can aid the developers in the profiling and debugging process, it is necessary to assess when and how to use a predictor in the development process. Specifically, the community would benefit from understanding the trade-offs related to its use, the comparison with simpler alternatives such as back-of-the-envelope calculations, and the trade-offs between the extra effort of pursuing a given performance level indicated by a predictor versus the benefits of obtaining such performance.

Bibliography

[1] DiskSim. → pages 36, 94, 100, 167

[2] FUSE: Filesystem in userspace. → pages 17, 109

[3] Stress. → pages 85

[4] Watts up? Pro product details. → pages 143

[5] A conversation with Jim Gray. Queue, 1(4):8–17, June 2003. ISSN 1542-7730. doi:10.1145/864056.864078. → pages 34

[6] Overview of DOCK6. June 2011. → pages 26

[7] IEEE 28th Symposium on Mass Storage Systems and Technologies, MSST 2012, April 16-20, 2012, Asilomar Conference Grounds, Pacific Grove, CA, USA, 2012. IEEE. → pages 172, 183

[8] The network simulator ns2. 2012. → pages 36, 37, 46, 166

[9] pNFS. 2012. → pages 22

[10] M. Abd-El-Malek, W. V. Courtright II, C. Cranor, G. R. Ganger, J. Hendricks, A. J. Klosterman, M. P. Mesnier, M. Prasad, B. Salmon, R. R. Sambasivan, S. Sinnamohideen, J. D. Strunk, E. Thereska, M. Wachs, and J. J. Wylie. Ursa Minor: Versatile cluster-based storage.
In Proceedings of the Conference on File and Storage Technologies, December 2005. → pages 1, 2, 3, 16, 18, 22, 23, 37, 49, 57, 109

[11] S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, G. Yuan, and M. Ripeanu. StoreGPU: Exploiting graphics processing units to accelerate distributed storage systems. In Proceedings of the 17th International Symposium on High Performance Distributed Computing, HPDC '08, pages 165–174, New York, NY, USA, 2008. ACM. ISBN 978-1-59593-997-5. doi:10.1145/1383422.1383443. → pages 15, 21, 25, 29, 126, 147, 150, 153, 163, 167

[12] S. Al-Kiswany, M. Ripeanu, S. S. Vazhkudai, and A. Gharaibeh. stdchk: A Checkpoint Storage System for Desktop Grid Computing. In International Conference on Distributed Computing Systems, ICDCS, pages 613–624, Washington, DC, USA, 2008. IEEE Computer Society. ISBN 978-0-7695-3172-4. → pages 2, 3, 25, 29, 33, 127, 129, 135, 136, 137, 141, 152, 153, 160

[13] S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, and M. Ripeanu. On GPU's viability as a middleware accelerator. Cluster Computing, 12(2):123–140, June 2009. ISSN 1386-7857. doi:10.1007/s10586-009-0076-0. → pages 167

[14] S. Al-Kiswany, A. Gharaibeh, and M. Ripeanu. The case for a versatile storage system. ACM SIGOPS Operating Systems Review, 44:10–14, March 2010. ISSN 0163-5980. → pages 1, 2, 3, 6, 11, 14, 15, 16, 21, 23, 24, 25, 40, 44, 49, 69, 108, 156

[15] S. Al-Kiswany, A. Gharaibeh, and M. Ripeanu. GPUs as storage system accelerators. IEEE Transactions on Parallel and Distributed Systems, 24(8):1556–1566, 2013. ISSN 1045-9219. → pages 153, 163, 167

[16] S. Al-Kiswany, H. Hacıgümüş, Z. Liu, and J. Sankaranarayanan. Cost exploration of data sharings in the cloud. In Proceedings of the 16th International Conference on Extending Database Technology, EDBT '13, pages 601–612, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-1597-5. doi:10.1145/2452376.2452447. → pages 165

[17] S. Al-Kiswany, E. Vairavanathan, L. B. Costa, H. Yang, and M. Ripeanu.
The case for cross-layer optimizations in storage: A workflow-optimized storage system. CoRR, abs/1301.6195, 2013. → pages 15, 16, 18, 20, 21, 28, 30, 32, 163

[18] S. Al-Kiswany, L. B. Costa, E. Vairavanathan, H. Yang, and M. Ripeanu. A software defined storage for scientific workflow applications. Under Review, September 2014. → pages vii, 23, 102

[19] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, October 1990. ISSN 0022-2836. → pages 4, 5, 26, 43, 60, 69

[20] G. A. Alvarez, E. Borowsky, S. Go, T. H. Romer, R. Becker-Szendy, R. Golding, A. Merchant, M. Spasojevic, A. Veitch, and J. Wilkes. Minerva: An automated resource provisioning tool for large-scale storage systems. ACM Transactions on Computer Systems, 19(4):483–518, November 2001. ISSN 0734-2071. doi:10.1145/502912.502915. → pages 3, 7, 35, 40

[21] E. Anderson, M. Hobbs, K. Keeton, S. Spence, M. Uysal, and A. Veitch. Hippodrome: Running circles around storage administration. In Proceedings of the Conference on File and Storage Technologies, pages 175–188. USENIX Association, 2002. → pages 40, 93, 95

[22] E. Anderson, S. Spence, R. Swaminathan, M. Kallahalla, and Q. Wang. Quickly finding near-optimal storage designs. ACM Transactions on Computer Systems, 23(4):337–374, November 2005. ISSN 0734-2071. → pages 3, 7, 40, 93, 95

[23] T. E. Anderson, M. D. Dahlin, J. M. Neefe, D. A. Patterson, D. S. Roselli, and R. Y. Wang. Serverless network file systems. ACM Transactions on Computer Systems (TOCS), 14(1):41–79, 1996. → pages 2

[24] S. Balsamo, A. Di Marco, P. Inverardi, and M. Simeoni. Model-based performance prediction in software development: A survey. IEEE Transactions on Software Engineering, 30(5):295–310, 2004. → pages 106, 107, 120, 121

[25] L. A. Barroso and U. Hölzle. The Case for Energy-Proportional Computing. IEEE Computer, 40(12):33–37, 2007. ISSN 0018-9162. → pages 25, 142, 150

[26] S. Basu, L. B. Costa, F.
Brasileiro, S. Banerjee, P. Sharma, and S.-J. Lee. NodeWiz: Fault-tolerant grid information service. Journal of Peer-to-Peer Networking and Applications, 2(4):348–366, 2009. ISSN 1936-6442. doi:10.1007/s12083-009-0030-1. → pages iv, 195

[27] B. Behzad, H. V. T. Luu, J. Huchette, S. Byna, Prabhat, R. Aydt, Q. Koziol, and M. Snir. Taming parallel I/O complexity with auto-tuning. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '13, pages 68:1–68:12, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2378-9. doi:10.1145/2503210.2503278. → pages 1, 3, 7, 8, 95, 164

[28] B. Behzad, S. Byna, S. M. Wild, M. Prabhat, and M. Snir. Improving parallel I/O autotuning with performance modeling. In Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing, HPDC '14, pages 253–256, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2749-7. doi:10.1145/2600212.2600708. → pages 8

[29] C. L. Belady. In the data center, power and cooling costs more than the IT equipment it supports. February 2007. → pages 84

[30] J. Bent, D. Thain, A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, and M. Livny. Explicit control in a batch-aware distributed file system. In Proceedings of the 1st Conference on Symposium on Networked Systems Design and Implementation, NSDI '04, pages 27–38, Berkeley, CA, USA, 2004. USENIX Association. → pages 22, 24

[31] J. Bent, S. Faibish, J. Ahrens, G. Grider, J. Patchett, P. Tzelnic, and J. Woodring. Jitter-free co-processing on a prototype exascale storage stack. In MSST 2012 [7], pages 1–5. → pages 3, 24, 156

[32] G. B. Berriman, J. C. Good, D. Curkendall, J. Jacob, D. S. Katz, T. A. Prince, and R. Williams. Montage: An on-demand image mosaic service for the NVO. Astronomical Data Analysis Software and Systems XII, ASP Conference Series, 295:343, 2003. → pages 26, 43

[33] S. Bharathi, A. Chervenak, E. Deelman, G. Mehta, M.-H. Su, and K. Vahi.
Characterization of scientific workflows. In Workflows in Support of Large-Scale Science, 2008. WORKS 2008. Third Workshop on, pages 1–10, 2008. → pages 26, 27, 43, 62, 63

[34] P. Bhatotia, R. Rodrigues, and A. Verma. Shredder: GPU-accelerated incremental storage and computation. In 10th USENIX Conference on File and Storage Technologies (FAST '12), page 14, February 2012. → pages 153

[35] P. Bodík, R. Griffith, C. Sutton, A. Fox, M. Jordan, and D. Patterson. Statistical machine learning makes automatic control practical for internet datacenters. In Proceedings of the 2009 Conference on Hot Topics in Cloud Computing, HotCloud '09, Berkeley, CA, USA, 2009. USENIX Association. → pages 34, 39

[36] P. Bodík, R. Griffith, C. Sutton, A. Fox, M. I. Jordan, and D. A. Patterson. Automatic exploration of datacenter performance regimes. In Proceedings of the 1st Workshop on Automated Control for Datacenters and Clouds, ACDC '09, pages 1–6, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-585-7. doi:10.1145/1555271.1555273. → pages 34, 39

[37] G. Bosilca, A. Bouteiller, A. Danalis, T. Herault, P. Lemarinier, and J. Dongarra. DAGuE: A generic distributed DAG engine for high performance computing. In Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 2011 IEEE International Symposium on, pages 1151–1158, May 2011. doi:10.1109/IPDPS.2011.281. → pages 26

[38] S. Boyd-Wickizer, H. Chen, R. Chen, Y. Mao, F. Kaashoek, R. Morris, A. Pesterev, L. Stein, M. Wu, Y. Dai, Y. Zhang, and Z. Zhang. Corey: An operating system for many cores. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, OSDI '08, pages 43–57, Berkeley, CA, USA, 2008. USENIX Association. → pages 26

[39] D. Bradley, O. Gutsche, K. Hahn, B. Holzman, S. Padhi, H. Pi, D. Spiga, I. Sfiligoi, E. Vaandering, F. Würthwein, et al. Use of glide-ins in CMS for production and analysis. Journal of Physics: Conference Series, 219(7):072013, 2010. → pages 24

[40] F. Brüseke, G.
Engels, and S. Becker. Decision support via automated metric comparison for the Palladio-based performance blame analysis. In Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering, ICPE '13, pages 77–88, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-1636-1. doi:10.1145/2479871.2479886. → pages 122

[41] F. Cappello, E. Caron, M. Daydé, F. Desprez, Y. Jégou, P. Primet, E. Jeannot, S. Lanteri, J. Leduc, N. Melab, G. Mornet, R. Namyst, B. Quetier, and O. Richard. Grid'5000: a large scale and highly reconfigurable grid experimental testbed. In Grid Computing, 2005. The 6th IEEE/ACM International Workshop on, pages 8 pp.+, 2005. → pages 61, 87

[42] H. Casanova, A. Legrand, and M. Quinson. SimGrid: a Generic Framework for Large-Scale Distributed Experiments. In Proceedings of the Tenth International Conference on Computer Modeling and Simulation, UKSIM '08, pages 126–131, Washington, DC, USA, 2008. IEEE Computer Society. ISBN 978-0-7695-3114-4. → pages 46

[43] J. S. Chase, D. C. Anderson, P. N. Thakar, A. M. Vahdat, and R. P. Doyle. Managing energy and server resources in hosting centers. In Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles, SOSP '01, pages 103–116, New York, NY, USA, 2001. ACM. ISBN 1-58113-389-8. doi:10.1145/502034.502045. → pages 25, 34, 39

[44] S. Chaudhuri and V. Narasayya. AutoAdmin "what-if" index analysis utility. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, SIGMOD '98, pages 367–378, New York, NY, USA, 1998. ACM. ISBN 0-89791-995-5. doi:10.1145/276304.276337. → pages 37

[45] S. Chaudhuri and V. Narasayya. Self-tuning database systems: A decade of progress. In Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB '07, pages 3–14. VLDB Endowment, 2007. ISBN 978-1-59593-649-3. → pages 37

[46] S. Chaudhuri and G. Weikum. Rethinking database system architecture: Towards a self-tuning RISC-style database system.
In Proceedings of the 26th International Conference on Very Large Data Bases, VLDB '00, pages 1–10, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc. ISBN 1-55860-715-3. → pages

[47] S. Chaudhuri and G. Weikum. Foundations of automated database tuning. In Proceedings of the 32nd International Conference on Very Large Data Bases, VLDB '06, pages 1265–1265. VLDB Endowment, 2006. → pages 37

[48] Y. Chen, A. Ganapathi, and R. H. Katz. To compress or not to compress - compute vs. IO tradeoffs for MapReduce energy efficiency. In Proceedings of the First ACM SIGCOMM Workshop on Green Networking, Green Networking '10, pages 23–28, 2010. → pages 141, 151

[49] W. Cirne and F. Berman. Using moldability to improve the performance of supercomputer jobs. Journal of Parallel and Distributed Computing, 62(10):1571–1601, 2002. → pages 4, 18, 44

[50] W. Cirne, F. Brasileiro, N. Andrade, L. B. Costa, A. Andrade, R. Novaes, and M. Mowbray. Labs of the World, Unite!!! Journal of Grid Computing, 4(3):225–246, 2006. → pages 195

[51] V. Cortellessa and R. Mirandola. Deriving a queueing network based performance model from UML diagrams. In Proceedings of the 2nd International Workshop on Software and Performance, pages 58–70. ACM, 2000. → pages 122

[52] L. B. Costa and M. Ripeanu. Towards Automating the Configuration of a Distributed Storage System. In Proceedings of the 11th ACM/IEEE International Conference on Grid Computing, Grid'2010, October 2010. → pages vi, 1, 2, 7, 15, 21, 25, 33, 69, 125, 163

[53] L. B. Costa, S. Al-Kiswany, and M. Ripeanu. GPU Support for Batch Oriented Workloads. In Performance Computing and Communications Conference (IPCCC), 2009 IEEE 28th International, pages 231–238, December 2009. doi:10.1109/PCCC.2009.5403809. → pages iv, 25, 153, 195

[54] L. B. Costa, S. Al-Kiswany, R. V. Lopes, and M. Ripeanu. Assessing data deduplication trade-offs from an energy and performance perspective.
In Proceedings of the 2011 International Green Computing Conference and Workshops, pages 1–6. IEEE, 2011. doi:10.1109/IGCC.2011.6008567. → pages vi, 7, 15, 18, 21, 25, 33, 84, 125, 127, 163

[55] L. B. Costa, S. Al-Kiswany, A. Barros, H. Yang, and M. Ripeanu. Predicting intermediate storage performance for workflow applications. In Proceedings of the 8th Parallel Data Storage Workshop, PDSW '13, pages 33–38, New York, NY, USA, November 2013. ACM. ISBN 978-1-4503-2505-9. doi:10.1145/2538542.2538560. → pages v, 42, 163

[56] L. B. Costa, J. Brunet, L. Hattori, and M. Ripeanu. Experience on applying performance prediction during development: a distributed storage system tale. Technical report, UBC/ECE/NetSysLab, September 2013. → pages 12

[57] L. B. Costa, S. Al-Kiswany, H. Yang, and M. Ripeanu. Supporting storage configuration for I/O intensive workflows. In Proceedings of the 28th ACM International Conference on Supercomputing, ICS '14, pages 191–200. ACM, June 2014. ISBN 978-1-4503-2642-1/14/06. doi:10.1145/2597652.2597679. → pages iv, 30, 43, 78, 163

[58] L. B. Costa, S. Al-Kiswany, H. Yang, and M. Ripeanu. Supporting Storage Configuration and Provisioning for I/O Intensive Workflows. Under Review, 2014. → pages 43, 163

[59] L. B. Costa, J. Brunet, L. Hattori, and M. Ripeanu. Experience on applying performance prediction during development: a distributed storage system tale. In Proceedings of the 2nd International Workshop on Software Engineering for High Performance Computing in Computational Science and Engineering, SE-HPCCSE '14, pages 13–19, Piscataway, NJ, USA, November 2014. IEEE Press. ISBN 978-1-4799-7035-3. doi:10.1109/SE-HPCCSE.2014.6. Previously "International Workshop on Software Engineering for Computational Science and Engineering" and "International Workshop on Software Engineering for High Performance Computing Applications". → pages v, 105

[60] L. B. Costa, H. Yang, E. Vairavanathan, A. Barros, K. Maheshwari, G. Fedak, D. Katz, M. Wilde, M.
Ripeanu, and S. Al-Kiswany. The case for workflow-aware storage: An opportunity study. Journal of Grid Computing, pages 1–19, 2014. ISSN 1570-7873. doi:10.1007/s10723-014-9307-6. Published online in July 2014. → pages vi, vii, 3, 12, 15, 16, 18, 21, 23, 27, 28, 55, 57, 156, 163

[61] A. Crume, C. Maltzahn, L. Ward, T. Kroeger, M. Curry, and R. Oldfield. Fourier-assisted machine learning of hard disk drive access time models. In Proceedings of the 8th Parallel Data Storage Workshop, PDSW '13, pages 45–51, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2505-9. doi:10.1145/2538542.2538561. → pages 36, 100

[62] D. P. Da Silva, W. Cirne, and F. V. Brasileiro. Trading cycles for information: Using replication to schedule bag-of-tasks applications on computational grids. In Proceedings of the 9th International European Conference on Parallel Processing, Euro-Par '03, pages 169–180. Springer, 2003. → pages 57, 100

[63] B. Dageville and M. Zait. SQL memory management in Oracle9i. In Proceedings of the 28th International Conference on Very Large Data Bases, VLDB '02, pages 962–973. VLDB Endowment, 2002. → pages 34, 39

[64] K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, and K. Yelick. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In High Performance Computing, Networking, Storage and Analysis, 2008. SC 2008. International Conference for, pages 1–12, November 2008. doi:10.1109/SC.2008.5222004. → pages 34, 38

[65] A. Davies and A. Orsaria. Scale out with GlusterFS. Linux Journal, 2013(235), November 2013. ISSN 1075-3583. → pages 102

[66] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation - Volume 6, OSDI '04, pages 137–149, Berkeley, CA, USA, 2004. USENIX Association. → pages 100

[67] E. Deelman, G. Singh, M.-H. Su, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K.
Vahi, G. B. Berriman, J. Good, A. Laity, J. C. Jacob, and D. S. Katz. Pegasus: a framework for mapping complex scientific workflows onto distributed systems. Scientific Programming Journal, 13:219–237, 2005. → pages 26, 43

[68] R. P. Doyle, J. S. Chase, O. M. Asad, W. Jin, and A. M. Vahdat. Model-based resource provisioning in a web service utility. In Proceedings of the 4th Conference on USENIX Symposium on Internet Technologies and Systems, USITS '03, pages 5–19, Berkeley, CA, USA, 2003. USENIX Association. → pages 34, 39

[69] D. Economou, S. Rivoire, and C. Kozyrakis. Full-system power analysis and modeling for server environments. In Proceedings of Workshop on Modeling Benchmarking and Simulation, MOBS '06, 2006. → pages 96

[70] M. A. Erazo, T. Li, J. Liu, and S. Eidenbenz. Toward comprehensive and accurate simulation performance prediction of parallel file systems. In Proceedings of the 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN '12, pages 1–12. IEEE, 2012. → pages 36, 94

[71] X. Fan, W.-D. Weber, and L. A. Barroso. Power provisioning for a warehouse-sized computer. In Proceedings of the 34th Annual International Symposium on Computer Architecture, ISCA '07, pages 13–23, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-706-3. doi:10.1145/1250662.1250665. → pages 25

[72] X. Feng, R. Ge, and K. W. Cameron. Power and energy profiling of scientific applications on distributed systems. In Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS '05) - Papers - Volume 01, pages 34–, Washington, DC, USA, 2005. IEEE Computer Society. ISBN 0-7695-2312-9. doi:10.1109/IPDPS.2005.346. → pages 96

[73] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns: Elements of Reusable Object-oriented Software, chapter 5: Behavioral Patterns. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1995. ISBN 0-201-63361-2. → pages 19

[74] S. Gaonkar, K. Keeton, A. Merchant, and W.
H. Sanders. Designing dependable storage solutions for shared application environments. IEEE Transactions on Dependable and Secure Computing, 7(4):366–380, October 2010. ISSN 1545-5971. doi:10.1109/TDSC.2008.38. → pages 38

[75] A. Gharaibeh and M. Ripeanu. Exploring data reliability tradeoffs in replicated storage systems. In Proceedings of the 18th ACM International Symposium on High Performance Distributed Computing, HPDC '09, pages 217–226, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-587-1. doi:10.1145/1551609.1551643. → pages 163

[76] A. Gharaibeh, S. Al-Kiswany, S. Gopalakrishnan, and M. Ripeanu. A GPU accelerated storage system. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, HPDC '10, pages 167–178, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-942-8. doi:10.1145/1851476.1851497. → pages 15, 21, 25, 126, 147, 150, 153, 167

[77] A. Gharaibeh, S. Al-Kiswany, and M. Ripeanu. ThriftStore: Finessing reliability trade-offs in replicated storage systems. Parallel and Distributed Systems, IEEE Transactions on, 22(6):910–923, June 2011. ISSN 1045-9219. doi:10.1109/TPDS.2010.157. → pages 163, 167

[78] A. Gharaibeh, L. B. Costa, E. Santos-Neto, and M. Ripeanu. A yoke of oxen and a thousand chickens for heavy lifting graph processing. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, PACT '12, pages 345–354, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1182-3. doi:10.1145/2370816.2370866. → pages iv, 194

[79] A. Gharaibeh, L. B. Costa, E. Santos-Neto, and M. Ripeanu. On graphs, GPUs, and blind dating: A workload to processor matchmaking quest. In Parallel Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on, pages 851–862, May 2013. doi:10.1109/IPDPS.2013.37. → pages 194

[80] A. Gharaibeh, E. Santos-Neto, L. B. Costa, and M. Ripeanu.
The energy case for graph processing on hybrid CPU and GPU systems. In Proceedings of the 3rd Workshop on Irregular Applications: Architectures and Algorithms, IA3 '13, pages 2:1–2:8, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2503-5. doi:10.1145/2535753.2535755. → pages 194

[81] A. Gharaibeh, E. Santos-Neto, L. B. Costa, and M. Ripeanu. Efficient large-scale graph processing on hybrid CPU and GPU systems. Journal Submission, 2014. Under Review. → pages iv, 194

[82] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In SOSP '03: Proceedings of the 19th ACM Symposium on Operating Systems Principles, pages 29–43, New York, NY, USA, 2003. ACM Press. ISBN 1581137575. → pages 2, 16, 109

[83] A. Gulati, A. Merchant, and P. J. Varman. mClock: Handling throughput variability for hypervisor IO scheduling. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, OSDI '10, pages 1–7, Berkeley, CA, USA, 2010. USENIX Association. → pages 166

[84] S. Gurumurthi, A. Sivasubramaniam, M. Irwin, N. Vijaykrishnan, and M. Kandemir. Using complete machine simulation for software power estimation: the SoftWatt approach. pages 141–150, February 2002. → pages 96

[85] I. F. Haddad. PVFS: A Parallel Virtual File System for Linux Clusters. Linux Journal, 2000(80es), November 2000. ISSN 1075-3583. → pages 2, 16, 22, 49, 109

[86] P. H. Hargrove and J. C. Duell. Berkeley Lab Checkpoint/Restart (BLCR) for Linux clusters. Journal of Physics: Conference Series, 46(1):494, 2006. → pages 29

[87] S. Harizopoulos, M. A. Shah, J. Meza, and P. Ranganathan. Energy efficiency: The new holy grail of data management systems research. In CIDR, 2009. → pages 150, 151

[88] J. H. Hartman and J. K. Ousterhout. The Zebra striped network file system. ACM Transactions on Computer Systems, 13(3):274–310, 1995. → pages 16

[89] C. Heger, J. Happe, and R. Farahbod. Automated root cause isolation of performance regressions during software development.
In Proceedings of the ACM/SPEC International Conference on Performance Engineering, pages 27–38. ACM, 2013. → pages 123

[90] H. Herodotou and S. Babu. Profiling, what-if analysis, and cost-based optimization of MapReduce programs. Proceedings of the VLDB Endowment, 4(11):1111–1122, 2011. → pages 164

[91] H. Herodotou, F. Dong, and S. Babu. No one (cluster) size fits all: Automatic cluster sizing for data-intensive analytics. In Proceedings of the 2nd ACM Symposium on Cloud Computing, SoCC '11, pages 18:1–18:14, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0976-9. doi:10.1145/2038916.2038934. → pages 7, 44, 95, 159, 165, 166

[92] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu. Starfish: A self-tuning system for big data analytics. In Proceedings of the 5th Biennial Conference on Innovative Data Systems Research (CIDR), January 2011. → pages 95, 164

[93] J. H. Howard, M. L. Kazar, S. G. Menees, D. A. Nichols, M. Satyanarayanan, R. N. Sidebotham, and M. J. West. Scale and performance in a distributed file system. ACM Transactions on Computer Systems (TOCS), 6(1):51–81, 1988. → pages 2

[94] H. Huang and A. Grimshaw. Automated performance control in a virtual distributed storage system. In Proceedings of the 9th IEEE/ACM International Conference on Grid Computing, Grid '08, pages 242–249, September 2008. doi:10.1109/GRID.2008.4662805. → pages 39

[95] J. V. Huber, Jr., A. A. Chien, C. L. Elford, D. S. Blumenthal, and D. A. Reed. PPFS: A high performance portable parallel file system. In Proceedings of the 9th International Conference on Supercomputing, ICS '95, pages 385–394, New York, NY, USA, 1995. ACM. ISBN 0-89791-728-6. doi:10.1145/224538.224638. → pages 2, 3, 6, 18, 21

[96] D. Ibtesham, D. DeBonis, K. B. Ferreira, and D. Arnold. Coarse-grained energy modeling of rollback/recovery mechanisms.
In Proceedings of the 4th Fault Tolerance for HPC at eXtreme Scale (FTXS) 2014, held in association with the 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2014), 2014. → pages 84

[97] S. Ihm, K. Park, and V. S. Pai. Wide-area network acceleration for the developing world. In Proceedings of the USENIX Annual Technical Conference, USENIX ATC '10, Berkeley, CA, USA, 2010. USENIX Association. → pages 150

[98] R. Jain. The Art of Computer Systems Performance Analysis. Wiley Interscience. John Wiley & Sons, Inc., New York, NY, first edition, 1991. → pages 61, 114, 136

[99] K. R. Jayaram, C. Peng, Z. Zhang, M. Kim, H. Chen, and H. Lei. An empirical analysis of similarity in virtual machine images. In Proceedings of the Middleware 2011 Industry Track Workshop, Middleware '11, pages 6:1–6:6, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-1074-1. doi:10.1145/2090181.2090187. → pages 154

[100] N. Kaviani, E. Wohlstadter, and R. Lea. Manticore: A framework for partitioning software services for hybrid cloud. In 4th IEEE International Conference on Cloud Computing Technology and Science Proceedings, CloudCom 2012, pages 333–340. IEEE, December 2012. ISBN 978-1-4673-4511-8. → pages 44, 159

[101] K. Keeton and A. Merchant. A framework for evaluating storage system dependability. In Proceedings of the International Conference on Dependable Systems and Networks, DSN '04, pages 877–886, June 2004. doi:10.1109/DSN.2004.1311958. → pages 38

[102] K. Keeton, D. Beyer, E. Brau, A. Merchant, C. Santos, and A. Zhang. On the road to recovery: Restoring data after disasters. In Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems, EuroSys '06, pages 235–248, New York, NY, USA, 2006. ACM. ISBN 1-59593-322-0. doi:10.1145/1217935.1217958. → pages 38

[103] R. Koller and R. Rangaswami. I/O Deduplication: Utilizing Content Similarity to Improve I/O Performance.
In Proceedings of the USENIX Conference on File and Storage Technologies, FAST ’10, 2010. → pages 150

[104] R. Kothiyal, V. Tarasov, P. Sehgal, and E. Zadok. Energy and performance evaluation of lossless file data compression on server systems. In International Systems and Storage Conference, SYSTOR, pages 4:1–4:12, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-623-6. → pages 141, 151

[105] E. Kruus, C. Ungureanu, and C. Dubnicki. Bimodal content defined chunking for backup streams. In Proceedings of the 8th USENIX Conference on File and Storage Technologies, FAST ’10, pages 239–252, Berkeley, CA, USA, 2010. USENIX Association. → pages 33

[106] A. C. Laity, N. Anagnostou, G. B. Berriman, J. C. Good, J. C. Jacob, D. S. Katz, and T. Prince. Montage: An Astronomical Image Mosaic Service for the NVO. In P. Shopbell, M. Britton, and R. Ebert, editors, Astronomical Data Analysis Software and Systems XIV, volume 347 of Astronomical Society of the Pacific Conference Series, page 34, December 2005. → pages 60, 74

[107] T. C. Lethbridge, J. Singer, and A. Forward. How software engineers use documentation: The state of the practice. IEEE Software, 20(6):35–39, 2003. → pages 122

[108] H. Li, A. Ghodsi, M. Zaharia, S. Shenker, and I. Stoica. Tachyon: Reliable, memory speed storage for cluster computing frameworks. In Proceedings of the ACM Symposium on Cloud Computing, SoCC ’14, pages 6:1–6:15, New York, NY, USA, November 2014. ACM. ISBN 978-1-4503-3252-1. doi:10.1145/2670979.2670985. → pages 24

[109] A. Liguori and E. V. Hensbergen. Experiences with content addressable storage and virtual disks. In M. Ben-Yehuda, A. L. Cox, and S. Rixner, editors, Workshop on I/O Virtualization. USENIX Association, 2008. → pages 126, 154

[110] M. Liu, Y. Jin, J. Zhai, Y. Zhai, Q. Shi, X. Ma, and W. Chen. ACIC: Automatic Cloud I/O Configurator for HPC Applications.
In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’13, pages 38:1–38:12, New York, NY, USA, November 2013. ACM. ISBN 978-1-4503-2378-9. doi:10.1145/2503210.2503216. → pages 95

[111] N. Liu, C. Carothers, J. Cope, P. Carns, R. Ross, A. Crume, and C. Maltzahn. Modeling a leadership scale storage system. In Proceedings of the 9th International Conference on Parallel Processing and Applied Mathematics - Vol. Part I, PPAM ’11, pages 10–19, 2012. ISBN 978-3-642-31463-6. → pages 94

[112] N. Liu, J. Cope, P. H. Carns, C. D. Carothers, R. B. Ross, G. Grider, A. Crume, and C. Maltzahn. On the role of burst buffers in leadership-class storage systems. In Proceedings of the IEEE 28th Symposium on Mass Storage Systems and Technologies [7], pages 1–12. → pages 3, 24, 94, 126, 156

[113] J. F. Lofstead, S. Klasky, K. Schwan, N. Podhorszki, and C. Jin. Flexible I/O and integration for scientific codes through the adaptable I/O system (ADIOS). In Proceedings of the 6th International Workshop on Challenges of Large Applications in Distributed Environments, CLADE ’08, pages 15–24, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-156-9. doi:10.1145/1383529.1383533. → pages 22

[114] L. Ma, C. Zhen, B. Zhao, J. Ma, G. Wang, and X. Liu. Towards fast de-duplication using low energy coprocessor. In International Conference on Networking, Architecture, and Storage, pages 395–402, Los Alamitos, CA, USA, 2010. IEEE Computer Society. ISBN 978-0-7695-4134-1. → pages 151

[115] K. Maheshwari, J. M. Wozniak, H. Yang, D. S. Katz, M. Ripeanu, V. Zavala, and M. Wilde. Evaluating storage systems for scientific data in the cloud. In Proceedings of the 5th ACM Workshop on Scientific Cloud Computing, ScienceCloud ’14, pages 33–40, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2911-8. doi:10.1145/2608029.2608034. → pages 15, 21, 163

[116] D. A. Menasce, V. A. Almeida, L. W. Dowdy, and L. Dowdy.
Performance by Design: Computer Capacity Planning by Example. Prentice Hall PTR, Upper Saddle River, NJ, USA, 2004. ISBN 0130906735. → pages 121

[117] A. Merchant and P. Yu. Analytic modeling of clustered RAID with mapping based on nearly random permutation. IEEE Transactions on Computers, 45(3):367–373, March 1996. ISSN 0018-9340. doi:10.1109/12.485575. → pages 35

[118] M. Mesnier, E. Thereska, G. Ganger, D. Ellard, and M. Seltzer. File classification in self-* storage systems. In Proceedings of the International Conference on Autonomic Computing, pages 44–51, May 2004. doi:10.1109/ICAC.2004.1301346. → pages 36

[119] M. P. Mesnier, M. Wachs, R. R. Sambasivan, A. X. Zheng, and G. R. Ganger. Modeling the relative fitness of storage. In Proceedings of the 2007 ACM International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’07, pages 37–48, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-639-4. doi:10.1145/1254882.1254887. → pages 166

[120] M. P. Mesnier, M. Wachs, R. R. Sambasivan, A. X. Zheng, and G. R. Ganger. Relative fitness modeling. Communications of the ACM, 52(4):91–96, April 2009. ISSN 0001-0782. doi:10.1145/1498765.1498789. → pages 36, 166

[121] R. Mian, P. Martin, and J. L. Vazquez-Poletti. Provisioning data analytic workloads in a cloud. Journal of Future Generation Computer Systems, 29(6):1452–1458, 2013. ISSN 0167-739X. → pages 7, 39

[122] E. Molina-Estolano, C. Maltzahn, J. Bent, and S. Brandt. Building a parallel file system simulator. In Journal of Physics: Conference Series, volume 180, page 012050. IOP Publishing, 2009. → pages 1, 36, 93

[123] A. Montresor and M. Jelasity. PeerSim: A scalable P2P simulator. In Proceedings of the 9th International Conference on Peer-to-Peer (P2P ’09), pages 99–100, Seattle, WA, September 2009. → pages 36, 37, 46

[124] G. Moont. 3D-Dock suite, 2012. → pages 26, 43

[125] A. Muthitacharoen, B. Chen, and D. Mazières. A low-bandwidth network file system.
In Proceedings of the 18th ACM Symposium on Operating Systems Principles, SOSP ’01, pages 174–187, New York, NY, USA, 2001. ACM. ISBN 1-58113-389-8. → pages 29, 126, 150, 152

[126] D. Narayanan, E. Thereska, and A. Ailamaki. Continuous resource monitoring for self-predicting DBMS. In Proceedings of the 13th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, MASCOTS ’05, pages 239–248, Washington, DC, USA, 2005. IEEE Computer Society. ISBN 0-7695-2458-3. doi:10.1109/MASCOT.2005.21. → pages 37

[127] R. Nathuji, A. Kansal, and A. Ghaffarkhah. Q-Clouds: Managing performance interference effects for QoS-aware clouds. In Proceedings of the 5th European Conference on Computer Systems, EuroSys ’10, pages 237–250, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-577-2. doi:10.1145/1755913.1755938. → pages 167

[128] NLANR/DAST. Iperf. → pages 111

[129] S. Pakin and M. Lang. Energy Modeling of Supercomputers and Large-Scale Scientific Applications. In Proceedings of the 4th International Green Computing Conference, IGCC ’13, June 2013. → pages 96

[130] N. Park and D. Lilja. Characterizing datasets for data deduplication in backup applications. In Proceedings of the IEEE International Symposium on Workload Characterization, IISWC, pages 1–10, December 2010. doi:10.1109/IISWC.2010.5650369. → pages 154

[131] T. E. Pereira, L. Sampaio, and F. V. Brasileiro. On the accuracy of trace replay methods for file system evaluation. In Proceedings of the IEEE 21st International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems, MASCOTS ’13, pages 380–383. IEEE, 2013. → pages 48

[132] M. Petre. UML in practice. In 35th International Conference on Software Engineering (ICSE 2013), May 2013. → pages 122

[133] R. Prabhakar, E. Kruus, G. Lu, and C. Ungureanu.
EEffSim: A discrete event simulator for energy efficiency in large-scale storage systems. In Proceedings of the International Conference on Energy Aware Computing (ICEAC), pages 1–6. IEEE, 2011. → pages 96

[134] D. Qin, A. D. Brown, and A. Goel. Reliable writeback for client-side flash caches. In Proceedings of the USENIX Annual Technical Conference, ATC ’14, pages 451–462, Philadelphia, PA, June 2014. USENIX Association. ISBN 978-1-931971-10-2. → pages 2

[135] S. Quinlan and S. Dorward. Venti: A new approach to archival data storage. In Proceedings of the USENIX Conference on File and Storage Technologies, FAST ’02, Berkeley, CA, USA, 2002. USENIX Association. → pages 29, 135, 141, 150, 154

[136] I. Raicu, I. T. Foster, and Y. Zhao. Special section on many-task computing. IEEE Transactions on Parallel and Distributed Systems, 22(6):897–898, 2011. ISSN 1045-9219. → pages 26

[137] S. Ranjan, J. Rolia, H. Fu, and E. Knightly. QoS-driven server migration for internet data centers. In Proceedings of the 10th IEEE International Workshop on Quality of Service, pages 3–12. IEEE, 2002. → pages 34, 39

[138] D. Reimer, A. Thomas, G. Ammons, T. Mummert, B. Alpern, and V. Bala. Opening black boxes: Using semantic information to combat virtual machine image sprawl. In ACM International Conference on Virtual Execution Environments, VEE, pages 111–120, New York, NY, USA, 2008. ACM. ISBN 978-1-59593-796-4. → pages 126, 150

[139] S. Rhea, R. Cox, and A. Pesterev. Fast, inexpensive content-addressed storage in Foundation. In Proceedings of the USENIX 2008 Annual Technical Conference, USENIX ATC ’08, pages 143–156, Berkeley, CA, USA, 2008. USENIX Association. → pages 29, 135, 141, 150

[140] C. A. D. Rose, T. Ferreto, R. N. Calheiros, W. Cirne, L. B. Costa, and D. Fireman. Allocation strategies for utilization of space-shared resources in bag-of-tasks grids. Future Generation Computer Systems, 24(5):331–341, 2008. ISSN 0167-739X. → pages 165

[141] T. Samak, C. Morin, and D. Bailey.
Energy Consumption Models and Predictions for Large-Scale Systems. In Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, IPDPSW ’13, pages 899–906, Washington, DC, USA, 2013. IEEE Computer Society. ISBN 978-0-7695-4979-8. doi:10.1109/IPDPSW.2013.228. → pages 96

[142] E. Santos-Neto, W. Cirne, F. Brasileiro, and A. Lima. Exploiting replication and data reuse to efficiently schedule data-intensive applications on grids. In Proceedings of the 10th International Conference on Job Scheduling Strategies for Parallel Processing, JSSPP ’04, pages 210–232, Berlin, Heidelberg, 2005. Springer-Verlag. ISBN 3-540-25330-0, 978-3-540-25330-3. doi:10.1007/11407522_12. → pages 21, 57, 100

[143] E. Santos-Neto, S. Al-Kiswany, N. Andrade, S. Gopalakrishnan, and M. Ripeanu. Enabling cross-layer optimizations in storage systems with custom metadata. In Proceedings of the 17th International Symposium on High Performance Distributed Computing, HPDC ’08, pages 213–216, New York, NY, USA, 2008. ACM. ISBN 9781595939975. → pages 16, 18, 25

[144] F. Schmuck and R. Haskin. GPFS: A shared-disk file system for large computing clusters. In Proceedings of the 1st USENIX Conference on File and Storage Technologies, FAST ’02, pages 16–16, Berkeley, CA, USA, 2002. USENIX Association. → pages 22, 24

[145] P. Schwan. Lustre: Building a file system for 1000-node clusters. In Proceedings of the 2003 Linux Symposium, volume 2003, 2003. → pages 22

[146] P. Sehgal, V. Tarasov, and E. Zadok. Evaluating performance and energy in file system server workloads. In Proceedings of the 8th USENIX Conference on File and Storage Technologies, FAST ’10, pages 253–266, Berkeley, CA, USA, 2010. USENIX Association. → pages 25, 150

[147] T. Shibata, S. Choi, and K. Taura. File-access patterns of data-intensive workflow applications and their implications to distributed filesystems.
In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, HPDC ’10, pages 746–755, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-942-8. → pages 26, 27, 43, 62, 63

[148] A. Skelley. DB2 Advisor: An optimizer smart enough to recommend its own indexes. In Proceedings of the 16th International Conference on Data Engineering, ICDE ’00, pages 101–, Washington, DC, USA, 2000. IEEE Computer Society. ISBN 0-7695-0506-6. → pages 37

[149] C. U. Smith. Performance Engineering of Software Systems. Addison-Wesley, 1:990, 1990. → pages 121

[150] C. U. Smith and L. G. Williams. Performance engineering evaluation of object-oriented systems with SPE·ED. In Computer Performance Evaluation Modelling Techniques and Tools, pages 135–154. Springer, 1997. → pages 122

[151] A. J. Storm, C. Garcia-Arellano, S. S. Lightstone, Y. Diao, and M. Surendra. Adaptive self-tuning memory in DB2. In Proceedings of the 32nd International Conference on Very Large Data Bases, VLDB ’06, pages 1081–1092. VLDB Endowment, 2006. → pages 34, 39

[152] J. D. Strunk, E. Thereska, C. Faloutsos, and G. R. Ganger. Using utility to provision storage systems. In Proceedings of the 6th USENIX Conference on File and Storage Technologies, FAST ’08, pages 313–328, 2008. → pages 1, 3, 4, 7, 18, 34, 39, 45, 97, 130, 164

[153] K. Tangwongsan, H. Pucha, D. G. Andersen, and M. Kaminsky. Efficient similarity estimation for systems exploiting data redundancy. In Proceedings of the 29th Conference on Information Communications, INFOCOM ’10, pages 1487–1495, Piscataway, NJ, USA, 2010. IEEE Press. ISBN 978-1-4244-5836-3. → pages 140, 154, 167

[154] The HDF Group. Hierarchical Data Format, version 5, 1997–2014. → pages 95

[155] E. Thereska and G. R. Ganger. IRONModel: Robust performance models in the wild. In Proceedings of the 2008 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 253–264, 2008. → pages 37

[156] E. Thereska, M. Abd-El-Malek, J.
J. Wylie, D. Narayanan, and G. R. Ganger. Informed data distribution selection in a self-predicting storage system. In Proceedings of the 3rd International Conference on Autonomic Computing, pages 187–198, 2006. → pages 1, 3, 6, 7, 18, 34, 37, 40, 69, 94, 95, 159

[157] E. Thereska, D. Narayanan, and G. R. Ganger. Towards self-predicting systems: What if you could ask “what-if”? Knowledge Engineering Review, 21(3):261–267, September 2006. ISSN 0269-8889. doi:10.1017/S0269888906000920. → pages 37

[158] E. Thereska, B. Salmon, J. Strunk, M. Wachs, M. Abd-El-Malek, J. Lopez, and G. R. Ganger. Stardust: Tracking activity in a distributed storage system. In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’06/Performance ’06, pages 3–14, New York, NY, USA, 2006. ACM. ISBN 1-59593-319-0. doi:10.1145/1140277.1140280. → pages 34, 38, 41, 94, 159

[159] E. Vairavanathan, S. Al-Kiswany, L. B. Costa, Z. Zhang, D. S. Katz, M. Wilde, and M. Ripeanu. A workflow-aware storage system: An opportunity study. In Proceedings of the 12th IEEE International Symposium on Cluster Computing and the Grid, CCGrid ’12, pages 326–334, Los Alamitos, CA, USA, June 2012. IEEE Computer Society. ISBN 978-0-7695-4691-9. → pages vii, 11, 15, 16, 17, 18, 21, 23, 24, 27, 30, 44, 57, 60, 62, 66, 112, 163

[160] A. Varga. Using the OMNeT++ Discrete Event Simulation System in Education. IEEE Transactions on Education, 42(4), 1999. ISSN 0018-9359. → pages 94

[161] M. Wachs, M. Abd-El-Malek, E. Thereska, and G. R. Ganger. Argon: Performance insulation for shared storage servers. In Proceedings of the 5th USENIX Conference on File and Storage Technologies, FAST ’07, pages 5–20, Berkeley, CA, USA, 2007. USENIX Association. → pages 166

[162] Y. Wang and P. Lu. Dataflow detection and applications to workflow scheduling. Concurrency and Computation: Practice and Experience, 23(11):1261–1283, 2011. → pages 30, 55

[163] G. Weikum, A. Moenkeberg, C.
Hasse, and P. Zabback. Self-tuning database technology and information services: From wishful thinking to viable engineering. In Proceedings of the 28th International Conference on Very Large Data Bases, VLDB ’02, pages 20–31. VLDB Endowment, 2002. → pages 37

[164] S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, and C. Maltzahn. Ceph: A scalable, high-performance distributed file system. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation, OSDI ’06, pages 307–320, Berkeley, CA, USA, 2006. USENIX Association. ISBN 1-931971-47-1. → pages 2, 102

[165] G. Wiederhold. What is your software worth? Communications of the ACM, 49(9):65–75, September 2006. ISSN 0001-0782. doi:10.1145/1151030.1151031. → pages 107

[166] M. Wilde, M. Hategan, J. M. Wozniak, B. Clifford, D. S. Katz, and I. T. Foster. Swift: A language for distributed parallel scripting. Parallel Computing, 37(9):633–652, 2011. → pages 26, 30, 43, 55

[167] J. Wilkes. Market Oriented Grid and Utility Computing, chapter 4: Utility functions, prices, and negotiation. John Wiley & Sons, Inc., October 2008. ISBN 978-0-470-28768-2. → pages 4, 45, 97

[168] L. G. Williams and C. U. Smith. Performance evaluation of software architectures. In Proceedings of the 1st International Workshop on Software and Performance, pages 164–177. ACM, 1998. → pages 122

[169] J. M. Wozniak and M. Wilde. Case studies in storage access by loosely coupled petascale applications. In Proceedings of the 4th Annual Workshop on Petascale Data Storage, PDSW ’09, pages 16–20, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-883-4. → pages 24, 26, 27, 43, 44, 62, 63

[170] H. Yang. Energy Prediction for I/O Intensive Workflow Applications. Master’s thesis, The University of British Columbia, September 2014. → pages 93

[171] H. Yang, L. B. Costa, and M. Ripeanu. Energy prediction for I/O intensive workflows. In Proceedings of the 7th Workshop on Many-Task Computing on Clouds, Grids, and Supercomputers, MTAGS ’14.
ACM, November 2014. → pages v, 12, 15, 21, 43, 84, 87, 91, 161, 163

[172] T. Yang, H. Jiang, D. Feng, Z. Niu, K. Zhou, and Y. Wan. DEBAR: A scalable high-performance deduplication storage system for backup and archiving. In IEEE International Symposium on Parallel Distributed Processing (IPDPS), pages 1–12, April 2010. doi:10.1109/IPDPS.2010.5470468. → pages 126, 150

[173] T. Ye and S. Kalyanaraman. A Recursive Random Search Algorithm for Large-scale Network Parameter Configuration. In Proceedings of the 2003 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’03, pages 196–205, New York, NY, USA, 2003. ACM. ISBN 1-58113-664-1. doi:10.1145/781027.781052. → pages 164

[174] Z. Zhang, D. S. Katz, J. M. Wozniak, A. Espinosa, and I. Foster. Design and analysis of data management in scalable parallel scripting. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’12, pages 85:1–85:11, Los Alamitos, CA, USA, 2012. IEEE Computer Society Press. ISBN 978-1-4673-0804-5. → pages 3, 17, 24, 69

[175] Z. Zhang, D. S. Katz, M. Wilde, J. M. Wozniak, and I. T. Foster. MTC Envelope: Defining the Capability of Large Scale Computers in the Context of Parallel Scripting Applications. In Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing, HPDC ’13, pages 37–48, June 2013. → pages 75, 95

Appendix A

Research Collaborations

During my PhD studies, besides conducting the research presented in this dissertation, I collaborated with other researchers on various projects. The Preface presents the publications directly related to the research described in this dissertation.
Below, I briefly describe each of these projects and list the resulting publications.

TOTEM

TOTEM is a graph-processing solution for many graph-based applications, such as social networks and web analysis. As the memory footprint required to represent the graphs grows, they become more challenging to process efficiently. TOTEM's goal is to leverage commodity hybrid platforms (i.e., platforms built with processors optimized for sequential processing and accelerators optimized for massively parallel processing, such as Graphics Processing Units (GPUs)) to accelerate large-scale graph processing at low cost. The long-term goal is to leverage the different characteristics of these two types of processors (sequential processors and parallel accelerators) to provide a platform that is neither expensive (e.g., supercomputers) nor inefficient (e.g., commodity clusters).

Abdullah Gharaibeh leads this project, which also includes the collaboration of Elizeu Santos-Neto and Matei Ripeanu. The results of this research have been published in the following:

Efficient Large-Scale Graph Processing on Hybrid CPU and GPU Systems [81]. Abdullah Gharaibeh, Elizeu Santos-Neto, Lauro Beltrão Costa, and Matei Ripeanu. Journal submission. Pages 1–14. Under review; submitted in January 2014.

The Energy Case for Graph Processing on Hybrid CPU and GPU Systems [80]. Abdullah Gharaibeh, Elizeu Santos-Neto, Lauro Beltrão Costa, and Matei Ripeanu. In Proceedings of the Workshop on Irregular Applications: Architectures & Algorithms (IA3), held in conjunction with SuperComputing ’13. ACM, November 2013.

On Graphs, GPUs, and Blind Dating: A Workload to Processor Matchmaking Quest [79]. Abdullah Gharaibeh, Lauro Beltrão Costa, Elizeu Santos-Neto, and Matei Ripeanu. 27th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2013). Pages 851–862. Acceptance rate: 21%. IEEE, May 2013.

A Yoke of Oxen and a Thousand Chickens for Heavy Lifting Graph Processing [78].
Abdullah Gharaibeh, Lauro Beltrão Costa, Elizeu Santos-Neto, and Matei Ripeanu. In Proceedings of the IEEE/ACM International Conference on Parallel Architectures and Compilation Techniques (PACT 2012). Pages 345–354. Acceptance rate: 19%. ACM, September 2012.

GPU Support for Batch Oriented Workloads

I led this project during 2009, in collaboration with Samer Al-Kiswany and Matei Ripeanu. The project explores the ability to use GPUs as co-processors to harness the inherent parallelism of batch operations. Specifically, we chose Bloom filters (space-efficient data structures that support a probabilistic representation of set membership), since the queries these data structures support are often performed in batches. We implemented BloomGPU, an open-source library that supports offloading Bloom filter operations to the GPU and outperforms an optimized CPU implementation of the Bloom filter for large workloads. The results of this work were published in:

GPU Support for Batch Oriented Workloads [53]. Lauro Beltrão Costa, Samer Al-Kiswany, and Matei Ripeanu. In Proceedings of the IEEE 28th International Performance Computing and Communications Conference (IPCCC ’09). Pages 231–238. Acceptance rate: 29.7%. IEEE, December 2009.

NodeWiz Grid Information Service

Large-scale grid computing systems provide a multitude of services, from different providers, whose quality of service varies. These services are deployed and undeployed in the grid with no central coordination, requiring a Grid Information Service (GIS) to help other grid entities find the most suitable set of resources on which to deploy their services. NodeWiz is a GIS that allows multi-attribute range queries to be performed efficiently in a distributed manner, while maintaining load balance and resilience to failures.

This work is a result of a collaboration with Hewlett-Packard Labs and the Distributed Systems Lab at the Universidade Federal de Campina Grande (UFCG).
Most of the work related to this project was carried out before I joined UBC, including a working version of the system that was used in production with OurGrid [50]. Since the start of my PhD studies at UBC, the project has resulted in the following publication:

NodeWiz: Fault-tolerant grid information service [26]. Sujoy Basu, Lauro Beltrão Costa, Francisco Brasileiro, Sujata Banerjee, Puneet Sharma, and S.-J. Lee. Journal of Peer-to-Peer Networking and Applications, 2(4):348–366. Springer, December 2009.

