UBC Theses and Dissertations

Energy prediction for I/O intensive workflow applications Yang, Hao 2014


Energy Prediction for I/O Intensive Workflow Applications

by

Hao Yang

B.E., Huazhong University of Science and Technology, 2011

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in The Faculty of Graduate and Postdoctoral Studies (Electrical and Computer Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

September 2014

© Hao Yang 2014

Abstract

As workflow-based data-intensive applications have become increasingly popular, the lack of support tools to aid resource provisioning decisions, to estimate the energy cost of running such applications, or simply to support configuration choices has become increasingly evident. The goal of this thesis is to design techniques and tools to predict the energy consumption of these workflow-based applications, evaluate different optimization techniques from an energy perspective, and explore energy/performance tradeoffs.

This thesis proposes a methodology to predict the energy consumption of workflow applications. More concretely, it makes three key contributions: First, it proposes a simple analytical energy consumption model that enables adequately accurate energy consumption predictions. This makes it possible not only to estimate energy consumption but also to reason about the relative benefits different system configuration and provisioning decisions offer. Second, an empirical evaluation of energy consumption is carried out using synthetic benchmarks and real workflow applications. This evaluation quantifies the energy savings of performance optimizations for the distributed storage system as well as the energy and performance impact of power-centric tuning techniques.
Third, it demonstrates the predictor's ability to expose energy-performance tradeoffs for the synthetic benchmarks and workflow applications by evaluating the accuracy of the energy consumption predictions. Overall, the prediction obtained an average accuracy of more than 85% and a median of 90% across different scenarios, while using more than 200x fewer resources than running the actual applications.

Preface

This dissertation is based on the energy prediction work that I led during my M.A.Sc. studies, in collaboration with Lauro Beltrão Costa and Matei Ripeanu. I was the main author of the work, and responsible for coming up with the research methodology, evaluating it, analyzing the results, and writing academic papers. The research projects that are the building blocks for this work are described in Chapter 1. Based on the research presented in this dissertation, part of the material has been submitted for the following publication.

Energy Prediction for I/O Intensive Workflow Applications. Hao Yang, Lauro Beltrão Costa, and Matei Ripeanu [56].

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
List of Acronyms
Acknowledgements
Dedication
1 Introduction
  1.1 Motivation
  1.2 Proposed Approach
  1.3 Research Projects as Building Blocks for This Work
    1.3.1 MosaStore Storage System
    1.3.2 Research Projects as Building Blocks for This Work
  1.4 Contributions
  1.5 Dissertation Structure
2 Background
  2.1 Many-Task Applications
  2.2 Supporting Middleware and Usage Scenario
    2.2.1 Backend Storage and Intermediate Storage
    2.2.2 MosaStore Intermediate Storage System
    2.2.3 Configuration Decisions
  2.3 The Discrete-event Performance Predictor
    2.3.1 The System Model
    2.3.2 Model Seeding: System Identification
    2.3.3 Workload Description
3 The Design of the Energy Consumption Predictor
  3.1 Requirements
  3.2 Energy Model Description
  3.3 Energy Model Seeding
  3.4 Implementation
4 Evaluation
  4.1 Experimental Setup and Platform
    4.1.1 Storage System
    4.1.2 Testbed and Power Meters
    4.1.3 Evaluation Metrics
  4.2 Synthetic Benchmarks: Workflow Patterns
    4.2.1 Evaluating Energy Prediction Accuracy on DSS
    4.2.2 Evaluating Energy Prediction Accuracy on WOSS
    4.2.3 Summary
  4.3 Predicting the Energy Envelope of Real Applications
    4.3.1 BLAST Results
    4.3.2 Increasing the Workflow Complexity: Montage Results
    4.3.3 Summary
  4.4 Predicting the Energy Impact of Power-centric Tuning
  4.5 Predicting Energy-Performance Tradeoffs
5 Discussion
  5.1 What Are the Causes of Inaccuracies?
  5.2 What to Do to Improve Accuracy?
  5.3 What Is the Advantage of Using the Proposed Energy Model Compared With Others?
  5.4 Optimizing for Time vs. Optimizing for Energy
6 Related Work
7 Conclusion and Future Work
  7.1 Conclusion
  7.2 Future Work
Bibliography

List of Tables

2.1 Platform Performance Parameters
3.1 Platform Power Parameters
4.1 Platform power parameters for Taurus cluster. These values change when power tuning techniques are applied or other clusters are considered.
4.2 Metadata attributes and the corresponding optimizations
4.3 Characteristics of small Montage workload
4.4 Characteristics of large Montage workload
4.5 Per-stage results of large Montage workflow workload

List of Figures

2.1 Problem types with respect to data size and number of tasks [45]
2.2 Architecture of Blue Gene/P Supercomputer [49]. This tiered setup shows the bottleneck due to the limited bandwidth between the compute nodes and the storage nodes.
2.3 High level architecture of a workflow system. The shared intermediate storage harnesses the storage space of the participating compute nodes and provides a low latency shared storage space. The input/output data is staged in/out from the backend storage. Depending on configuration, each node may also run a storage service that contributes space to the intermediate storage (i.e., storage services and computing tasks may be collocated) [12].
2.4 MosaStore Storage System Architecture. There are three high-level components presented: the System Access Interface (SAI) acting as the storage client; the storage nodes that store file chunks; the manager that stores file metadata [50].
2.5 Cross-layer communication as proposed by Al-Kiswany [16]. (i) The solid lines show the chunk allocation initiated by the client, and later processed by the pattern-specific data placement modules. (ii) The dashed lines show the file location requests made by the workflow scheduler.
2.6 Breakdown of broadcast benchmark presented by Costa et al. [28]. In this benchmark an input file is staged-in to the intermediate storage, then in the first stage one client reads it and produces an intermediate file. In the second stage, the intermediate file is read by processes running in parallel on different clients. Each of these processes writes its output independently on the intermediate storage and later stages-out the output. The figure shows the time to create replicas, to execute the actual workflow and the total time.
2.7 The queue-based model: The application driver replays the workflow trace to emulate the workflow execution.
3.1 The predictor receives the application description, platform performance and power characteristics. The performance predictor estimates the time for several events in the system and passes this information to a module that uses power characteristics of the platform and uses the energy model to estimate the energy consumption.
4.1 Pipeline, Reduce and Broadcast benchmarks. Circles represent a workflow task performing CPU processing using stress [6] and arrows represent data transfers among stages. The labels on the arrows represent the file sizes used in the benchmarks.
4.2 Actual and predicted average energy consumption for pipeline, reduce, broadcast benchmarks on DSS.
4.3 Actual and predicted average energy consumption for the pipeline benchmark.
4.4 Actual and predicted average energy consumption for the reduce benchmark.
4.5 Actual and predicted average energy consumption for the broadcast benchmark.
4.6 BLAST workflow. All nodes search BLAST database (1.8GB) in parallel.
4.7 Actual and predicted average energy consumption and execution time for BLAST.
4.8 Montage workflow
4.9 Actual and predicted average energy consumption and execution time for small Montage workload.
4.10 Actual and predicted average energy consumption and execution time for large Montage workload.
4.11 Actual and predicted average energy consumption and execution time for BLAST for various CPU frequencies.
4.12 Actual and predicted average energy consumption and execution time for the pipeline benchmark for various CPU frequencies.
4.13 Actual and predicted Montage energy-delay product (EDP) at the various scales the experiments can be executed in 'Taurus' cluster.
4.14 Actual and predicted Montage energy consumption and performance at the various scales the experiments can be executed in 'Taurus' cluster. The numbers in the plot represent the number of allocated nodes in the executed scenarios.
4.15 Actual and predicted Montage energy-delay product (EDP) on up to 15 nodes in 'Sagittaire' cluster.
4.16 Actual and predicted Montage energy consumption and performance at the various scales in 'Sagittaire' cluster. The numbers in the plot represent the number of allocated nodes in the executed scenarios.
4.17 Predicted Montage energy-delay product (EDP) on up to 50 nodes in a hypothetical cluster in which idle power is 10% of the peak.
List of Acronyms

BLAST    The Basic Local Alignment Search Tool
DSS      Default Storage System
EDP      Energy Delay Product
FITS     Flexible Image Transport System
FIFO     First In, First Out
FUSE     Filesystem in Userspace
HPC      High Performance Computing
MPI      Message Passing Interface
MTC      Many Task Computing
POSIX    Portable Operating System Interface
SAI      System Access Interface
WOSS     Workflow-Optimized Storage System

Acknowledgements

First of all, I would like to thank my advisor Dr. Matei Ripeanu for his support during my graduate study. His insightful advice, patience and continuous support have carried me throughout this journey. What I accomplished during my MASc study, as well as the improvement of my reasoning and critical thinking, would not have been possible without his guidance.

I also want to thank my collaborators Lauro Beltrão Costa, Samer Al-Kiswany, Emalayan Vairavanathan, Ketan Maheshwari, and Abmar Barros for their advice, suggestions and efforts on the projects we worked on together. The projects would never have been accomplished without their contribution. I want to thank Prof. Karthik Pattabiraman, Prof. Sathish Gopalakrishnan and other members of the Computer Systems Reading Group (CSRG) for the weekly paper discussions and for giving me the opportunity to have a glimpse of different research areas. I want to extend my thanks as well to my labmates Bo Fang, Elizeu Santos-Neto, Abdullah Gharaibeh and others for creating a wonderful lab environment and for the enjoyable discussions we had.

I am grateful to my friends in Canada, China and other places in the world. I appreciate the encouragement and friendship they provide.

Last but not least, special thanks to my family. To me they are the foremost and everlasting source of support. I would never be who I am today without my parents. I am also grateful to my girlfriend.
I deeply appreciate their unconditional love and understanding of my decisions (including going to Canada to pursue my master's degree).

Dedication

To my family.

Chapter 1
Introduction

1.1 Motivation

Scientific investigation relies increasingly on workflow-based data-intensive applications. These workflow applications are typically assembled using different standalone binaries, which communicate via temporary files stored on a distributed storage system [44]. A workflow management system schedules the tasks resulting from the execution of these binaries once the tasks they depend on have completed [54].

In this setup, the performance of the storage system plays a key role in the overall workflow performance [27, 51]. In fact, storage systems have evolved to incorporate advanced techniques that enable trade-offs over interrelated performance metrics such as throughput, reliability, and generated I/O overhead, to best match the application/deployment scenario at hand [11, 14]. At the same time, user/administrator decisions involve allocating resources (e.g., total number of nodes) and configuring the storage system (e.g., chunk size, data placement policy, and replication level).

Consequently, configuring and provisioning the system entails searching a complex multi-dimensional configuration space. In this space, generally the user's goal is "to optimize a multi-objective problem, in at least two dimensions: maximize performance (e.g., reduce application execution time) while minimizing cost (e.g., reduce the total number of CPU hours or dollar amount spent)" [27].

While the trade-off space over these traditional performance metrics has been extensively studied over the past decades, performance with regard to energy efficiency is relatively new. Moreover, this aspect has grown in importance due to its impact on cost or even the feasibility of building large (exascale) data-centers/supercomputers [19].
There have been increasing initiatives in both industry and academia to drive energy-efficient optimizations, system design, modeling approaches and other energy-oriented aspects [5, 10, 18, 23, 31, 33].

This context presents three main questions: First, in what scenarios, if any, do existing optimization techniques lead to energy savings? Second, what is the performance and energy sensitivity, in terms of time-to-solution and energy consumption, of power-centric tuning? Third, how can users balance time-to-solution and energy consumption when given a target application and its inputs? The goal of this thesis is to answer these questions and additionally build tools to support the user/administrator in the complex and relatively unexplored task of determining the desired balance between an application's time-to-solution and its energy bill.

1.2 Proposed Approach

The focus of this thesis is to devise an energy prediction mechanism that is able to predict the application energy consumption given the resource allocation and storage system configuration. To help users make system configuration, resource allocation, and performance/energy balance decisions, the mechanism should be time efficient (e.g., short runtime to search the configuration space) and easy to use (e.g., it should not require complex system instrumentation or seeding measurements). Section 3.1 discusses the detailed requirements.

This work follows a two-pronged approach to build the mechanism: First, a coarse-grained energy consumption model is built to link the energy characteristics of the underlying computing platform and those of the workflow application. The energy model captures different coarse-grained application states: idle state, application processing, storage operations, and network transfers.
The relatively high-level model enables time-efficient predictions for different configuration scenarios while still providing reasonable accuracy.

Second, this work augments an application performance predictor [27] our group has built (and I have contributed to) with the energy model to enable energy consumption predictions. The predictor uses a queue-based model to simulate the time each participating machine of the distributed platform spends in the aforementioned power states. Synthetic benchmarks are used to gather the system service times of the platform and the power profiles of the corresponding application states. To seed the energy estimation model, the energy predictor takes as inputs the underlying computing platform's performance and power characteristics and the application traces logged by the storage system. Based on the workload trace, the predictor generates the I/O events to be processed and drives the prediction. It predicts the time-to-solution and energy consumption of a real application execution given the resource allocation setting and storage system configuration.

The proposed approach combines a simple analytical model and a simple seeding process with the simulation model of a performance predictor, and obtains sufficient accuracy. Compared to a purely analytical approach that focuses on a single-node setup, the approach targets a distributed environment. Compared to a detailed simulation approach that considers low-level system events (e.g., simulating each network packet), this approach is more lightweight and does not require kernel or storage system changes to enable prediction. Also, it still achieves sufficient accuracy to reason about different configuration decisions and performance-energy tradeoffs.
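The coarse-grained model described above (time spent in each power state, combined with the power drawn in that state) can be sketched as follows. This is only an illustration under stated assumptions: the thesis defines its actual model in Chapter 3, and the per-state sum, the names `NodeTimes`, `node_energy`, `total_energy`, and all wattage and time values below are hypothetical.

```python
from dataclasses import dataclass

# The four coarse-grained states named in the text.
STATES = ("idle", "processing", "storage", "network")


@dataclass
class NodeTimes:
    """Seconds one node spends in each state (e.g., as estimated by a performance predictor)."""
    idle: float
    processing: float
    storage: float
    network: float


def node_energy(times, power_watts):
    """Energy in joules for one node: sum over states of power (W) * time (s)."""
    return sum(power_watts[state] * getattr(times, state) for state in STATES)


def total_energy(nodes, power_watts):
    """Energy in joules for the whole allocation: sum of the per-node energies."""
    return sum(node_energy(node, power_watts) for node in nodes)


# Hypothetical power profile (watts per state) and per-node time estimates (seconds).
power = {"idle": 90.0, "processing": 125.0, "storage": 100.0, "network": 105.0}
nodes = [
    NodeTimes(idle=10.0, processing=40.0, storage=5.0, network=5.0),
    NodeTimes(idle=30.0, processing=20.0, storage=5.0, network=5.0),
]

print(total_energy(nodes, power))  # 13150.0 joules for these made-up inputs
```

Because the time estimates come from simulation rather than from running the application, a sketch like this can be re-evaluated cheaply for many candidate allocations and configurations.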
1.3 Research Projects as Building Blocks for This Work

1.3.1 MosaStore Storage System

MosaStore (http://www.mosastore.net) is a research project that builds a versatile storage system that "harnesses resources from network-connected machines to build a high-performance, yet low cost, storage system" [14].

I have collaborated with Samer Al-Kiswany, Lauro Beltrão Costa, Emalayan Vairavanathan, Abmar Barros, and Matei Ripeanu on this project. I was one of the main designers and developers involved in the project. I implemented key modules, functional tests, unit tests, and synthetic benchmarks for the system, and I was involved in peer code review, improvements to the automated testing framework, application integration, and debugging. The direct contribution of this project is an open source working prototype, and more importantly, this prototype serves as the experimental vehicle for this thesis.

1.3.2 Research Projects as Building Blocks for This Work

This thesis proposes an energy consumption model that uses power states and time estimates in each state to predict the overall energy consumption. The proposed methodology requires several essential building blocks: a performance prediction mechanism that can give time estimates to the energy prediction model, and an evaluation platform that can run workflow applications efficiently and supports the versatile configurations under which the accuracy of the proposed energy prediction mechanism is evaluated. This section describes my contributions to these building blocks, which together form the experimental framework for this thesis. The following text presents the description of these projects, their relationship to my thesis, my contributions, and my publications related to these projects.

Intermediate Storage Prediction and Provisioning for Workflow Applications.
Users have to make system provisioning, resource allocation, and configuration decisions for I/O-intensive workflow applications. To enable selecting a good choice in a reasonable time, this project proposes an approach that accelerates the exploration of the configuration space based on a low cost performance predictor that estimates the total execution time of a workflow application in a given setup.

Relationship to my thesis: The performance predictor built by this project gives the essential inputs for the energy model proposed in my thesis. My thesis augments the performance predictor with fine-grained time estimates of different power states. Also, my thesis evaluates the configurations visited in this project from an energy perspective.

My contributions were: proposing evaluation scenarios, designing system improvements that support this project, and validating the performance predictor.

The results of this research have been published by Costa et al. [26], [27], [24].

A Workflow-Optimized Storage System using Cross-Layer Optimization. This project proposes using file system custom metadata as a bidirectional communication channel between applications and the storage middleware. This channel can be used to pass hints that enable cross-layer optimizations.

Relationship to my thesis: This project built an intermediate storage prototype that my thesis uses as the main evaluation platform. The experience gained in this project helped me evaluate the proposed energy prediction mechanism using synthetic benchmarks and real applications in my thesis.

My contributions were: architecting and implementing system modules (e.g., read module optimizations), validating the proposed workflow-optimized storage system, debugging the system integration with applications, and proposing evaluation scenarios.

The results of this work were published by Al-Kiswany et al. [13], [12], [16], and Costa et al. [28].

Evaluating Storage Systems for Scientific Data in the Cloud.
Infrastructure-as-a-Service (IaaS) clouds are an appealing resource for scientific computing. This project investigates the capabilities of a variety of POSIX-accessible distributed storage systems to manage data access patterns resulting from workflow application executions in the cloud.

Relationship to my thesis: This project improved my understanding of the storage system that is used by my thesis, and it helped the integration of the storage system evaluated in my thesis with the workflow applications my thesis targets.

My contributions were: packaging the MosaStore system as a reusable cloud image, and comparing and analyzing the performance of different storage systems.

The result of this work was published by Maheshwari et al. [41].

1.4 Contributions

This thesis makes the following contributions:

1. Proposes a simple analytical energy consumption model that enables adequately accurate energy consumption predictions. This makes it possible not only to estimate energy consumption but also to reason about the relative benefits different system configuration and provisioning decisions offer.

2. Carries out an empirical evaluation of energy consumption using synthetic benchmarks and real workflow applications. This evaluation quantifies the energy savings of performance optimizations for the distributed storage system as well as the energy and performance impact of power-centric tuning techniques.

3. Demonstrates the predictor's ability to expose energy-performance tradeoffs for the synthetic benchmarks and workflow applications by evaluating the accuracy of the energy consumption predictions.
Overall, the prediction obtained an average accuracy of more than 85% and a median of 90% across different scenarios, while using more than 200x fewer resources than running the actual applications.

1.5 Dissertation Structure

The rest of this dissertation is organized as follows: Chapter 2 discusses the target applications, supporting middleware, and the performance predictor that is used as the starting point of this thesis. Chapter 3 details the design of the energy predictor. Chapter 4 evaluates the energy predictor using synthetic benchmarks and real-world applications under different configuration choices. Chapter 5 discusses prediction inaccuracies and performance optimization vs. energy optimization, while Chapter 6 describes previous research related to this work. Finally, Chapter 7 concludes this dissertation and proposes some future directions.

Chapter 2
Background

This chapter discusses the application domain this work targets, workflow applications, and the underlying intermediate storage layer for workflow application execution. Then it explains the rationale for using intermediate storage systems and presents an example of such a system that our group built. Finally, it discusses the performance predictor that predicts the workflow application's execution time.

2.1 Many-Task Applications

Scientists from various fields such as astronomy, astrophysics, chemistry, and pharmaceuticals are dealing with an increasing amount of data. Based on the input data size and the number of tasks, the problem space can be partitioned into four main categories [45] (Figure 2.1).

To process a small number of tasks with a small input size, tightly coupled Message Passing Interface (MPI) applications are often used. On the other hand, data analytics workloads such as data mining, which handle a large amount of data, often adopt MapReduce [29].
When dealing with a large number of tasks and increasingly large input sizes, another widely adopted approach to support scientific workflow applications is the many-task approach [44]. Many-Task Computing (MTC) keeps the data size of an individual task modest; however, it handles a large number of tasks and large datasets. Some MTC applications can have simple workflows (e.g., BLAST, The Basic Local Alignment Search Tool [17]), while others have multiple workflow stages and various data access patterns (e.g., Montage [40], an astronomy application that assembles Flexible Image Transport System (FITS) images into custom mosaics).

Figure 2.1: Problem types with respect to data size and number of tasks [45]

Many-task workflows use a loose task-coupling model: the standalone executables that compose the workflow communicate through temporary files stored in a shared storage system. This model offers several advantages when running on large-scale computing equipment: (1) Rapid development. Researchers can use legacy application binaries and write in a high-level scripting language such as Swift [54] to produce highly parallel executions. (2) Easy to deploy. Because of the loosely coupled task model, the tasks can be scheduled in a distributed system easily as long as the storage is shared. Typically the independent tasks of a workflow can be scheduled arbitrarily on the allocated machines. (3) Fault tolerance. Since MTC tasks communicate via file system operations (e.g., producing an intermediate result file, reading a file written by a previous stage), it is natural and easy to resume the workflow using the intermediate files on the shared storage in the face of failure. However, there are several drawbacks related to the execution model. For instance, the workflow tasks that exhibit different patterns (e.g., metadata operation rate, I/O volume and frequency) are mapped uniformly to the underlying shared file systems.
The generality of the approach leads to poor performance in some cases.

Several research efforts propose different approaches to improve the performance of executing many-task applications. Some past work benchmarks key performance metrics of shared file systems in order to understand and alleviate the performance bottleneck [59], while other work proposes using POSIX extended file attributes in the supporting storage to enable per-file optimized operations that improve application performance [12].

2.2 Supporting Middleware and Usage Scenario

2.2.1 Backend Storage and Intermediate Storage

To efficiently run HPC workloads, scientists use traditional scientific environments such as clusters, grids and supercomputers (e.g., IBM Blue Gene/P supercomputers [3]), but also emerging computing platforms such as cloud environments (e.g., Amazon EC2 [1]). Traditionally, scientific applications are supported on generic distributed storage systems including GPFS [8] and Lustre [9]. These systems provide reliable and secure support to most applications; however, without application-specific optimizations they offer limited performance to scientific workflows. As shown in Figure 2.2, in the Blue Gene/P supercomputer, when a compute node (storage client) tries to write or retrieve files from the file system, it sends I/O requests to I/O nodes, then the I/O nodes forward the requests to file server nodes. Additionally, these generic distributed storage systems provide strong consistency semantics and thus sacrifice performance (e.g., GPFS performs poorly when files are under the same directory [58]).

[Figure 2.2 diagram labels: 40960 compute nodes (160K cores); torus network, 6.4 Gbps per link; tree network (850 MBps x 640); 640 I/O nodes; 10 Gbps switch complex (10 Gb/s x 128); GPFS deployed on 128 file server nodes (3 petabytes storage capacity)]

Figure 2.2: Architecture of Blue Gene/P Supercomputer [49].
This tiered setup shows the bottleneck due to the limited bandwidth between the compute nodes and the storage nodes.

To avoid the latency of accessing the backend storage system, and to trade consistency for performance, recent work [14, 20, 55] proposes using an in-memory shared storage layer as an intermediate storage system among the compute nodes themselves for inter-task communication (Figure 2.3). Each compute node that participates in the workflow execution contributes its local storage to form a shared intermediate storage. At the beginning of the execution, the input files are staged in from the file server nodes to the compute nodes. During the execution, the workflow runtime engine schedules workflow tasks to the compute nodes; these tasks produce intermediate files in the shared storage, and the files are consumed by later workflow tasks. When the execution finishes, the final output files are staged out from the intermediate storage back to the backend storage system. The intermediate storage therefore provides a high-performance storage abstraction to support the workflow execution.

[Figure 2.3 labels: Workflow Runtime Engine; Compute Nodes, each with an App. Task and Local storage; Shared Intermediate Storage; arrows labeled "Schedules workflow tasks" and "Progress update"]

Figure 2.3: High-level architecture of a workflow system. The shared intermediate storage harnesses the storage space of the participating compute nodes and provides a low-latency shared storage space. The input/output data is staged in/out from the backend storage. Depending on configuration, each node may also run a storage service that contributes space to the intermediate storage (i.e., storage services and computing tasks may be collocated) [12].

This relatively simple execution model has allowed assembling complex workflow applications and executing them on large shared-nothing infrastructures.
For example, Montage is an image processing workflow that assembles tens of different standalone executables, generates tens of thousands of independent tasks, processes hundreds of GBs of data, and routinely uses hundreds of cluster nodes [40].

2.2.2 MosaStore Intermediate Storage System^2

^2 Chapter 1 presents my contribution to MosaStore and to the resulting publications.

This section describes MosaStore, a highly configurable intermediate storage system designed to harness storage space from connected machines and build a high-performance, scalable storage layer for different application environments.

The direct result of the MosaStore project is a working prototype, which serves as the vehicle for this thesis and other research projects. It has been used by multiple institutions in different projects. Besides the workflow application domain that this thesis targets, MosaStore is also used in a number of other domains: configurable security [38], checkpointing for desktop grid computing [15], and data deduplication [25].

I was actively involved in the Intermediate Storage Prediction and Provisioning for Workflow Applications project [24, 26, 27], the Using Cross-Layer Optimization in Workflow Optimized Storage project [12, 13, 16, 28], the Evaluating Storage Systems for Scientific Data in the Cloud project [41], and the energy consumption prediction project presented in this thesis [56]. MosaStore is the main research and evaluation platform for the work presented in this thesis.

MosaStore Architecture

MosaStore [14, 51] is an object-based distributed storage system that implements the aforementioned execution model and uses cross-layer optimization between the workflow scheduler and the storage system to further improve the application’s performance. Figure 2.4 shows the architecture of MosaStore.
It has three major components: the centralized metadata server, the donor nodes (storage nodes) that store data chunks, and the System Access Interface (SAI) at the client side, which offers the I/O interface. The following text describes the three components. A more complete description can be found in Al-Kiswany et al. [50].

[Figure 2.4 labels: Application; System Access Interface - 1 (Chunk_1 info to Chunk_4 info); Donor node - 1 (Ext-3 file system), shown twice; Manager (Root/project/file_1); Control messages; Data messages; Metadata messages]

Figure 2.4: MosaStore Storage System Architecture. There are three high-level components presented: the System Access Interface (SAI), acting as the storage client; the storage nodes that store file chunks; and the manager that stores file metadata [50].

• The metadata manager. The manager maintains the persistent metadata information about the files and directories in the system. It is stateless (e.g., it does not keep track of the data cached at the client side or maintain the list of open files). After the clients retrieve the metadata information about the files they need, they can directly initiate parallel I/O requests with the storage nodes. This design of the metadata service greatly reduces development complexity while improving system scalability. The manager uses the NDBM library [4] to store the metadata information.

• The storage node. The storage nodes contribute storage space (memory or disk based) to the shared distributed storage. They publish their status to the manager using a soft-state registration process. They also serve the I/O requests of the clients and participate in the garbage collection mechanism based on an epidemic protocol.

• The System Access Interface (SAI). The SAI is a user-level file system implementation that uses the FUSE kernel module [2]. It provides a Portable Operating System Interface (POSIX) API. The design principle is to improve the performance of the application execution and not necessarily to offer strong consistency semantics. Thus, it relaxes some of the POSIX system calls (e.g., fsync, fstatfs) while optimizing the major system calls (e.g., open, read, write).

File-Level Configuration

MosaStore exposes multiple global configuration parameters, such as the deployment media (e.g., RAM, disks), the chunk size (files are split into chunks for high-performance I/O), and the maximum allowed per-node storage space. More importantly, MosaStore enables per-file configuration, which leads to substantial performance benefits when an application has multiple file access patterns. For instance, when a large file is read multiple times by the same client during the execution, users can increase the cache size for this file to avoid remote fetches from storage nodes.

Figure 2.5: Cross-layer communication as proposed by Al-Kiswany [16]. (i) The solid lines show the chunk allocation initiated by the client and later processed by the pattern-specific data placement modules. (ii) The dashed lines show the file location requests made by the workflow scheduler.

To enable per-file configuration, MosaStore uses POSIX extended attributes of files to record their access patterns. Figure 2.5 shows the design of cross-layer communication in MosaStore. From the client’s perspective, it can give MosaStore hints about the data access patterns of the files that it will later consume, setting the hints through the extended attributes. After receiving the attribute set call, MosaStore uses the corresponding data placement policies for the files instead of a default policy.
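The client side of this hint mechanism can be illustrated with a minimal Python sketch, assuming a Linux host and a mounted file system that accepts user extended attributes. The attribute name `user.access_pattern`, the value `pipeline` and the helper names are hypothetical; this section does not list the tags that MosaStore actually recognizes.

```python
import os

# Hypothetical attribute name; the real MosaStore tags are not given here.
HINT_ATTR = "user.access_pattern"


def encode_hint(pattern: str) -> bytes:
    """Extended attribute values are byte strings."""
    return pattern.encode("utf-8")


def set_access_hint(path: str, pattern: str) -> bool:
    """Tag a file with its expected access pattern.

    Returns False when the file is missing or when the platform or file
    system does not support user extended attributes (os.setxattr is
    only available on Linux).
    """
    try:
        os.setxattr(path, HINT_ATTR, encode_hint(pattern))
        return True
    except (AttributeError, OSError):
        return False
```

A client would call something like `set_access_hint("/mnt/intermediate/stage1.out", "pipeline")` (a hypothetical path) before writing the file, so that the storage system sees the hint when it chooses a placement policy.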
From the runtime scheduler’s perspective, it can retrieve the file location and other information stored in the extended attributes from MosaStore and thus make optimized scheduling decisions (e.g., placing the upcoming computation on the node that stores the input files). The cross-layer communication improves workflow performance because it enables data access optimizations at file-level granularity. Previous work [16, 51] has shown that MosaStore, as an optimized intermediate storage, can significantly reduce the execution time of complex workflow applications compared with the default backend file systems.

2.2.3 Configuration Decisions

Figure 2.6: Breakdown of the broadcast benchmark presented by Costa et al. [28]. In this benchmark, an input file is staged in to the intermediate storage; then, in the first stage, one client reads it and produces an intermediate file. In the second stage, the intermediate file is read by processes running in parallel on different clients. Each of these processes writes its output independently to the intermediate storage and later stages out the output. The figure shows the time to create replicas, the time to execute the actual workflow, and the total time.

Despite the advantages brought by this workflow application execution model, many configuration choices (e.g., data placement policies, file chunk size, number of nodes to allocate to storage) [28] must be made for the application execution. Moreover, different workflow applications achieve optimal performance with different configuration choices [11, 14]. Figure 2.6 highlights the problem using a broadcast benchmark. As one increases the number of replicas of the intermediate file, the time to create all the replicas increases; however, the workflow time decreases because more replicas in the system provide more data access points.
The figure shows that the total execution time reaches a minimum when the storage system sets the number of replicas to 8, a value that is not known beforehand. There exists a configuration space in which system configuration parameters can be tuned, and different parameter settings can lead to substantial performance differences.

Exploring the configuration space via application runs is time consuming and costs substantial computational resources. As a result, there has been increasing demand for time-efficient and lightweight approaches to making optimized configuration and resource provisioning decisions. To this end, our group built a performance predictor to efficiently predict the application performance given a certain resource and storage configuration. The next section presents the performance predictor.

2.3 The Discrete-event Performance Predictor^3

^3 Chapter 1 presents my contribution to the performance predictor and to the resulting publications.

As shown in previous sections, the time-to-solution of a workflow application can vary significantly depending on the configuration choices. However, manually exhausting all the configuration scenarios is both time and resource consuming. Costa et al. [27] build a performance predictor in the context of workflow applications that addresses the problem of exploring the configuration space and supporting configuration decisions. Since this performance predictor is the starting point for this work on predicting the energy consumption of workflow application execution, the rest of this section presents it in more detail. Additional details can be found in Costa et al. [27].

The performance predictor uses a queue-based model to represent the distributed storage system.
It takes as inputs a description of the overall workflow composition, a characterization of the workload generated by each stage of the workflow application, the system configuration (e.g., the replication level and the system-wide chunk size used by the shared storage system), and the performance characteristics of the hardware platform (summarized in Figure 3.1). The predictor uses the system configuration and performance characteristics to instantiate the intermediate storage system model, and uses the workload description to drive a discrete-event simulation that produces runtime estimates for each stage of the workflow and for the aggregate runtime. The remainder of this section explains the key building blocks of the performance predictor.

2.3.1 The System Model

The predictor models the participating system components of an intermediate storage. The system components are modeled similarly: each is composed of a service module with its in- and out-queues (shown in Figure 2.7). An application driver emulates the workflow scheduler and replays the application traces for all workflow stages. The storage layer is modeled by the following components: the manager component is responsible for storage metadata operations; the storage component stores and replicates file chunks; and the client component provides an I/O interface to the application by communicating with the storage components (via read/write operations) and the metadata manager. Each module runs as a service that handles various request types.

[Figure 2.7 labels: Application Driver; Net Client Service; Net Manager Service; Net Storage Service; Network core; In queue; Out queue; Service queue]

Figure 2.7: The queue-based model. The application driver replays the workflow trace to emulate the workflow execution.

A service has configurable service times for different request types.
Each request is placed in the corresponding FIFO queue and removed from the queue once it has been fulfilled.

The rationale behind the development of the prediction mechanism is to increase the accuracy of the model only until it reaches an adequate level (correctly predicting the relative performance of different configuration choices does not require perfect accuracy). To avoid over-engineering, the predictor captures core I/O events and models file operations at chunk-level granularity. However, it simplifies the modeling of metadata operations and captures control paths at coarser granularity, as the accuracy of modeling metadata operations hardly impacts the total execution time.

2.3.2 Model Seeding: System Identification

The aforementioned system model is a generic abstraction of actual hardware platforms. To determine the performance characteristics of an actual platform, the predictor uses a lightweight identification process for its key performance-related parameters (Table 2.1).

Table 2.1: Platform Performance Parameters

Parameter                       Symbol
local network service time      $\mu_{local}$
remote network service time     $\mu_{remote}$
manager service time            $\mu_{ma}$
storage node service time       $\mu_{sm}$
client processing time          $\mu_{cli}$

The parameters include the service times of the three system components (manager: $\mu_{ma}$, storage: $\mu_{sm}$, client: $\mu_{cli}$). Each of the three modules has a network component that handles the incoming and outgoing requests. Since the client can request data from a collocated or a remote storage module, the network service time differs between the two cases. Thus, the key parameters for system identification also include the remote network service time ($\mu_{remote}$) and the local network service time ($\mu_{local}$).

A non-intrusive and low-cost procedure at the client level (no changes are required to the kernel module or the underlying system) identifies the values used to seed the storage system’s parameters.
The seeding process deploys one client, one storage module and one manager on the actual machines, and measures the time to read and write files of different sizes. In addition, a script runs a network utility tool (e.g., iperf) to measure the network service times in scenarios where the client and the storage module are collocated and non-collocated.

The predictor is seeded with an empirical distribution of the values obtained for each performance parameter. Using a distribution instead of a fixed value better reflects actual application runs on the computing platform.

2.3.3 Workload Description

The workload description is an application trace logged by the distributed storage. The trace reveals two important pieces of information about the workflow application: (i) it contains per-client I/O operations (e.g., open, read, write, flush) with timestamps, operation size, offset and type, as well as the application compute times between these operations; (ii) it shows the files’ dependency graph, which the predictor can use to infer the workflow stages and tasks.

After the workload description is obtained, the predictor processes the trace to infer the operations’ runtimes and the compute times of every workflow task, generates the I/O events to be simulated, and then simulates the application’s I/O execution by driving these I/O events.

Chapter 3

The Design of the Energy Consumption Predictor

The previous chapter discusses the application domain, the middleware and usage scenarios, and highlights the performance configuration problem. The main focus of this dissertation is the energy-based analogue, in the same execution context, of the configuration and provisioning problem that focuses on application turnaround time. The proposed approach is to design energy prediction tools that enable exploring the system configuration and provisioning space and evaluating the performance-energy tradeoffs.
This chapter discusses the requirements for the energy predictor (Section 3.1), the analytical energy model that estimates the energy consumption of the application execution (Section 3.2), how the energy model is seeded with the actual power profiles obtained from the computing platform (Section 3.3), and finally the implementation of the energy predictor (Section 3.4).

3.1 Requirements

The goal is to design energy prediction tools that find the optimal configuration without running actual experiments. To reach it, predicting energy consumption should not be time consuming (so that the tool can be used to explore multiple configurations) and should provide adequate accuracy (so that the relative estimates of energy-performance tradeoffs are valid across different configurations, making it possible to evaluate the tradeoffs among different configuration choices). Additionally, the prediction tools should be simple to use: they should not require complex system instrumentation or seeding measurements.

These considerations make the performance predictor described in Section 2.3 a good starting point for an energy prediction tool.
It provides a breakdown of the time spent by each system component, which can be leveraged as input to the energy predictor, and it satisfies a common set of requirements: (i) a simple model and seeding mechanism, (ii) effective identification of the desired system configuration, and (iii) scalability to predict a workflow application run on an entire cluster while using far fewer resources than running the actual application [27].

3.2 Energy Model Description

A typical stage of a workflow progresses as follows: (i) each node brings the input data from the intermediate storage to memory (likely through multiple I/O operations), (ii) the processor loads the data from memory and processes it, and (iii) the output is pushed back to the intermediate storage.

Thus, different phases of the workload can be associated with different power profiles: (1) Idle state: part of the power is spent simply to keep the node on; (2) Application processing state: the node runs application task binaries on the CPU with the data already in memory (i.e., after it has been fetched from storage); (3) Storage servicing state: the node serves read/write requests; and (4) Network I/O state: the node performs inter-node data transfers. The energy usage modeling is guided by the power consumption profile of each of these states: idle ($P^{idle}$), CPU processing ($P^{App}$), storage operations ($P^{storage}$), and network transfers ($P^{net}$). As RAMDisks are used for the shared storage^4, the I/O operations are mainly operations over RAM. This idea of profiling different power states follows the same principle as the coarse-grained energy consumption models used by Costa et al. [25] and Ibtesham et al. [35] in different contexts.
These power states are also sufficient to represent the major execution states of the workflow runs that this work targets.

With this mindset, the total energy spent during a workflow application execution can be expressed as the sum of the energy consumption of the individual nodes:

$$E_{cluster} = \sum_{i}^{N} E^{total}_i \tag{3.1}$$

where $N$ is the total number of nodes and $E^{total}_i$ is the total energy consumption of node $i$. For each node, the energy usage during a workflow application execution is:

$$E^{total}_i = E^{base}_i + E^{App}_i + E^{WS}_i \tag{3.2}$$

The base energy is the energy spent to keep the node active:

$$E^{base}_i = P^{idle}_i \times T^{total} \tag{3.3}$$

where $P^{idle}_i$ is the node’s idle power and $T^{total}$ is the predicted application runtime. This base portion of the total energy consumption accounts mainly for the non-energy-proportionality of the hardware platform. As platforms become increasingly energy proportional, the share of idle power in the total energy envelope is expected to decrease.

^4 This is a common setup for running workflow applications, often imposed by the infrastructure itself (e.g., IBM BG/P nodes are not equipped with hard drives).

The application energy is the additional energy the application consumes once the needed data has been fetched. It is modeled by

$$E^{App} = (P^{App} - P^{idle}) \times T^{App} \tag{3.4}$$

The workflow system energy ($E^{WS}_i$) is the energy spent by the underlying workflow system that performs data reads and writes.
It is modeled by

$$E^{WS}_i = E^{storage}_i + E^{net}_i \tag{3.5}$$

$E^{WS}_i$ is the sum of the energy spent on reading from and writing to the local storage ($E^{storage}_i$) and on sending data to and receiving data from other compute nodes over the network ($E^{net}_i$). Further, $E^{storage}_i$ is modeled by

$$E^{storage}_i = (P^{storage} - P^{idle}) \times T^{storage}_i \tag{3.6}$$

where $P^{storage}$ is the node power consumption when performing storage operations and $T^{storage}_i$ is the time spent on these operations.

Similarly, the energy spent on network transfers is estimated as:

$$E^{net}_i = (P^{net} - P^{idle}) \times T^{net}_i \tag{3.7}$$

which is the product of the node’s power draw above idle when doing network transfers and the time spent on them. As the performance predictor tracks the network events, it is feasible to estimate the time spent reading data from and writing data to the network.

The high-level analytical energy consumption model captures the key parts of the total consumption during the workflow execution: $E^{base}_i$ is the idle consumption due to non-energy-proportionality; $E^{App}_i$ is the energy spent by the application on the necessary computation; and $E^{WS}_i$ is the energy spent on storage I/O and network I/O. Architectural design improvements can reduce $E^{base}_i$, while algorithmic optimizations could reduce $E^{App}_i$. Workflow scheduling and intermediate storage optimizations can reduce the energy overhead of I/O ($E^{WS}_i$).

The model estimates the main ingredients of the energy cost using the average value of each power profile and the associated time spent in that profile. When the intermediate storage is initialized with different configuration decisions, the model consumes different time inputs. When power tuning techniques (e.g., Dynamic Voltage and Frequency Scaling (DVFS)) are used, the model consumes different power inputs.
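As an illustration only, Equations 3.1 to 3.7 can be sketched in a few lines of Python. The class and function names are hypothetical (the thesis describes a Java implementation inside the predictor), the power values are the Taurus figures reported in Table 4.1, and the per-node times are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class PowerProfile:
    """Average node power (watts) in each power state."""
    idle: float
    app: float
    storage: float
    net: float


@dataclass
class NodeTimes:
    """Per-node time (seconds) spent in each non-idle state."""
    app: float
    storage: float
    net: float


def node_energy(p: PowerProfile, t: NodeTimes, t_total: float) -> float:
    """Energy of one node in joules, following Equations 3.2 to 3.7."""
    e_base = p.idle * t_total                      # Eq. 3.3
    e_app = (p.app - p.idle) * t.app               # Eq. 3.4
    e_storage = (p.storage - p.idle) * t.storage   # Eq. 3.6
    e_net = (p.net - p.idle) * t.net               # Eq. 3.7
    e_ws = e_storage + e_net                       # Eq. 3.5
    return e_base + e_app + e_ws                   # Eq. 3.2


def cluster_energy(p: PowerProfile, times: list, t_total: float) -> float:
    """Sum of per-node energies (Eq. 3.1), assuming homogeneous nodes."""
    return sum(node_energy(p, t, t_total) for t in times)


# Power values from Table 4.1; the times are made up for illustration.
taurus = PowerProfile(idle=91.6, app=125.2, storage=129.0, net=127.7)
times = [NodeTimes(app=60.0, storage=20.0, net=10.0)] * 2
total_joules = cluster_energy(taurus, times, t_total=100.0)
```

In this sketch, a different storage configuration changes only the `NodeTimes` inputs and the total runtime, while a power tuning technique such as DVFS changes only the `PowerProfile`.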
Thus, this model is able to capture the energy cost of the various configuration and technique decisions that this work targets. Admittedly, the coarse-grained model will not achieve perfect accuracy for each configuration point. Since the requirement is to guide good configuration choices, the model meets it as long as it correctly estimates the relative values of different configurations (Chapter 4 evaluates the predictor using various configurations).

This linear model requires both power and time inputs. Section 3.3 explains how the power parameters are gathered, and Section 3.4 explains the changes made to augment the performance predictor [27] so that it generates the time inputs for the energy model.

3.3 Energy Model Seeding

To seed the parameters of the energy model, one needs both the power characteristics of the nodes^5 and the corresponding time spent in each power profile.

^5 Since this work considers homogeneous compute nodes that have similar performance and power, one node is used to perform power identification. In the case of a heterogeneous platform, the seeding process should be performed for each node type.

Table 3.1: Platform Power Parameters

Parameter                                        Symbol
idle node power                                  $P^{idle}_i$
node power when stressing CPU only               $P^{App}_i$
node power when performing storage operations    $P^{storage}_i$
node power when doing network transfer           $P^{net}_i$

Synthetic workloads that resemble different phases of the workflow application execution are used to obtain an estimate of the power consumption in the different power states: (i) $P^{idle}_i$: power samples are collected over a period during which the nodes are idle; (ii) $P^{App}_i$: stress [6] is used to impose load on the CPU while power is measured; (iii) $P^{storage}_i$: local write and read throughput tests are used to obtain this profile; and (iv) $P^{net}_i$: remote writes from a client to a storage service are performed and the power is measured at the client side.
Table 3.1 lists the gathered power parameters.

Accordingly, the energy model is also seeded with the time spent in each power state: (i) $T^{total}$: the total execution time is estimated by the performance predictor; (ii) $T^{App}$: the time spent on CPU processing is inferred by the predictor from the application trace logged by the storage system; (iii) $T^{storage}_i$: the time spent performing local I/O operations is estimated by the predictor; and (iv) $T^{net}_i$: the time spent on network transfers is also estimated by the predictor, which keeps track of the network events during simulation and gathers time estimates.

3.4 Implementation

[Figure 3.1 labels: Prediction Framework; Workload Description; System Configuration; Platform Performance Characteristics; Performance Predictor; Time Inputs; Platform Power Characteristics; Energy Model; Predicted Energy]

Figure 3.1: The predictor receives the application description and the platform performance and power characteristics. The performance predictor estimates the time for several events in the system and passes this information to a module that uses the power characteristics of the platform and the energy model to estimate the energy consumption.

Figure 3.1 presents the energy-related augmentations to the performance predictor presented in Section 2.3. The predictor is augmented to track the time each node spends in the different phases of a workflow task: executing the compute-intensive part of the application, writing to or reading from storage, and receiving or sending network data. As explained in Section 2.3.3, a workflow scheduler executes the application on a minimum setup, and the intermediate storage system logs the client-side trace. The predictor then preprocesses the trace to infer the time spent on computation. To track the time spent on the network, each network request is instrumented to track the traffic sent to each machine. The storage entity is instrumented to track the time spent on each read/write request.
The performance predictor consumes an empirical distribution of values for each performance parameter. When an entity processes a network or storage request, the predictor increments the service time based on the requested data size and the performance parameter inputs.

When the simulation completes, the predictor obtains detailed statistics of each machine’s storage I/O time and network transfer time. These detailed statistics also open the opportunity to predict energy consumption on a heterogeneous cluster in which machines have different power profiles.

The aforementioned energy model is implemented as modules that augment the discrete-event performance simulator in Java. The statistics module keeps track of the time spent in each power state. The energy module estimates the energy consumption. It implements the model described in Section 3.2, receives the parameters that describe the power consumption of the platform (Section 3.3) in each phase of a workflow task, and uses the same system configuration as the performance predictor (e.g., the number of storage nodes, the number of clients). Finally, the module obtains from the performance predictor the estimated time spent in each power state and computes the predicted energy.

To make the picture clearer, consider predicting the energy consumption of a single workflow task. The client contacts the manager regarding the location of the input files. The manager replies with a set of storage entities that store the files. The client retrieves the data chunks and performs the computation on the fetched data. After the computation is done, the client contacts the manager to store the output files on the storage entities. In this process, the storage entities keep track of their read/write service time. Each entity also tracks the time spent in network transfers.
The energy module combines the gathered power characteristics with the time estimates for each power state to predict the overall consumption, which includes reading input files, writing output files, computation and network transfers.

Chapter 4

Evaluation

This chapter uses synthetic benchmarks, which represent common data access patterns found in workflow applications, and real applications to evaluate the accuracy of the proposed energy consumption predictor. The two representative applications (BLAST [17] and Montage [40]) incorporate multiple patterns. Additionally, the chapter evaluates the predictor under different configuration choices and power tuning techniques (Section 4.4) and analyzes energy-performance tradeoffs when one varies the size of the resource allocation (i.e., the number of allocated nodes) (Section 4.5).

Summary of results: For synthetic benchmarks, the predictor achieves 89% accuracy on average. For real workflow applications, the predictor achieves 91% accuracy on average. Overall, the median accuracy is 90% across the different scenarios. Additionally, the predictor is able to capture the energy impact of CPU throttling when applied to applications with different characteristics. Finally, the predictor supports accurate resource allocation decisions on different platforms when energy-performance tradeoffs are considered.

4.1 Experimental Setup and Platform

4.1.1 Storage System

The evaluation uses MosaStore [14, 16] as the intermediate storage system because it has multiple configuration knobs (e.g., replication level, chunk size, data placement policy) whose impact on energy consumption can be evaluated.
The storage services use RAMDisks as the storage media, as RAMDisks are commonly used to support workflow applications and are the only option on some supercomputers (e.g., IBM BG/P machines).

4.1.2 Testbed and Power Meters

The evaluation uses 11 nodes from the Grid5000 ‘Taurus’ cluster at the ‘Lyon’ site [22]. Each node has two 2.3GHz Intel Xeon E5-2630 CPUs (each with six cores), 32GB of memory and a 10 Gbps NIC. A dedicated node runs the metadata manager and the workflow scheduler, while the other nodes run the storage service, the I/O client service, and the application processes. Each node is connected to an SME Omegawatt power meter, which provides 0.01W power resolution at a 1Hz sampling rate. For each node, the power samples are aggregated over the duration of the workflow execution to measure the total energy consumption. The evaluation does not consider the energy consumed by the node that runs the metadata service and the workflow scheduler, as this is fixed and not subject to the configuration changes that the prediction mechanism targets (e.g., replication level, number of nodes).

Table 4.1: Platform power parameters for the Taurus cluster. These values change when power tuning techniques are applied or other clusters are considered.

Parameter                                   Symbol             Value
idle power                                  $P^{idle}_i$       91.6W
power when stressing CPU only               $P^{App}_i$        125.2W
power when performing storage operations    $P^{storage}_i$    129.0W
power when doing network transfer           $P^{net}_i$        127.7W
peak power                                  $P^{peak}_i$       225.0W

The Machine Power Profile. The energy model seeding procedure is executed as described in Section 3.3 to identify the power consumption of a node in the different power states: idle, App, storage, net, and peak. Table 4.1 shows the values identified for the default testbed configuration (‘Taurus’ cluster). These values change when power tuning techniques are applied (Section 4.4) or other clusters are considered (Section 4.5).

4.1.3 Evaluation Metrics

The evaluation focuses on prediction accuracy by comparing the energy consumption and execution time of actual runs with the predictions. Prediction inaccuracy is defined as $I(E) = |1 - E_{pred}/E_{actual}|$ for energy and $I(T) = |1 - T_{pred}/T_{actual}|$ for time.

Plots report the average of 10 trials, with error bars showing the standard deviation.

4.2 Synthetic Benchmarks: Workflow Patterns

Previous work has identified common data access patterns in real applications [21, 48, 55, 57]. The evaluation uses synthetic benchmarks that mimic these patterns and evaluates the predictor’s accuracy for each pattern independently before evaluating real applications that have multiple patterns.

The workflow patterns used in the evaluation are pipeline, reduce and broadcast. The synthetic benchmarks involve multiple concurrent clients and storage nodes, and each client performs intensive ‘read-process-write’ procedures mimicking workflow stages.

The pipeline benchmark models a set of compute tasks assembled in a number of parallel sequences such that the output of one task is the input of the next task in the chain (Figure 4.1). In this experiment, 10 application pipelines run concurrently on 10 compute nodes and perform three processing stages that read files from and write files to the distributed shared storage, and also stress the CPU.

The reduce benchmark represents a single task that consumes the outputs produced by multiple computations. In the experiments, 10 processes run in parallel on different nodes, and each produces an intermediate result. A subsequent reduce task consumes those intermediate files and produces the final output.

[Figure 4.1 panels: Pipeline, Reduce, Broadcast; arrows are labeled with file sizes between 100MB and 2GB]

Figure 4.1: Pipeline, Reduce and Broadcast benchmarks.
Circles represent a workflow task performing CPU processing using stress [6], and arrows represent data transfers among stages. The labels on the arrows represent the file sizes used in the benchmarks.

Broadcast benchmark has a single task producing an output file that is consumed by multiple concurrent tasks. 10 processes run in parallel and consume the file produced in an earlier stage.

Label DSS (Default Storage System) is used for experiments running a default system configuration: data chunks are striped across storage nodes in a round-robin fashion and no optimization is provided for any workflow pattern. Label WOSS (Workflow Optimized Storage System) is used for the experiments where the system configuration is optimized for a specific workflow pattern (including location-aware scheduling, data placement, or replication) [16, 51]. The goal of showing results for these two configurations is two-fold: (i) demonstrate the predictor’s accuracy in a default configuration setting (Section 4.2.1), and (ii) show its ability to predict energy savings when performance-optimized configurations are used (Section 4.2.2).

4.2.1 Evaluating Energy Prediction Accuracy on DSS

The predictor’s accuracy is first evaluated when the workflow patterns run on a default storage system (DSS). DSS uses a default global configuration for all applications (i.e., no replication, and files are striped round-robin across all storage nodes). Figure 4.2 presents the predicted and actual energy consumption for the synthetic patterns on DSS. The pipeline benchmark exhibits the best accuracy (only 5.2% inaccuracy). For reduce the average inaccuracy is 16.4%, while for broadcast it is 15.9%. Overall, the energy consumption predictions have an average inaccuracy of 12.5% and typically fall close to the one-standard-deviation interval of the measurements.
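As an illustration of how the inaccuracy metric from Section 4.1.3 is applied, the following Python sketch computes I(E) and reproduces the 12.5% average from the three per-pattern values reported above. The function name and the sample energy values in the last lines are hypothetical and serve only as an example; they are not measurements from the testbed.

```python
def inaccuracy(predicted, actual):
    """Relative inaccuracy |1 - predicted/actual|, as defined in Section 4.1.3."""
    if actual == 0:
        raise ValueError("actual value must be non-zero")
    return abs(1.0 - predicted / actual)


# Energy inaccuracies on DSS in percent, taken from the text above.
DSS_INACCURACY_PCT = {"pipeline": 5.2, "reduce": 16.4, "broadcast": 15.9}

average_pct = sum(DSS_INACCURACY_PCT.values()) / len(DSS_INACCURACY_PCT)
print(f"average inaccuracy on DSS: {average_pct:.1f}%")

# Hypothetical example values (not measured): 45 kJ predicted, 50 kJ actual.
print(f"example I(E): {inaccuracy(45.0, 50.0):.2f}")
```

The same function applies unchanged to execution time, since I(T) has the same form as I(E).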
Importantly, the predictor’s response time is 20-30x shorter than the time needed to run the actual benchmark, and it uses 200x-300x fewer resources (machines × time), which shows that the results satisfy the objectives presented in Section 3.1.

Figure 4.2: Actual and predicted average energy consumption for pipeline, reduce, broadcast benchmarks on DSS.

4.2.2 Evaluating Energy Prediction Accuracy on WOSS

In the default storage system configuration (DSS), a round-robin data placement policy is used: the files produced by workflow tasks are striped across the nodes of the shared storage. Thus, when a task consumes its input files, it needs to connect to other compute nodes and receive file chunks, which generates high network contention and results in suboptimal performance.

As discussed in Section 2.2.2, users can use POSIX extended file attributes to inform the storage system about specific data access patterns. Extended file attributes enable workflow optimizations including moving computation near data and location-aware scheduling. Hence, this section evaluates the predictor’s ability to capture the energy savings when using a workflow optimized storage system (WOSS). Table 4.2 shows the metadata attributes and the corresponding optimizations used for the different patterns. The hints about the different patterns are set as key-value pairs: the key is the name of the attribute, and the value is the concrete value of the attribute.

Table 4.2: Metadata attributes and the corresponding optimizations

Pattern      Attribute                               Optimization
Pipeline     set(DP, local)                          Indicates preference to allocate the file blocks on the local storage node.
Reduce       set(DP, collocation | <group-name>)     Preference to allocate the blocks for all files within the same <group-name> on one node.
Broadcast    set(Replication, <repNum>)              Replicate the blocks of the file <repNum> times.

In the pipeline scenario, unlike DSS, which stripes the data chunks of each file across all storage nodes, WOSS stores the intermediate pipeline files on the storage node co-located with the application. The workflow scheduler later places the task that consumes the file on the same node to improve performance via data locality. In the reduce scenario, WOSS co-places the output files on a single node and exposes their location, so that the scheduler can schedule the reduce task on that machine. In the broadcast scenario, parallel tasks consume the same file concurrently, which creates an access bottleneck. WOSS creates multiple replicas of the bottleneck file so that the parallel tasks have multiple data access points.

Figure 4.3, Figure 4.4 and Figure 4.5 show the actual and predicted average energy consumption for the three benchmarks on DSS and WOSS.

The right-hand (WOSS) bars in Figure 4.3 show the actual and predicted energy for the pipeline benchmark: the predictor achieves 13.4% inaccuracy. The predictor achieves 12.2% inaccuracy for reduce. Figure 4.5 shows the results using 4 replicas: the predictor achieves 16% inaccuracy for broadcast. WOSS exploits data locality and location-aware scheduling; thus it reduces the energy spent on data movement as well as idle energy, since it accelerates the application execution. The predictor accurately predicts the energy savings compared with DSS. It has 2.7% inaccuracy for the energy savings of replacing DSS with WOSS in the pipeline scenario, 22.6% inaccuracy for the energy savings for reduce, and 15.6% inaccuracy for broadcast.

Figure 4.3: Actual and predicted average energy consumption for the pipeline benchmark.

Figure 4.4: Actual and predicted average energy consumption for the reduce benchmark.

Figure 4.5: Actual and predicted average energy consumption for the broadcast benchmark.

4.2.3 Summary

The predictor captures the energy consumption for both DSS and WOSS configurations with adequate accuracy and, more importantly, accurately predicts the energy savings brought by WOSS. As a result, the predictor can help users make storage system configuration decisions based on the energy consumption metric.

4.3 Predicting the Energy Envelope of Real Applications

This section evaluates the framework’s prediction ability when it is used for real workflow applications. It uses two applications, BLAST [17] and Montage [40], with representative workloads to evaluate the proposed predictor.

4.3.1 BLAST Results

BLAST [17] is a DNA search tool. Each node receives 8 DNA sequence queries as input (a file for each node) and all nodes search the same database file (i.e., BLAST has the broadcast pattern) (Figure 4.6). The database refseq_rna has 18 files and they are stored on the intermediate storage.
The input files are staged into the intermediate storage, and each node produces one output file that is staged out to the backend storage.

Figure 4.6: BLAST workflow. All nodes search BLAST database (1.8GB) in parallel.

Figure 4.7: Actual and predicted average energy consumption and execution time for BLAST. Panels: (a) Energy, (b) Time.

As shown in Figure 4.7, the predictor achieves 11% inaccuracy for energy prediction and 5.2% inaccuracy for time prediction. Compared with the previously discussed broadcast benchmark, the real application prediction achieves better results.

4.3.2 Increasing the Workflow Complexity: Montage Results

Figure 4.8: Montage workflow

Montage [40] is a complex astronomy workflow composed of 10 different stages (Figure 4.8), with highly variable I/O communication intensity among the workflow stages (Table 4.3 shows a small workload and Table 4.4 shows a large workload). Additionally, the application has a number of distinct workflow patterns (e.g., mProject, mDiff and mBackground have the pipeline pattern; mConcatFit and mAdd have the reduce pattern). In total the small workload contains around 2000 tasks, while the large workload contains around 13200 tasks.

Table 4.3: Characteristics of small Montage workload

Stage          Data     #Files   File Size
stageIn        320MB    163      1.7MB-2.1MB
mProject       1.3GB    324      3.3MB-4.2MB
mImgTbl        50KB     1        50KB
mOverlaps      54KB     1        54KB
mDiff          409MB    895      100KB-3MB
mFitPlane      1.8MB    449      4KB
mConcatFit     21KB     1        21KB
mBgModel       8.3KB    1        8.3KB
mBackground    1.3GB    325      3.3MB-4.2MB
mAdd           1.3GB    2        503MB
mJPEG          15MB     1        15MB
stageOut       518MB    2        15MB-503MB

Figure 4.9: Actual and predicted average energy consumption and execution time for small Montage workload. Panels: (a) Energy, (b) Time.

Table 4.3 shows the small Montage workload used for evaluation. Although it is the smaller of the two Montage workloads, it still has complex workflows and a large number of tasks. Figure 4.9 shows the prediction results. The predictor achieves 13.8% inaccuracy in time and 15.9% inaccuracy in energy for Montage.

Table 4.4: Characteristics of large Montage workload

Stage          Data     #Files   File Size
stageIn        1.9GB    955      1.7MB-2.1MB
mProject       8.0GB    1910     3.3MB-4.2MB
mImgTbl        284KB    1        284KB
mOverlaps      336KB    1        336KB
mDiff          2.6GB    5654     100KB-3MB
mFitPlane      12MB     2833     12MB
mConcatFit     575KB    1        575KB
mBgModel       49KB     1        49KB
mBackground    8.0GB    1910     3.3MB-4.2MB
mAdd           6.0GB    2        3.0GB
mJPEG          46MB     1        46MB
stageOut       3.05GB   2        46MB-3GB

Table 4.4 shows the large Montage workload used for evaluation. Figure 4.10 shows the prediction results, while Table 4.5 shows the per-stage prediction results for the large workload. (stageIn and stageOut write raw inputs to the intermediate storage and final outputs to the backend storage, respectively, so they are not considered in the prediction. mImgTbl is included as part of the mProject stage.)
In terms of total energy cost and time to solution, the predictor achieves 1.5% inaccuracy for energy consumption and 1% inaccuracy for time. However, as the per-stage results show, some stages are under-predicted (e.g., mProject, mAdd), while others are over-predicted (e.g., mConcatFit, mBackground).

Figure 4.10: Actual and predicted average energy consumption and execution time for large Montage workload. Panels: (a) Energy, (b) Time.

Table 4.5: Per-stage results of the large Montage workload

Stage          Actual     Predicted   Inaccuracy
mProject       148.7kJ    131.6kJ     11.6%
mOverlaps      2514J      2625J       4.4%
mDiff          145.5kJ    160.5kJ     10.4%
mFitPlane      58.1kJ     57.8kJ      0.6%
mConcatFit     25.2kJ     29.2kJ      15.8%
mBgModel       35.7kJ     34.4kJ      3.8%
mBackground    50.5kJ     58.7kJ      16.2%
mAdd           310.2kJ    283.3kJ     8.7%
mJPEG          85.4kJ     92.0kJ      7.8%

Improving the per-stage accuracy is ongoing work.

4.3.3 Summary

Overall, the energy predictions are more accurate for the real applications than for the synthetic benchmarks. This happens because the synthetic benchmarks are designed to put high stress on the I/O subsystem, which results in contention and higher variance that is harder to capture when modeling the storage system.

4.4 Predicting the Energy Impact of Power-centric Tuning

CPU frequency scaling (a.k.a. CPU throttling) is an important technique in which processors run at less-than-maximum frequency to conserve power. Frequency scaling, however, limits the number of instructions a processor can issue in a given amount of time; thus this technique can prolong the execution time while reducing instantaneous power.
Therefore, it is not clear whether frequency scaling can reduce the energy cost for workflow applications.

To evaluate the predictor’s ability to predict the energy impact of CPU frequency scaling, two types of representative applications are used: (i) BLAST, with the same workload as in the previous section, representing a mix of I/O and CPU intensive applications; (ii) the pipeline benchmark, performing just I/O operations (the benchmark is modified to have only a minimal CPU stressing stage), representing an I/O intensive application. The processors are set at different frequencies (1200MHz, 1800MHz and 2300MHz), and independent seeding is performed for each frequency.

Figures 4.11 and 4.12 show the actual and predicted energy consumption for BLAST and for the pipeline benchmark, respectively, at the different frequencies. Since BLAST is more CPU intensive, using the minimum frequency (1200MHz) just prolongs the runtime and leads to 85.5% more energy consumed than when using the maximum frequency. The predictor captures this much higher energy cost, estimating a 96.5% increase.

Figure 4.11: Actual and predicted average energy consumption and execution time for BLAST for various CPU frequencies. Panels: (a) Energy, (b) Time.

Figure 4.12: Actual and predicted average energy consumption and execution time for the pipeline benchmark for various CPU frequencies. Panels: (a) Energy, (b) Time.

For pipeline, using the minimum frequency does not increase runtime.
In fact, since the instantaneous power is reduced, CPU throttling actually brings energy savings, which the predictions partially capture: the actual runs show 17% energy savings, while the predictor estimates 11% savings.

Summary: The results for the two workloads highlight that, depending on the computational and I/O characteristics of the workflow application, CPU throttling can bring energy savings or lead to additional energy costs. The predictor provides an effective mechanism to predict the energy consumption when the platform enables power tuning techniques like frequency scaling, and it can be used in practice to make configuration decisions.

4.5 Predicting Energy-Performance Tradeoffs

Another important decision available to users/administrators is the allocation size. As one allocates more compute and storage nodes for executing workflow applications, the performance should improve because of the extra computing resources. However, due to scale overheads and non-energy proportionality, the total energy cost will likely increase. For instance, the Montage workload evaluated has the lowest energy footprint when using only one node (yet in this case it displays the highest time to solution). A popular hybrid metric that finds a compromise between these two metrics is the energy-delay product (EDP). This section evaluates the predictor’s ability to accurately estimate this hybrid metric for various setups.

The energy-delay product is estimated while varying the number of allocated nodes. Due to the platform size limit, the same Montage workload used in the previous sections is executed on up to 10 nodes to evaluate the accuracy of the predictor. Figure 4.13 shows the EDP results, and Figure 4.14 shows the predicted and actual energy consumption and performance. The experiments suggest that the predictor can be used to make resource allocation decisions in the ‘Taurus’ cluster: the actual runs indicate that using 8 nodes gives the best EDP for the workload evaluated, and the predictor suggests that 8-10 nodes are good choices.

Figure 4.13: Actual and predicted Montage energy-delay product (EDP) at the various scales at which the experiments can be executed in the ‘Taurus’ cluster.

Figure 4.14: Actual and predicted Montage energy consumption and performance at the various scales at which the experiments can be executed in the ‘Taurus’ cluster. The numbers in the plot represent the number of allocated nodes in the executed scenarios.

Figure 4.15: Actual and predicted Montage energy-delay product (EDP) on up to 15 nodes in the ‘Sagittaire’ cluster.

Figure 4.16: Actual and predicted Montage energy consumption and performance at the various scales in the ‘Sagittaire’ cluster. The numbers in the plot represent the number of allocated nodes in the executed scenarios.

The predictor is also evaluated using 15 nodes (each of which has two 2.4GHz AMD Opteron CPUs (each with one core), 2GB RAM and a 1 Gbps NIC) from the Grid5000 ‘Sagittaire’ cluster.
‘Sagittaire’ has a larger number of machines, but it is less energy proportional than the ‘Taurus’ cluster (servers in ‘Sagittaire’ are circa 2006 generation, while servers in ‘Taurus’ are circa 2012 generation). Figure 4.15 shows the predicted and actual EDP on up to 15 nodes using the same Montage workload, while Figure 4.16 shows the predicted and actual energy consumption and performance. The predictor accurately shows that using 5 nodes for the workload is the best decision among the executed scenarios.

Figure 4.17: Predicted Montage energy-delay product (EDP) on up to 50 nodes in a hypothetical cluster in which idle power is 10% of the peak.

Figure 4.17 shows the Montage EDP results when the predictor is evaluated on a hypothetical cluster where the values of the power states are the same as in ‘Taurus’ except that the idle power is 10% of the peak power (in ‘Taurus’ the idle power is 40% of the peak). This hypothetical cluster represents a more energy-proportional cluster than the clusters in Grid5000. The results suggest that using 20 nodes is the best decision.

Summary: This section evaluates the predictor’s ability to make resource allocation decisions when users consider a hybrid energy and performance metric. The accuracy is evaluated on different platforms. The best decision on the least energy-proportional platform (‘Sagittaire’) is choosing 5 nodes for the target workload. The best decision on the most energy-proportional platform (the hypothetical cluster) is using 20 nodes.
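The allocation decision discussed in this section amounts to computing the energy-delay product for each candidate allocation size and picking the smallest. A minimal Python sketch follows; the function names and the (energy, time) pairs are hypothetical placeholders chosen for illustration, not values measured on ‘Taurus’ or ‘Sagittaire’.

```python
def energy_delay_product(energy_kj, time_s):
    """EDP in kJ*s: energy multiplied by time to solution."""
    return energy_kj * time_s


def best_allocation(runs):
    """Return the node count with the lowest EDP.

    runs maps a node count to an (energy_kj, time_s) pair, which may come
    from actual runs or from the predictor.
    """
    return min(runs, key=lambda nodes: energy_delay_product(*runs[nodes]))


# Hypothetical (energy in kJ, time in s) pairs, for illustration only.
example_runs = {
    1: (100.0, 600.0),
    4: (120.0, 200.0),
    8: (150.0, 120.0),
    10: (180.0, 110.0),
}

for nodes, (energy_kj, time_s) in sorted(example_runs.items()):
    print(nodes, energy_delay_product(energy_kj, time_s))
print("lowest EDP at", best_allocation(example_runs), "nodes")
```

In this made-up example the single-node run has the lowest energy while a larger allocation has the lowest EDP, which is the kind of tradeoff the hybrid metric is meant to expose.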
The predictor accurately demonstrates that as the non-peak efficiency of the platform improves, users can allocate more resources to optimize for the energy-delay product.

Chapter 5
Discussion

5.1 What Are the Causes of Inaccuracies?

Although some inaccuracy is expected from the simplicity of the model and its seeding mechanism, as mentioned in §3, it is important to discuss the sources of inaccuracy in more depth. They fall into three main categories. First, the cluster used in the evaluation shares a networking switch with two other clusters, so interference can affect the accuracy of the seeding measurements: the platform characteristics gathered during seeding could differ from those present during the experiments. This factor can be addressed by making more exclusive reservations to limit network interference.

Second, although the nodes in the cluster used for evaluation are homogeneous machines, they can have different performance and power profiles. For instance, the peak power of a machine can vary by around 3%, and the idle power can vary by around 5%. Thus, the performance and power parameters gathered from one node might not accurately reflect the characteristics of other nodes in the same cluster.

The third issue, and the more important one in this context, is attributing inaccuracies precisely to the time model or to the energy model. One approach to identify the source of inaccuracy is to compare the energy predictions obtained when the energy predictor is given actual times as input (in this case an accurate breakdown of the time spent in each power state is provided to the energy model) with those obtained when it is given predicted times. Experiments are conducted to log the full I/O trace of the synthetic benchmarks used in Chapter 4.
Consider Figure 4.2, which shows the actual and predicted energy consumption for the synthetic benchmarks on DSS: the energy predictor achieves 5% inaccuracy for the pipeline benchmark, while for the reduce benchmark the inaccuracy is 16%. For the pipeline scenario, the predicted and actual times are close (within 1%), so the final 5% inaccuracy can be attributed to the energy model. For the reduce benchmark, the overall predicted time is underestimated by 6%, which leads to energy underprediction. When the actual times are given as input to the energy model, the prediction inaccuracy is reduced to 9%. Thus, out of the 16% inaccuracy, 9% can be attributed to the energy model and 7% can be attributed to the time prediction.

5.2 What to Do to Improve Accuracy?

As discussed in the previous section, the nodes in a cluster can have different characteristics. One approach to increase accuracy is to perform the performance and power seeding process on multiple nodes and use the average power value per state.

The energy model captures the major execution states; however, it does not include the energy spent on the metadata path and on workflow scheduling. Modeling the energy cost of those operations can improve the prediction accuracy.

The energy predictor requires time estimates from the performance predictor, and good time estimates lead to accurate energy predictions. As noted by Costa et al. [26], the performance predictor currently does not capture the workflow scheduler overheads or the workflow task launch overheads. Modeling these overheads can improve the accuracy of the predicted time to solution, as well as of the predicted energy.

5.3 What Is the Advantage of Using the Proposed Energy Model Compared With Others?

The power-state-based model proposed in this thesis highlights the major execution states during the workflow run. The model treats the activities within a power state as a black box.
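To make the power-state view concrete, the following Python sketch estimates energy as the sum, over nodes and power states, of average power multiplied by the time spent in that state. This is an assumed simplification for illustration; the precise model is the one defined in Chapter 3. The power values are those of Table 4.1, while the function names and the per-state time breakdown are hypothetical.

```python
# Average power per state in watts for a 'Taurus' node (Table 4.1).
TAURUS_POWER_W = {
    "idle": 91.6,
    "app": 125.2,
    "storage": 129.0,
    "net": 127.7,
    "peak": 225.0,
}


def node_energy_joules(seconds_per_state, power_w=TAURUS_POWER_W):
    """Sum of power * time over the power states one node went through."""
    unknown = set(seconds_per_state) - set(power_w)
    if unknown:
        raise KeyError(f"unknown power states: {sorted(unknown)}")
    return sum(power_w[state] * seconds
               for state, seconds in seconds_per_state.items())


def total_energy_joules(per_node_breakdowns, power_w=TAURUS_POWER_W):
    """Total energy of a run: the sum of the per-node energies."""
    return sum(node_energy_joules(b, power_w) for b in per_node_breakdowns)


# Hypothetical time breakdown (seconds) for one node; not a measured trace.
example_node = {"idle": 10.0, "app": 20.0, "storage": 5.0, "net": 5.0}
print(f"one node: {node_energy_joules(example_node):.1f} J")
print(f"ten identical nodes: {total_energy_joules([example_node] * 10):.1f} J")
```

Only the per-state average power and the per-state durations enter the computation, which is why no low-level instrumentation is needed to seed such a model.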
Thus, it does not require modeling low-level activities. One can reason about the energy spent due to non-energy proportionality, due to the application’s algorithmic operations, and due to the underlying storage system that supports the workflow. Additionally, the inputs to this model are empirically measured average power values per state, which are easy to obtain and do not require heavy system instrumentation.

As shown in Chapter 4, the predictor that implements the proposed energy model is sufficient for decision making in different scenarios (e.g., performance optimization, power-tuning techniques, resource allocation). Other energy models could increase the accuracy of the prediction at the expense of increased complexity, but the practical benefits of a more complex model are not clear.

5.4 Optimizing for Time VS. Optimizing for Energy

To optimize for time, one can use an optimized storage configuration or add more resources (i.e., add more nodes). The former approach usually exploits data locality and location-aware scheduling, which generally reduces the amount of data transferred and thus reduces the energy cost as well. Due to the non-energy proportionality of state-of-the-art platforms, however, idle power remains a large portion of the total power consumption (for the cluster used in the evaluation the idle power is 40% of the peak power). Increasing the allocation size of a workload could improve performance at the cost of spending more energy. The experiments demonstrate (Section 4.4) that, for a subclass of applications, it is additionally possible to optimize for energy only, by using power-tuning techniques like CPU throttling. However, as demonstrated in the previous chapter, these techniques need to be carefully considered, as they can bring energy savings or lead to additional costs depending on the specific application patterns.
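The reason CPU throttling can cut either way follows from energy being average power multiplied by runtime: lowering the frequency saves energy only if the relative power reduction outweighs the relative slowdown. The Python sketch below illustrates this under the assumption of a single average power per run; the function name and the ratios are hypothetical and are not taken from the measurements in Section 4.4.

```python
def energy_ratio(power_ratio, time_ratio):
    """Energy at low frequency relative to energy at high frequency.

    power_ratio = P_low / P_high (average power), time_ratio = T_low / T_high.
    A result below 1.0 means throttling saves energy.
    """
    if power_ratio <= 0 or time_ratio <= 0:
        raise ValueError("ratios must be positive")
    return power_ratio * time_ratio


# Hypothetical CPU-bound run: power drops 20%, runtime grows 90%.
cpu_bound = energy_ratio(power_ratio=0.80, time_ratio=1.90)
# Hypothetical I/O-bound run: power drops 15%, runtime unchanged.
io_bound = energy_ratio(power_ratio=0.85, time_ratio=1.00)

print(f"CPU-bound: {cpu_bound:.2f}x energy")
print(f"I/O-bound: {io_bound:.2f}x energy")
```

Equivalently, throttling pays off only when T_low / T_high is smaller than P_high / P_low.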
The proposed energy prediction tool is particularly useful for supporting this type of decision.

Chapter 6
Related Work

This chapter presents an overview of related work in the area of energy consumption modeling and explains how this work differs.

Power state based modeling. Previous work uses coarse-grained power-state-based energy modeling in different contexts. Costa et al. [25] proposed using a machine’s idle power, peak CPU load power, peak I/O power, and the extra energy spent on hash computation to model the data deduplication tradeoff. Ibtesham et al. [35] present a coarse-grained energy consumption model for rollback and recovery mechanisms. It uses three power profiles: application running, checkpoint compression, and checkpoint commit. In contrast, this thesis targets a distributed setup and enables exploration of a richer configuration space.

Resource utilization based modeling. Economou et al. [30] present a non-intrusive method for modeling full-system power consumption based on the idle power and utilization metrics: CPU utilization, off-chip memory access count, hard disk I/O rate, and network I/O rate. Fan et al. [32] use CPU utilization as the main indicator for estimating the power usage of individual machines. For the application domain this work targets, it is hard to obtain the resource utilization in a minimal system setup and to predict the values at large scale.

System events based modeling. SimplePower [53], Soft-Watt [39], and Mambo [47] provide low-level analytical models tied to architectural events and use simulations to predict power consumption. These more granular models, however, typically lead to a longer prediction time and are often tightly coupled to the underlying architecture. Additionally, some parameters of the models are hard to obtain unless one executes the application in every possible configuration.

Simulation-based modeling. Similar to the approach proposed in this thesis, past work uses simulation instead of purely analytical modeling.
For instance, OMNeT++ [52] provides general and accurate network modeling, while DiskSim [7] can model storage devices at the hardware level. These tools could be modified and integrated to build a detailed simulator. However, because they simulate at a low component level, they often lack a fast time to solution. The approach proposed in this thesis achieves reasonable accuracy while remaining lightweight, and provides a simple and effective way to seed the model [27].

Time-series data based statistical modeling. Samak et al. [46] present an approach to analyze large power consumption datasets from computing infrastructures. They use Pig data processing on Hadoop for computing statistics and aggregating data. Then the R framework is used to consume the time-series data and derive hourly and daily power consumption prediction models. The main goal of their work is to provide a processing pipeline for large datasets and to predict the consumption of whole infrastructures at a coarse level, while this thesis aims at providing a framework for accurate consumption prediction for specific workloads and at evaluating configuration choices and energy-performance tradeoffs.

Analytical modeling of distributed applications. Some work uses analytical models to evaluate the energy efficiency of distributed scientific applications. Feng et al. [34] developed a predictor for profiling the power and energy characteristics of parallel scientific applications. Their results suggest that, for a fixed problem size of some workloads, increasing the number of nodes always increases energy consumption but does not always improve performance, which is also observed in our evaluation. Ge et al.
[37] have a similar scope to this thesis and use a two-step approach: (1) they develop analytical models for both the energy and performance profiles of parallel scientific workloads, and estimate the performance and energy costs for varying configurations (e.g., number of cores, CPU frequencies); (2) they explore the configuration space to find the optimal setting. The main difference is that Ge et al. [37] focus on parallel applications, while this work focuses on the distributed storage layer of workflow applications that have various patterns and many more I/O operations. For performance modeling, they extend Amdahl’s law to derive speedups, while this work uses discrete-event simulation to obtain the runtime and achieves higher accuracy. Pakin et al. [42] focus on evaluating the energy savings when DVFS is enabled. They present normalized energy and performance models using compute-boundedness and normalized frequencies, and evaluate energy prediction with various CPU frequencies. This work additionally models the I/O operations in memory and over the network to get better accuracy. Unlike Pakin et al. [42], who use compute-boundedness in developing the models, Freeh et al. [36] use CPU criticality to evaluate the energy and performance tradeoff.

Distributed storage modeling. Some works focus on evaluating the energy efficiency of the distributed storage layer. EEffSim [43] presents a configurable energy simulator for modeling multi-server storage systems. It can model the energy consumption of multiple energy-saving techniques, including write offloading, opportunistic spin-down, and heterogeneous storage devices. However, the accuracy of the simulator for real-world applications is not evaluated.

Chapter 7
Conclusion and Future Work

7.1 Conclusion

This thesis presents an energy consumption predictor for estimating the energy usage of workflow-based applications.
The accuracy of the predictor is evaluated using synthetic benchmarks that represent common data access patterns and two real-world workflow applications. Additionally, the predictor is evaluated on its ability to support choosing between different storage configurations (a default configuration, DSS, and a workflow-optimized configuration, WOSS), configuration decisions for processor frequency scaling, and resource provisioning decisions. Overall, the experiments demonstrate that the predictor is a low-cost, time-efficient tool for evaluating power-tuning techniques that target a multitude of scenarios and success metrics (e.g., energy, energy-delay product).

As described in the Preface, the energy prediction work presented in this thesis is one part of my contributions to the MosaStore research project. I also contributed to the general system development, the performance prediction and provisioning project, the cross-layer optimization project, and the project on supporting scientific data in the cloud.

7.2 Future Work

Future work will improve the proposed energy prediction mechanism in multiple directions: (i) improve the accuracy of the energy prediction by addressing the points described in Chapter 5, (ii) explore a richer space of configuration choices (e.g., replication level, chunk size, other power-tuning techniques), (iii) explore platforms that have different energy proportionality, and (iv) apply different optimization criteria (e.g., dollar cost, joules per task).

Bibliography

[1] Amazon Elastic Compute Cloud (Amazon EC2). http://
[2] Fuse: Filesystem in userspace.
[3] IBM BlueGene/P (BG/P), 2008.
[4] New Database Manager (NDBM) library, Berkeley.
[5] SPEC, The SPEC Power Benchmark.
[6] Stress.
[7] DiskSim.
[8] General Parallel File System.
[9] Lustre file system.
[10] The Green Grid.
[11] Michael Abd-El-Malek, William V. Courtright II, Chuck Cranor, Gregory R. Ganger, James Hendricks, Andrew J. Klosterman, Michael P.
Mesnier, Manish Prasad, Brandon Salmon, Raja R. Sambasivan, Shafeeq Sinnamohideen, John D. Strunk, Eno Thereska, Matthew Wachs, and Jay J. Wylie. Ursa minor: Versatile cluster-based storage. In Proc. of the Conf. on File and Storage Technologies, Dec. 2005.
[12] Samer Al-Kiswany, Lauro Beltrão Costa, Hao Yang, Emalayan Vairavanathan, and Matei Ripeanu. A Cross-Layer Optimized Storage System for Workflow Applications. In submission to IEEE Transactions on Parallel and Distributed Systems (TPDS), 2014.
[13] Samer Al-Kiswany, Lauro Beltrão Costa, Hao Yang, Emalayan Vairavanathan, and Matei Ripeanu. A Software Defined Storage for Scientific Workflow Applications. In Preparation, 2014.
[14] Samer Al-Kiswany, Abdullah Gharaibeh, and Matei Ripeanu. The Case for a Versatile Storage System. SIGOPS Oper. Syst. Rev., 44:10–14, March 2010.
[15] Samer Al-Kiswany, Matei Ripeanu, and Sudharshan S. Vazhkudai. A checkpoint storage system for desktop grid computing. CoRR, abs/0706.3546, 2007.
[16] Samer Al-Kiswany, Emalayan Vairavanathan, Lauro B. Costa, Hao Yang, and Matei Ripeanu. The case for cross-layer optimizations in storage: A workflow-optimized storage system. CoRR, abs/1301.6195, 2013.
[17] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. Journal of molecular biology, 215(3):403–410, Oct. 1990.
[18] Luiz André Barroso and Urs Hölzle. The case for energy-proportional computing. Computer, 40(12):33–37, December 2007.
[19] Christian L. Belady. In the Data Center, Power and Cooling Costs more than the IT Equipment it Supports. February 2010.
[20] John Bent, Douglas Thain, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Miron Livny. Explicit control in a batch-aware distributed file system. In Proceedings of the 1st Conference on Symposium on Networked Systems Design and Implementation - Volume 1, NSDI’04, pages 27–27, Berkeley, CA, USA, 2004. USENIX Association.
[21] S. Bharathi, A. Chervenak, E. Deelman, G. Mehta, Mei-Hui Su, and K. Vahi. Characterization of scientific workflows. In Workflows in Support of Large-Scale Science, 2008. WORKS 2008. 3rd Workshop on, pages 1–10, 2008.
[22] Franck Cappello, Eddy Caron, Michel Daydé, Frédéric Desprez, Yvon Jégou, Pascale Primet, Emmanuel Jeannot, Stephane Lanteri, Julien Leduc, Noredine Melab, Guillaume Mornet, Raymond Namyst, Benjamin Quetier, and Olivier Richard. Grid’5000: a large scale and highly reconfigurable grid experimental testbed. In Grid Computing, 2005. The 6th IEEE/ACM Intl. Workshop on, pages 8 pp.+, 2005.
[23] Wu-chun Feng, Xizhou Feng, and Rong Ge. Green supercomputing comes of age. IT Professional, 10(1):17–23, 2008.
[24] Lauro Beltrão Costa, Samer Al-Kiswany, Abmar Barros, Hao Yang, and Matei Ripeanu. Predicting intermediate storage performance for workflow applications. In Proceedings of the 8th Parallel Data Storage Workshop, pages 33–38. ACM, 2013.
[25] Lauro Beltrão Costa, Samer Al-Kiswany, Raquel Vigolvino Lopes, and Matei Ripeanu. Assessing data deduplication trade-offs from an energy and performance perspective. In 2011 Intl. Green Computing Conf. and Workshops, 2011.
[26] Lauro Beltrão Costa, Samer Al-Kiswany, Hao Yang, and Matei Ripeanu. Supporting Storage Configuration and Provisioning for I/O Intensive Workflows. In submission to IEEE Transactions on Parallel and Distributed Systems (TPDS), 2014.
[27] Lauro Beltrão Costa, Samer Al-Kiswany, Hao Yang, and Matei Ripeanu. Supporting Storage Configuration for I/O Intensive Workflows. In 28th International Conference on Supercomputing (ICS2014), 2014.
[28] L. B. Costa, H. Yang, E. Vairavanathan, A. Barros, K. Maheshwari, G. Fedak, D. S. Katz, M. Wilde, M. Ripeanu, and S. Al-Kiswany. The Case for Workflow-Aware Storage: An Opportunity Study. In Journal of Grid Computing, 2014.
[29] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, January 2008.
[30] Dimitris Economou, Suzanne Rivoire, and Christos Kozyrakis. Full-system power analysis and modeling for server environments. In Workshop on Modeling Benchmarking and Simulation (MOBS), 2006.
[31] EPA. EPA Report to Congress on Server and Data Center Energy Efficiency. Technical report, U.S. Environmental Protection Agency, 2007.
[32] Xiaobo Fan, Wolf-Dietrich Weber, and Luiz André Barroso. Power provisioning for a warehouse-sized computer. In The 34th ACM International Symposium on Computer Architecture, 2007.
[33] Wu-chun Feng and Kirk Cameron. The green500 list: Encouraging sustainable supercomputing. Computer, 40(12):50–55, December 2007.
[34] Xizhou Feng, Rong Ge, and Kirk W. Cameron. Power and energy profiling of scientific applications on distributed systems. In Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS’05) - Papers - Volume 01, IPDPS ’05, pages 34–, Washington, DC, USA, 2005. IEEE Computer Society.
[35] Kurt Brian Ferreira, Dewan Ibtesham, David DeBonis, and Dorian Arnold. Coarse-grained Energy Modeling of Rollback/Recovery Mechanisms. Mar 2014.
[36] Vincent W. Freeh, David K. Lowenthal, Feng Pan, Nandini Kappiah, Robert Springer, Barry Rountree, and Mark E. Femal. Analyzing the energy-time trade-off in high-performance computing applications. IEEE Trans. Parallel Distrib. Syst., 18(6):835–848, 2007.
[37] Rong Ge, Xizhou Feng, and Kirk W. Cameron. Modeling and evaluating energy-performance efficiency of parallel processing on multicore based power aware systems. In IPDPS, pages 1–8. IEEE, 2009.
[38] Abdullah Gharaibeh, Samer Al-Kiswany, and Matei Ripeanu. Configurable security for scavenged storage systems. In Proceedings of the 4th ACM International Workshop on Storage Security and Survivability, StorageSS ’08, pages 55–62, New York, NY, USA, 2008. ACM.
[39] S. Gurumurthi, A. Sivasubramaniam, M. J. Irwin, N. Vijaykrishnan, and M. Kandemir. Using complete machine simulation for software power estimation: the softwatt approach. pages 141–150, Feb. 2002.
[40] A. C. Laity, N. Anagnostou, G. B. Berriman, J. C. Good, J. C. Jacob, D. S. Katz, and T. Prince. Montage: An Astronomical Image Mosaic Service for the NVO. In P. Shopbell, M. Britton, and R. Ebert, editors, Astronomical Data Analysis Software and Systems XIV, volume 347 of Astronomical Society of the Pacific Conf. Series, page 34, Dec 2005.
[41] Ketan Maheshwari, Justin Wozniak, Hao Yang, Daniel S. Katz, Matei Ripeanu, Victor Zavala, and Michael Wilde. Evaluating storage systems for scientific data in the cloud. In 5th Workshop on Scientific Cloud Computing (ScienceCloud), 2014.
[42] Scott Pakin and Michael Lang. Energy Modeling of Supercomputers and Large-Scale Scientific Applications. IEEE, 2013. In Proceedings for the IGCC 2013: International Green Computing Conference.
[43] Ramya Prabhakar, Erik Kruus, Guanlin Lu, and Cristian Ungureanu. Eeffsim: A discrete event simulator for energy efficiency in large-scale storage systems. In Energy Aware Computing (ICEAC), 2011 International Conference on, pages 1–6. IEEE, 2011.
[44] Ioan Raicu, Ian T. Foster, and Yong Zhao. Many-Task Computing for Grids and Supercomputers. In IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS08), 2008.
[45] Ioan Raicu, Zhao Zhang, Mike Wilde, Ian Foster, Pete Beckman, Kamil Iskra, and Ben Clifford. Toward loosely coupled programming on petascale systems. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC ’08, pages 22:1–22:12, Piscataway, NJ, USA, 2008. IEEE Press.
[46] Taghrid Samak, Christine Morin, and David H. Bailey. Energy consumption models and predictions for large-scale systems. In The Ninth Workshop on High-Performance, Power-Aware Computing, Boston, United States, 2013. IEEE.
[47] H. Shafi, P. Bohrer, J. Phelan, C. Rusu, and J. Peterson. Design and validation of a performance and power simulator for PowerPC systems. IBM Journal of Research and Development, 47(5):641–651, 2003.
[48] Takeshi Shibata, SungJun Choi, and Kenjiro Taura. File-Access Patterns of Data-Intensive Workflow Applications and their Implications to Distributed Filesystems. In Proc. of the 19th ACM Intl. Symp. on High Performance Distributed Computing, HPDC ’10, pages 746–755, 2010.
[49] Emalayan Vairavanathan. Towards a high-performance scalable storage system for workflow applications. Master’s thesis, The University of British Columbia, September 2014.
[50] Emalayan Vairavanathan, Samer Al-Kiswany, Lauro Beltrão Costa, Hao Yang, and Matei Ripeanu. MosaStore functional and design specification. 2012.
[51] Emalayan Vairavanathan, Samer Al-Kiswany, Lauro Beltrão Costa, Zhao Zhang, Daniel S. Katz, Michael Wilde, and Matei Ripeanu. A Workflow-Aware Storage System: An Opportunity Study. In Cluster Computing and the Grid, IEEE Intl. Symp. on, pages 326–334, 2012.
[52] A. Varga. Using the OMNeT++ Discrete Event Simulation System in Education. Education, IEEE Trans. on, 42(4), 1999.
[53] N. Vijaykrishnan, M. J. Irwin, H. S. Kim, and W. Ye. Energy-driven integrated hardware-software optimizations using simplepower. pages 95–106, 2000.
[54] Michael Wilde, Mihael Hategan, Justin M. Wozniak, Ben Clifford, Daniel S. Katz, and Ian T. Foster. Swift: A language for distributed parallel scripting. Parallel Computing, 37(9):633–652, 2011.
[55] Justin M. Wozniak and Michael Wilde. Case Studies in Storage Access by Loosely Coupled Petascale Applications. In Proc. of the 4th Annual Workshop on Petascale Data Storage, PDSW ’09, pages 16–20, 2009.
[56] Hao Yang, Lauro Beltrão Costa, and Matei Ripeanu. Energy Prediction for I/O Intensive Workflows. In submission to 7th Workshop on Many-Task Computing on Clouds, Grids, and Supercomputers (MTAGS) 2014 (Co-located with Supercomputing/SC 2014).
[57] Ustun Yildiz, Adnene Guabtni, and Anne H. H. Ngu. Towards Scientific Workflow Patterns. In Proceedings of the 4th Workshop on Workflows in Support of Large-Scale Science, WORKS ’09, pages 13:1–13:10, New York, NY, USA, 2009. ACM.
[58] Zhao Zhang, Allan Espinosa, Kamil Iskra, Ioan Raicu, Ian T. Foster, and Michael Wilde. Design and evaluation of a collective io model for loosely coupled petascale programming. CoRR, abs/0901.0134, 2009.
[59] Zhao Zhang, Daniel S. Katz, Michael Wilde, Justin M. Wozniak, and Ian Foster. Mtc envelope: Defining the capability of large scale computers in the context of parallel scripting applications. In Proceedings of the 22nd International Symposium on High-performance Parallel and Distributed Computing, HPDC ’13, pages 37–48, New York, NY, USA, 2013. ACM.

