EMBRACING DIVERSITY: OPTIMIZING DISTRIBUTED STORAGE SYSTEMS FOR DIVERSE DEPLOYMENT ENVIRONMENTS

by Samer Al-Kiswany
B.Sc., Jordan University of Science and Technology, 2003
M.A.Sc., The University of British Columbia, 2007

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in The Faculty of Graduate Studies (Electrical and Computer Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)
July 2013

© Samer Al-Kiswany, 2013

Abstract

Distributed storage system middleware acts as a bridge between the upper-layer applications and the lower-layer storage resources available in the deployment platform. Storage systems are expected to efficiently support the applications' workloads while reducing the cost of the storage platform. In this context, two factors increase the complexity of storage system design. First, application workloads are diverse along a number of axes: read/write access patterns, data compressibility, and security requirements, to mention only a few. Second, the storage system should provide high performance within a given dollar budget. This dissertation addresses two interrelated issues in this design space. First, can the computational power of commodity massively multicore devices be exploited to accelerate storage system operations without increasing the platform cost? Second, is it possible to build a storage system that can support a diverse set of applications yet can be optimized for each one of them? This work provides evidence that, for some system designs and workloads, significant performance gains are brought by exploiting massively multicore devices and by optimizing the storage system for a specific application. Further, my work demonstrates that these gains are possible while still supporting the POSIX API and without requiring changes to the application. Finally, while these two issues can be addressed independently, a system that includes solutions to both of them enables significant synergies.

Preface

The material in Chapters 2 and 3 of this dissertation has either been published or been submitted for publication. The author of this dissertation led all the projects presented here, performing most or all of the design, implementation, and evaluation. He also led the writing effort and co-authored the corresponding papers. For each chapter, this section lists the corresponding publications and details the role of the author of this dissertation.

Chapter 2: The findings of this chapter are reported in four publications. The author of this dissertation was the main contributor to this project and its publications, from the initial idea through system development, and collaborated on evaluating the system prototype. Abdullah Gharaibeh led the design, development, and evaluation of the CrystalGPU framework used in this project. The author of this dissertation led and collaborated on writing the papers. Related papers:
a) S. Al-Kiswany, A. Gharaibeh, M. Ripeanu, GPUs as Storage System Accelerators, IEEE Transactions on Parallel and Distributed Systems (TPDS), May 2012.
b) A. Gharaibeh, S. Al-Kiswany, S. Gopalakrishnan, M. Ripeanu, A GPU Accelerated Storage System, ACM/IEEE International Symposium on High Performance Distributed Computing (HPDC 2010), Chicago, IL, June 2010. (acceptance rate 25%)
c) S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, M. Ripeanu, On GPU's Viability as a Middleware Accelerator, Journal of Cluster Computing, Springer, Volume 12, Issue 2, Pages 123-140, 2009.
d) S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, G. Yuan, M. Ripeanu, StoreGPU: Exploiting Graphics Processing Units to Accelerate Distributed Storage Systems, ACM/IEEE International Symposium on High Performance Distributed Computing (HPDC 2008), Boston, MA, June 2008. (acceptance rate 17%)

Chapter 3: The findings of this chapter are published in eight papers. The author of this dissertation was the main contributor to the project idea and system design, and a key contributor to the system development. Emalayan Vairavanathan, Hao Yang, and Lauro Beltrão Costa integrated the system with the Swift and pyFlow schedulers, helped evaluate the system on the cluster and BG/P platforms, and contributed substantially to the system development. Further, Emalayan Vairavanathan led the workflow-based preliminary study summarized in this chapter. The author of this dissertation led and collaborated on writing the papers. Related papers:
a) S. Al-Kiswany, E. Vairavanathan, L. B. Costa, H. Yang, M. Ripeanu, The Case for Cross-Layer Optimizations in Storage: A Workflow-Optimized Storage System, submitted to the ACM/IEEE International Symposium on High Performance Distributed Computing (HPDC 2013), 2013.
b) L. B. Costa, A. Barros, E. Vairavanathan, S. Al-Kiswany, M. Ripeanu, Predicting Intermediate Storage Performance for Workflow Applications, submitted to the IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2013.
c) S. Al-Kiswany, E. Vairavanathan, A. Barros, L. B. Costa, H. Yang, G. Fedak, D. Katz, M. Wilde, M. Ripeanu, A Case for Workflow-Aware Storage: An Opportunity Study using MosaStore, Journal of Future Generation Computer Systems, September 2012.
d) E. Vairavanathan, S. Al-Kiswany, L. Costa, Z. Zhang, D. Katz, M. Wilde, M. Ripeanu, A Workflow-Aware Storage System: An Opportunity Study, International Symposium on Clusters, Cloud, and Grid Computing (CCGrid '12), Ottawa, Canada, May 2012. (selected as one of the best papers and fast-tracked for journal publication)
e) A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, ThriftStore: Finessing Reliability Tradeoffs in Replicated Storage Systems, IEEE Transactions on Parallel and Distributed Systems (TPDS), Volume 22, Issue 6, Pages 910-923, 2010.
f) S. Al-Kiswany, A. Gharaibeh, M. Ripeanu, A Case for Versatile Storage System, Workshop on Hot Topics in Storage and File Systems (HotStorage), October 2009. Also appeared in ACM SIGOPS Operating Systems Review, 44, 1, January 2010. (acceptance rate 20%)
g) E. Santos-Neto, S. Al-Kiswany, N. Andrade, S. Gopalakrishnan, M. Ripeanu, Beyond Search and Navigability: Custom Metadata Can Enable Cross-Layer Optimizations in Storage Systems, ACM/IEEE International Symposium on High Performance Distributed Computing (HPDC) - Hot Topics Track, Boston, MA, June 2008.
h) S. Al-Kiswany, M. Ripeanu, S. Vazhkudai, A. Gharaibeh, stdchk: A Checkpoint Storage System for Desktop Grid Computing, International Conference on Distributed Computing Systems (ICDCS '08), Beijing, China, June 2008. (acceptance rate 16%)

Summary of other projects conducted during my PhD study

During my PhD study I collaborated on a number of projects with researchers from IBM Research, NEC Laboratories, Argonne National Laboratory, Oak Ridge National Laboratory, the Microsoft Azure team, and students in the NetSysLab research group at UBC.
The following is a brief summary of the main projects.

• VMFlock ([31] and patents [21, 28]): This project was conducted in collaboration with a team of researchers at the IBM Almaden research lab. It proposes VMFlockMS, a migration service optimized for cross-datacenter transfer and instantiation of groups of virtual machine (VM) images that together comprise an application-level solution (e.g., a three-tier web application). VMFlockMS employs two main techniques: first, data deduplication within the VMFlock and among the VMs across the source and destination data centers; second, a workload-informed VM instantiation technique that prioritizes data migration based on VM access history. VMFlockMS provides an incrementally scalable, high-performance migration service. The evaluation shows that VMFlockMS can reduce the data volume transferred over the network to as low as 3% of the original VMFlock size, enables the complete transfer of the VM images belonging to a VMFlock over a transcontinental link up to 3.5x faster than alternative approaches, and enables booting these VM images with as little as 5% of the compressed VMFlock data available at the destination.

• A Cloud-based Data Management Platform ([27] and patent [26]): This project was conducted in collaboration with a research team at NEC Labs. It starts from the observation that enabling mobile applications to seamlessly share user data can bring significant competitive advantages and enable a rich set of application features. For instance, a coupon recommendation application can better recommend coupons to a user if it knows her current location, the places she frequently visits, or the places her friends (on the social network) frequently visit. The project addresses the two main challenges facing the design of a cloud data-sharing service: first, it builds a sharing framework that can maintain hundreds of sharing agreements and meet their service level agreement (SLA) requirements while controlling the infrastructure costs; second, it provides an SLA exploration tool. The sharing framework can support SLAs with different staleness, accuracy, and cost, creating a complex three-dimensional configuration space. While this 3D space gives the application developer high flexibility, selecting an SLA is a confusing and challenging task. The project casts this problem as a multi-objective optimization problem and builds an intuitive SLA exploration tool.

• Deduplication Energy Impact [55]: This project was conducted in collaboration with Lauro Costa from the NetSysLab research lab, UBC. It evaluates the energy tradeoffs brought by data deduplication in distributed storage systems. Depending on the workload, deduplication can lower the storage footprint, reduce the I/O pressure on the storage system, and reduce network traffic, at the cost of increased computational overhead. From an energy perspective, data deduplication thus trades the energy consumed for additional computation against the energy saved by lower storage and network load. The experimental evaluation shows that, while for non-energy-proportional machines performance- and energy-centric optimizations have break-even points that are relatively close, for the newer generation of energy-proportional machines the break-even points differ significantly. An important consequence of this difference is that, with newer systems, there are higher energy inefficiencies when the system is optimized for performance.
• Secure Aggregate Storage System [75]: This project was conducted in collaboration with Abdullah Gharaibeh from NetSysLab, UBC. It studies and designs security mechanisms for storage systems that scavenge storage space from untrusted storage nodes (e.g., users' desktop machines). However, selecting the security level and designing the security mechanisms for such systems is challenging, as scavenging idle storage opens the door to security threats absent in traditional storage systems that use dedicated nodes under a single administrative domain. Moreover, increased security often comes at the price of performance and scalability. This work develops a general threat model for systems that use scavenged storage, presents the design of a protocol that addresses these threats and is optimized for throughput, and evaluates the overheads brought by the new security protocol when configured to provide a number of different security properties.

• GPU Support for Batch-Oriented Workloads [56]: This project was conducted in collaboration with Lauro Costa from the NetSysLab research lab, UBC. It explores the ability to use Graphics Processing Units (GPUs) as co-processors to harness the inherent parallelism of batch operations in systems that require high performance. To this end, we chose Bloom filters (space-efficient data structures that support a probabilistic representation of set membership), as the queries these data structures support are often performed in batches. Bloom filters exhibit a low computational cost per unit of data, providing a baseline for more complex batch operations. We implemented BloomGPU, a library that offloads Bloom filter operations to the GPU, and evaluated this library under realistic usage scenarios. By completely offloading Bloom filter operations to the GPU, BloomGPU outperforms an optimized CPU implementation of the Bloom filter as the workload becomes larger.

Table of Contents
Abstract ..................................................................................................................................... ii Preface...................................................................................................................................... iii Table of Contents ................................................................................................................... viii List of Tables .......................................................................................................................... xii List of Figures ........................................................................................................................ xiv Acknowledgments..................................................................................................................
xxi 1  Introduction and Overview .................................................................................................1 1.1  StoreGPU: Harnessing the Computational Power of Multicore Platforms .............2  1.1.1 Contributions............................................................................................................4 1.2  Cross Layer Optimizations in Storage Systems .......................................................4  1.2.1 Contributions............................................................................................................6 1.3  Synergy between the Two Projects ..........................................................................7  1.4  Research Platform: The MosaStore Storage System ...............................................8  1.4.1 MosaStore Architecture ...........................................................................................8 1.4.2 MosaStore-Based Projects .......................................................................................9 1.5 2  Dissertation Structure.............................................................................................10  StoreGPU: Harnessing the Computational Power of Multicore Platforms ......................11 2.1  Context ...................................................................................................................11  2.1.1 Chapter Structure ...................................................................................................13 2.2  Contributions..........................................................................................................13  2.3  Background: Use of Hashing in Storage Systems .................................................16  2.4  Background: GPU-Related Background ................................................................18  2.4.1 GPU Performance Factors .....................................................................................19 2.5  System Design .......................................................................................................21  2.5.1 Design and Integration Challenges ........................................................................21 2.5.2 A GPU Accelerated Storage System Prototype .....................................................24 2.5.2.1 MosaStore...................................................................................................... 24 2.5.2.2 HashGPU....................................................................................................... 25 2.5.2.2.1 HashGPU API.......................................................................................... 25 2.5.2.2.2 Design of the Direct Hashing Module ..................................................... 26 viii  2.5.2.2.3 The Sliding Window Hashing Module .................................................... 27 2.5.2.2.4 Optimized Memory Management ............................................................ 28 2.5.2.2.5 Other Optimizations ................................................................................ 28 2.5.2.3 CrystalGPU ................................................................................................... 29 2.5.2.4 Integration and System Prototype ................................................................. 
31 2.6  Experimental Evaluation........................................................................................32  2.6.1 HashGPU Evaluation .............................................................................................33 2.6.1.1 Synthetic Benchmarks ................................................................................... 33 2.6.1.1.1 Experiment Design .................................................................................. 33 2.6.1.1.2 Experimental Results ............................................................................... 37 2.6.1.1.3 Dissecting the Overheads ........................................................................ 41 2.6.1.2 Application Level Performance .................................................................... 46 2.6.2 Evaluating the Performance Gains Enabled by CrystalGPU .................................47 2.6.3 Add a CPU or a GPU? ...........................................................................................51 2.6.4 Integrated System Evaluation ................................................................................51 2.6.5 What Would the System Performance be if we had Infinite Compute Power? .....57 2.6.6 The Impact on Competing Applications ................................................................58 2.6.7 Evaluation Summary..............................................................................................63  3  2.7  Related Work .........................................................................................................64  2.8  Discussion ..............................................................................................................65  2.9  Conclusions ............................................................................................................67  Cross-Layer Optimizations in Storage Systems ...............................................................69 3.1  Context ...................................................................................................................70  3.1.1 Chapter Structure ...................................................................................................72 3.2  Contributions..........................................................................................................72  3.3  Background ............................................................................................................74  3.3.1 Data Access Patterns of Workflows ......................................................................76 3.4  The First Preliminary Evaluation Study: Workflow Optimized Storage System: a Limit Study ...........................................................................................................79  3.4.1 Hacks: Customizing MosaStore............................................................................80 3.4.2 Experimental Results .............................................................................................81 ix  3.4.2.1 Micro Benchmark: The Impact of Locality................................................... 81 3.4.2.2 Synthetic Benchmarks ................................................................................... 84 3.4.2.2.1 Experiment Setup .................................................................................... 84 3.4.2.2.2 Pipeline Pattern Evaluation ..................................................................... 
86 3.4.2.2.3 Broadcast Pattern Evaluation................................................................... 88 3.4.2.2.4 Reduce Pattern Evaluation....................................................................... 90 3.4.2.2.5 Scatter Pattern Evaluation ....................................................................... 92 3.4.3 Summary ................................................................................................................93 3.5  The Second Preliminary Evaluation Study: A Checkpoint Storage System for Desktop Grid Computing .......................................................................................93  3.5.1 Context ...................................................................................................................94 3.5.2 Design Considerations for a Checkpoint Storage System .....................................97 3.5.2.1 Checkpoint I/O Workload Characteristics .................................................... 98 3.5.2.2 Design Goals ................................................................................................. 98 3.5.3 Stdchk: A Checkpoint-Friendly Storage System ...................................................99 3.5.3.1 System Architecture ...................................................................................... 99 3.5.3.2 Write Optimizations for High Throughput.................................................. 102 3.5.3.3 Support for Incremental Checkpointing ...................................................... 103 3.5.3.4 Support for Automated, Time-Sensitive Data Management ....................... 105 3.5.3.5 Providing a Traditional File System Interface ............................................ 105 3.5.4 Evaluation ............................................................................................................107 3.5.4.1 Platform Characterization ........................................................................... 107 3.5.4.2 Write Throughput ........................................................................................ 108 3.5.4.3 Effect of Configuration Parameters............................................................. 110 3.5.4.4 Write Performance on a 10Gbps Testbed ................................................... 111 3.5.4.5 Incremental Checkpointing: Feasibility and Performance .......................... 112 3.5.4.6 Stdchk Scalability........................................................................................ 116 3.5.4.7 Putting Everything Together ....................................................................... 
118 3.5.5 Stdchk Related Work ...........................................................................................118 3.5.6 Summary ..............................................................................................................120 3.6  System Design .....................................................................................................121 x  3.6.1 Design Requirements ...........................................................................................121 3.6.2 Storage System Design ........................................................................................122 3.6.3 Prototype Implementation Details .......................................................................124 3.6.4 Integration with a Workflow Runtime System ....................................................126 3.7  Evaluation ............................................................................................................127  3.7.1 Testbeds ...............................................................................................................128 3.7.2 Synthetic Benchmarks .........................................................................................128 3.7.3 Simple Real Applications: BLAST, ModFTDock ...............................................133 3.7.4 A Complex Workflow: Montage .........................................................................136 3.7.5 Exploring WOSS Overheads/Gains .....................................................................138  4  3.8  Related Work .......................................................................................................139  3.9  Discussion and Summary.....................................................................................143  Summary and Impact......................................................................................................147  References ..............................................................................................................................151  xi  List of Tables Table 1. The processing stages of a GPU task. ....................................................................... 21 Table 2. The list of factors considered in the experiments and their respective levels. Note that the sliding-window hashing module has extra parameters. ........................... 34 Table 3. Online similarity detection throughput (in MBps) and speedup using SHA1. ......... 46 Table 4. Online similarity detection throughput (in MBps) and speedup using MD5. .......... 47 Table 5. Popular workflow data access patterns. Circles represent computations. An outgoing arrow indicates that data is produced (to a temporary file) while an incoming arrow indicates that data is consumed (from a temporary file). There may be multiple inputs and outputs via multiple files. (Notation similar to that used by Wozniak et al. [155]). Arrows are labeled with extended attribute API calls used to pass hints to enable the optimizations. (The corresponding hints are presented in detail in Table 12) ............................................................................. 78 Table 6. File sizes for different workflow patterns ................................................................. 86 Table 7. Time to write a 1 GB file. ....................................................................................... 108 Table 8. Characteristics of the collected checkpoints. .......................................................... 113 Table 9. 
Comparison of similarity detection heuristics. The table presents the average rate of detected similarity and the throughput in MB/s (in brackets) for each heuristic.114 Table 10. The effect of m and k on CbCH no-overlap performance. The table presents the ratio of detected similarity (in percentage), the heuristic’s throughput in MB/s, the average resulting checkpoint size in KB, and the average minimum and maximum chunk sizes (Values for m in bytes and for k in bits) .......................................... 115 Table 11. The execution time and volume of generated data for BLAST application checkpointing to local disk and stdchk. .............................................................. 118 Table 12. Implemented metadata attributes (hints) and the corresponding optimizations ... 126 Table 13. File sizes for different workflow patterns. ............................................................ 129 Table 14. Generated network trafic (GB) ............................................................................. 132 Table 15. Average BLAST execution time (in seconds) for NFS, DSS and various replication levels controlled in WOSS. ................................................................................. 135 Table 16. The characteristics of each stage for the Montage workflow ............................... 137 xii  Table 17. WOSS microbenchmark. ...................................................................................... 139 Table 18. Survey of related projects. The table compares WOSS with current approaches on number of axes .................................................................................................... 142  xiii  List of Figures Figure 1. Research summary..................................................................................................... 2 Figure 2. GPU Schematic Architecture .................................................................................. 18 Figure 3. System architecture. At the client node, the FUSE Linux module is used to implement a user-level filesystem (MosaStore SAI). The Linux virtual file system (VFS) calls, through FUSE, user-defined callbacks, that implement MosaStore file system functionality. Note that the GPU is needed only on the client machines. Figure 6 presents in detail the structure of the MosaStore SAI that integrates the GPU. ....................................................................................................................... 24 Figure 4. Direct hashing module architecture. The blocks with circular arrows represent the standard hashing kernel. Stages numbers correspond to Table 1. .......................... 27 Figure 5. Sliding window hashing module architecture. The blocks with circular arrows represent the standard hashing kernel. Stage numbers correspond to Table 1. ...... 27 Figure 6. MosaStore SAI architecture and operations flow. The data written by the application is stored in a buffer. When the buffer is full, the content addressability module submits to CrystalGPU a request to execute the HashGPU sliding window (SW) hashing kernel on this data. The resulting sliding window hashes are used to detect the blocks boundaries. Upon detecting these boundaries, the content addressability module submits a second computation to CrystalGPU that uses, this time, the HashGPU direct hashing kernel to compute block hashes. 
Once these are received the content addressability module uses the blocks’ hashes to detect similarity and decide if any of the blocks are new and need to be transferred to the storage nodes. All buffer management and HashGPU kernel calls are managed by CrystalGPU. The dashed ovals and arrows represent operations that are only executed in content based chunking content addressability, and the double lined arrow represents the flow of operation in fixed size blocks content addressability. ................................................................................................................................ 30 Figure 7. Speedups for MD5 direct hashing module for fully optimized GPU implementations running on GeForce 8600 and 8800, and multithreaded CPU implementations harnessing all available cores. The value x=1 separates the speedup (right) from the slowdown values (left). .................................................. 36 xiv  Figure 8. HashGPU speedup for the MD5 implementation of the direct hashing module. The shaded bars use GeForce 8800 GTX GPU. ............................................................ 37 Figure 9. HashGPU speedup for the SHA1 implementation of the direct hashing module. The shaded bars use GeForce 8800 GTX GPU. ............................................................ 37 Figure 10. HashGPU sliding-window hashing module speedup for MD5. Window = 20 bytes, offset = 4 bytes, reduced hash. ............................................................................... 39 Figure 11. HashGPU sliding window hashing module speedup for SHA1. Window = 20bytes, Offset = 4bytes, Reduced hash................................................................. 39 Figure 12. HashGPU sliding-window hashing module speedup for MD5. Window size=52 bytes. Offset=52 bytes. ........................................................................................... 40 Figure 13. HashGPU sliding-window hashing module speedup for SHA1. Window size=52 bytes. Offset=52 bytes. ........................................................................................... 40 Figure 14. Stage 1 (preprocessing and initialization) execution time for MD5 direct hashing module of HashGPU. ............................................................................................. 42 Figure 15. Stage 2 (input data transfer) execution time of MD5 direct hashing module with and without using pinned memory feature. ............................................................ 42 Figure 16. Kernel execution time for MD5 direct hashing module with/without shared memory optimization enabled. ............................................................................... 43 Figure 17. Execution times for MD5 direct hashing module data transfer operation from the GPU global memory to the host memory with and without using pinned memory feature. .................................................................................................................... 44 Figure 18. Execution times for the last stage – the hash aggregation. .................................... 44 Figure 19. Percentage of total execution time spent on each stage when none of the optimizations are enabled. ...................................................................................... 45 Figure 20. Percentage of total execution time spent on each stage with pinned and shared memory optimizations enabled. ............................................................................. 45 Figure 21. 
Percentage of total HashGPU sliding window hashing execution time spent on each stage without any optimization. ..................................................................... 48 Figure 22. Percentage of total HashGPU direct hashing module execution time spent on each stage without any optimization............................................................................... 48  xv  Figure 23. Achieved speedup for sliding window hashing for a stream of 10 jobs. Small data sizes in the left figure (logarithmic scale on Y axis), and large data size in the right figure (linear scale on Y axis). Values below 1 indicate slowdown. The baseline is the performance on a single core on Intel Xeon Quad-core 2.33 GHz. The numbers over the “Dual socket CPU” and “Overlap, Buffer reuse” data points indicate the achieved processing throughput in MBps. ............................................................. 50 Figure 24. Achieved speedup for direct-hashing for a stream of 10 jobs. Small data sizes in the left figure (logarithmic scale on Y axis), and large data size in the right figure (linear scale on Y axis). Values below 1 indicate slowdown. The baseline is the performance on a single core on Intel Xeon Quad-core 2.33 GHz. The numbers over the “Dual socket CPU” and “Overlap Buffer reuse” data points indicate the achieved data processing throughput in MBps....................................................... 50 Figure 25. Average throughput while writing 40 different files back-to-back in the fixed block configuration. Note the logarithmic scale on the y-axis. .............................. 54 Figure 26. Average throughput while writing 40 different files back-to-back in the content based chunking configuration. Note the logarithmic scale on the y-axis. .............. 54 Figure 27. Average throughput while writing the same file 40 times back-to-back in the fixed block configuration. Next section discusses the CA-Infinite configuration. Note the logarithmic scale on the y-axis. .............................................................................. 55 Figure 28. Average throughput while writing the same file 40 times back-to-back in the content based chunking configuration. Next section discusses the CA-Infinite configuration. Note the logarithmic scale on the y-axis. ........................................ 56 Figure 29. Average throughput while writing 100 BLAST checkpoints back-to-back while varying the block size. “Fixed” denoted evaluation with the fixed block size configuration while “CB” denotes the content based chunking configuration. The content based chunking configuration was tuned to produce average chunks sizes close to the sizes indicated on x-axis. The numbers on top of the bars denote the average similarity percentage detected using the configuration. ........................... 56 Figure 30. (Left) MosaStore average achieved throughput under the different workload while running a competing compute intensive application. (Right) Competing application slowdown (the lower the better). ............................................................................ 60 xvi  Figure 31. (Left) MosaStore average achieved throughput under the ‘similar’ workload while running a competing compute intensive application. (Right) Competing application slowdown (the lower, the better). ........................................................................... 60 Figure 32. 
(Left) MosaStore average achieved throughput under the ‘checkpoint’ workload while running a competing compute intensive application. (Right) Competing application slowdown (the lower, the better). ........................................................ 61 Figure 33. (Left) MosaStore's average achieved throughput while running an I/O intensive application with the different workload. (Right) Competing application slowdown (lower is better). ..................................................................................................... 62 Figure 34. (Left) MosaStore's average achieved throughput while running an I/O intensive application with the similar workload. (Right) Competing application slowdown (lower is better). ..................................................................................................... 62 Figure 35. (Left) MosaStore's average achieved throughput while running an I/O intensive application with the checkpoint workload. (Right) Competing application slowdown (lower is better). .................................................................................... 63 Figure 36. Usage scenario and high-level architecture. The workflow optimized storage system (WOSS) aggregates the storage space of the compute nodes and is used as an intermediate file-system. Input/output data is staged in/out from the backend storage. The workflow scheduler queries WOSS for data location to perform location-aware scheduling. The scheduler submits tasks to individual compute nodes and includes hints on the data usage patterns. ............................................. 76 Figure 37. I/O throughput when the storage node is backed by spinning disk (left plot) and RAMdisk (right plot). For each plot there are two sets of columns presenting the write and, respectively, the read performance. Note that the axes use different scales in the two plots. Figures represent average throughput, and standard deviation in error bars, over 30 reads/writes. ......................................................... 83 Figure 38.Summary of synthetic benchmarks for pipeline, broadcast, reduce, and scatter patterns. Nodes represent workflow stages (or stage-in/out operations) and arrows represent data transfers through files. Labels on the arrows represent file sizes for the ‘small’ workload. The other workload sizes are presented in Table 6. ........... 85  xvii  Figure 39. Pipeline pattern – small files. Average execution time (in seconds) for small file sizes. Error bars represent standard deviation for all stages of the workflow (the entire experiment time). ......................................................................................... 87 Figure 40. Pipeline pattern – medium files. Average execution time (in seconds) for mediumsize file. Error bars represent standard deviations for the entire experiment. ........ 87 Figure 41. Pipeline pattern large files. Average execution time (in seconds) for large file sizes. ....................................................................................................................... 87 Figure 42. Average execution time for broadcast synthetic benchmark with medium workload. All storage systems are deployed on spinning disks. ............................ 89 Figure 43. Average execution time for broadcast synthetic benchmark with large workload. All storage systems are deployed on spinning disks. ............................................. 90 Figure 44. Breakdown of broadcast benchmark for the ‘medium’ workload. 
It shows the time to create requested replicas and the actual workflow time separately.................... 90 Figure 45: Reduce pattern. Average benchmark execution time (in seconds). ...................... 91 Figure 46: Scatter pattern large files. Average execution time (in seconds) and standard deviation for the scatter stage of the benchmark (large file sizes) ......................... 93 Figure 47. File system call path through FUSE. ................................................................... 107 Figure 48. The average observed application bandwidth (OAB) for three write optimized techniques: complete local writes (CLW), incremental writes (IW), and sliding window (SW). For comparison the figure also shows: the throughput of writing to the Local-I/O, to local I/O through the FUSE module (FUSE), and to a dedicated NFS server (NFS) running on the same node....................................................... 109 Figure 49. The average ASB for complete local writes (CLW), incremental writes (IW), and sliding window (SW)............................................................................................ 109 Figure 50 The OAB for the sliding window write for different number of benefactors and allocated buffer size (in MB)................................................................................ 110 Figure 51. The ASB for sliding window writes for different number of benefactors and allocated buffer sizes (in MB). ............................................................................. 111 Figure 52. The OAB and ASB of the sliding-window interface varying the stripe width for a testbed of 10Gbps client and manager and four 1Gbps benefactors. ................... 112  xviii  Figure 53. The observed application bandwidth (OAB, left plot) and the achieved storage bandwidth (ASB, right plot) for the sliding window with and without incremental checkpointing supported by fixed block compare-by-hash. ................................. 116 Figure 54. stdchk throughput at larger scale: 7 clients generate a synthetic workload to stress a stdchk pool supported by 20 benefactor nodes. ................................................. 117 Figure 55. stdchk throughput at larger scale: up to 560 clients, running a synthetic MPI application, checkpoint to a stdchk pool supported by 35 donor nodes. .............. 118 Figure 56.The main components of a distributed storage system (also used by WOSS). .... 122 Figure 57. WOSS metadata manager design. For clarity, the figure shows WOSS integration with a workflow runtime engine (Scheduler) and details WOSS metadata manager. The figure shows: (i) in solid lines, the path followed by a client chunk allocation request: the request is processed by a pattern-specific data placement ‘DP’ module based on the corresponding file tags/hints, (ii) the data path as data is produced by the clients (the solid lines going to storage nodes), and, (iii) the path of a request to retrieve file location information (dashed lines). ................................................. 124 Figure 58. Pipeline, broadcast, reduce, and scatter synthentic benchmarks. Nodes represent workflow stages (or stage-in/out operations) and arrows represent data transfers through files. Labels on the arrows represent file sizes. Horizontal dashed lines represent the crossing boundary between backend and intermediate storage (e.g., stage-in reads a file from the backend and writes to the intermediate storage). .. 129 Figure 59. Average time for pipeline benchmark. 
................................................................ 131 Figure 60. Average times for broadcast pattern. ................................................................... 131 Figure 61. Average runtime for reduce benchmark .............................................................. 131 Figure 62. Average runtime for stage 2 of scatter benchmark .............................................. 131 Figure 63. modFTDock workflow. Labels on arrows represent the tags used ...................... 134 Figure 64. Average runtime for modFTDock on cluster. The experiment runs 9 pipelines in parallel using 18 nodes. ........................................................................................ 134 Figure 65. modFTDock runtime on BG/P while varying the number of nodes allocated to the application. The workload size increases proportionally with the resource pool. .............................................................................................................................. 135 Figure 66. BLAST workflow. The BLAST database (1.8GB) is used by all nodes that search different queries in parallel. Labels on arrows represent the tags used to hint the data usage patterns................................................................................................ 135 Figure 67. Montage workflow. The characteristics of each stage are described in Table 15. Labels on arrows represent the tags used to hint the data usage patterns. ............. 137 Figure 68. Montage workflow execution time (averages over four experiments). Note that, to better highlight the differences, the y-axis does not start at zero. ......................... 138

Acknowledgments

In my life I have been lucky to meet a few people who had a significant impact on my life and character. I find myself speechless when I try to find words to describe my gratitude to them. My adviser, Matei Ripeanu, is at the top of this very short list. His continuous support, attention to developing my research skills, and emphasis on pursuing my research interests helped shape me as a researcher. Further, his help during many critical moments throughout my study was essential to continuing my PhD journey. But his leadership by personal example, in academia and in life, had the largest impact on my character. While I will always be indebted to Matei, I find some relief in following my father's advice: "In life you will meet people who will help you and you will not find a way to return their favors; the only way to pay them back is to follow their example and help others". If I prove to be a good researcher and adviser, the credit is Matei's. I would like to thank Sathish Gopalakrishnan for his support, fruitful collaboration, and mentorship. I am grateful to my mentors, Prasenjit Sarkar, Karthik Pattabiraman, Sameh Elnikety, and Mihai Budiu, for their insightful advice and guidance in making my career decisions. I am grateful to my dear friends in Vancouver, especially Abdullah and Fatimah Gharaibeh, Omar and Hamza Khasawneh, Muna Kris, Anas and Assem Bsoul, Zaid and Mohammad Azaizeh, Omar Trad, and Ameer Abdelhadi, for their friendship, support, and continuous encouragement. I am grateful to the Networked Systems research group students, Emalayan Vairavanathan, Lauro Costa, Hao Yang, Elizeu Santos-Neto, Bo Fang, Yazan Boshmaf, and Mohammad Afrasiabi, for their friendship, fruitful collaborations, and insightful comments.
Throughout my graduate studies I faced exceptionally challenging situations; it would have been impossible to continue without the encouragement and support of my family: my parents, Ishaq and Nabilah, and my brothers, Khaled and Bilal. Their unconditional love and confidence in my abilities are deeply appreciated. Many thanks to my parents-in-law for their affection and the confidence they have placed in me. Continuing my study would have been impossible without the support of my wife, Dima. Her continuous support and encouragement, her ability to reignite my enthusiasm for my work when it faded, and her telling me that I have to do the dishes if I do not work on the dissertation were critical for keeping me on course and completing my study. Last, I thank my one-year-old daughter, Jana, for bringing joy to our home. I thank her for dancing when I come home, for keeping me company through the nights, and for insisting on helping me write this dissertation with random keystrokes. Any typos in this dissertation are her contribution!

DEDICATION

To My Parents, Wife, Brothers, and Daughter

Chapter 1
1 Introduction and Overview

Distributed storage system middleware acts as a bridge between the upper-layer applications and the lower-layer storage resources available in the deployment platform. Storage systems are expected to efficiently support the applications' workloads while reducing the cost of the storage platform. In this context, two factors increase the complexity of storage system design. First, application workloads are diverse along a number of axes: read/write access patterns, data compressibility, and security requirements, to mention only a few. Second, the storage system should provide high performance within a given dollar budget.

This dissertation addresses two interrelated issues in this design space. First, can the computational power of commodity massively multicore devices be exploited to accelerate storage system operations without increasing the platform cost? (Section 1.1). Second, is it possible to build a storage system that can support a diverse set of applications yet can be optimized for each one of them? (Section 1.2). My work provides evidence that, for some system designs and workloads, significant performance gains are brought by exploiting massively multicore devices and by optimizing the storage system for a specific application. Further, my work demonstrates that these gains are possible while still supporting the POSIX API and without requiring changes to the application. Finally, while these two issues can be addressed independently, a system that includes solutions to both of them enables significant synergies (discussed in Section 1.3).

I explore these questions and the possible synergies using a single prototype implementation: the MosaStore storage system (Section 1.4). The decision to base my exploration on a single prototype has two advantages. First, it enables demonstrating the synergies between the two research directions. Second, it focuses the development effort on a single storage system prototype, leading to a more stable codebase that can be used by others as a research platform; Section 1.4.2 surveys a number of MosaStore-based projects. Figure 1 summarizes the objectives of this dissertation, outlines the dissertation structure, and shows the links between my publications and each of the projects.
[Figure 1. Research summary. For each research question (Chapter 2: Can the GPU's computational power be exploited to accelerate storage system operations? Chapter 3, Sections 3.1 and 3.2: Is it possible to build an application-optimized storage system that can still support a diverse set of applications?), the figure lists the software artifacts produced (the HashGPU library and StoreGPU storage system; the workflow-optimized storage system) and the corresponding publications, together with the MosaStore research platform (Section 1.4), its purpose, and its deployment milestones.]

1.1 StoreGPU: Harnessing the Computational Power of Multicore Platforms

This project starts from the observation that a number of techniques that enhance distributed system reliability and/or performance (e.g., hashing, erasure codes [153], on-the-fly data similarity detection [110]) incur computational overheads that often hinder their effective use with today's commodity hardware. This project (detailed in Chapter 2) studies the viability of offloading these data-processing-intensive operations to massively multicore devices (GPUs in particular) in the context of storage systems.

The decision to use GPUs is motivated by several factors. Recent commodity GPUs like the NVIDIA GeForce GTX 480 (480 cores at 1.4 GHz) offer a ten-fold higher peak computational rate than Intel processors in the same price range. Further, the decision to use GPUs as coprocessors aligns with current technology trends: first, while multi-core CPUs do not yet offer this level of parallelism, they are expected to provide similar parallelism over the next decade and will be able to use some of the techniques prototyped in the GPU context; second, future massively multicore processors will likely extend the scope of heterogeneity already present in existing processors (e.g., AMD Fusion, Intel MIC, and IBM's Cell BE) and integrate heterogeneous cores and multiple execution models (e.g., SIMD/MIMD cores) [89].

This work highlights that this significant drop in the cost of computation creates the opportunity to redesign systems and to explore new ways to engineer them to recalibrate the cost-to-performance relation. This project highlights the dramatic change, often overlooked by system designers, in the systems' design space and demonstrates the significant benefits brought by exploiting GPU compute power. Further, it demonstrates a significant performance improvement from offloading in a domain thought to be a bad fit for GPU computing, namely data-intensive operations: after applying a set of optimizations, GPU offloading is viable even for data-intensive systems.
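To make this class of operations concrete, the sketch below shows, in simplified single-threaded C, one such data-intensive primitive: content-defined chunking, where a hash computed over a sliding window of the data selects chunk boundaries, so that identical regions of two files yield identical chunks that hash-based similarity detection can recognize. This is an illustration only; the window size, boundary mask, and per-window hash are arbitrary choices for this sketch, not the parameters or hash functions used by the prototype described next.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Illustrative sketch (not the HashGPU implementation): content-defined
 * chunking. A boundary is declared whenever the low bits of the window hash
 * match a fixed pattern, so boundaries depend on content, not on offsets.
 * A real implementation would use a rolling hash instead of recomputing
 * the window hash at every position.
 */
#define WINDOW    20         /* bytes hashed at each position */
#define MASK      0x0FFF     /* roughly 4 KB average chunk size */
#define MAX_CHUNK (64 * 1024)

static uint32_t window_hash(const uint8_t *p)
{
    uint32_t h = 0;
    for (int i = 0; i < WINDOW; i++)
        h = h * 31 + p[i];   /* stand-in for the per-window hash */
    return h;
}

/* Invokes cb(offset, length) for every chunk found in buf. */
static void chunk(const uint8_t *buf, size_t len,
                  void (*cb)(size_t, size_t))
{
    size_t start = 0;
    for (size_t i = 0; i + WINDOW <= len; i++) {
        size_t chunk_len = i + WINDOW - start;
        if ((window_hash(buf + i) & MASK) == 0 || chunk_len >= MAX_CHUNK) {
            cb(start, chunk_len);          /* emit chunk ending at this window */
            start = i + WINDOW;            /* next chunk starts after it */
            i += WINDOW - 1;               /* resume scanning at the new start */
        }
    }
    if (start < len)
        cb(start, len - start);            /* trailing chunk, if any */
}

static void print_chunk(size_t off, size_t n)
{
    printf("chunk at %zu, %zu bytes\n", off, n);
}

int main(void)
{
    static uint8_t data[256 * 1024];
    for (size_t i = 0; i < sizeof(data); i++)
        data[i] = (uint8_t)((i * 2654435761u) >> 24);  /* deterministic filler */
    chunk(data, sizeof(data), print_chunk);
    return 0;
}
```

In the prototype (see Figure 6 in Chapter 2), these two hashing passes, the sliding-window hashes that locate block boundaries and the direct hashes that identify each block, are the computations handed to the GPU; their cost is what dominates the CPU-only baseline.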
In terms of deliverables, this project proposes HashGPU (Section 2.5) [25], a library that transparently exploits the GPU's computational power to support the specialized uses of hashing found in content addressability, on-the-fly similarity detection, data integrity, and load balancing. The HashGPU library opens the possibility of efficiently incorporating these mechanisms into distributed storage systems, thereby unleashing a valuable set of optimization techniques. Further, this project presents the challenges and proposes the design changes a storage system needs in order to exploit the GPU's computational power. I have evaluated the performance of the HashGPU library independently, as well as a complete storage system prototype integrating HashGPU. The HashGPU library evaluation demonstrates that, depending on the workload, HashGPU can offer up to an 18x speedup compared to traditional CPU-based processing. This shifts the overhead tradeoff to a point where the above techniques can be effectively used in high-performance computing system middleware. The complete storage system evaluation under real workloads shows that this technique brings tangible performance gains to the application without negatively impacting the performance of other concurrently running applications (Section 2.6) [23, 74].

1.1.1 Contributions

This project contributes two significant findings. First, it demonstrates the significant change in the computing systems design spectrum caused by recent advances in computing devices: significant benefits can be reaped, and compute-intensive mechanisms become viable, when the computational power of multicore devices is exploited. Based on this project's findings, system designers have a wider design spectrum to consider in their designs. Second, contrary to the belief that GPU acceleration is only viable for compute-intensive applications, this project demonstrates the viability of employing GPUs to support data-intensive storage system services.

Further, this project makes four additional contributions. First, it explores and demonstrates the feasibility of exploiting the GPU to accelerate data- and compute-intensive primitives (hashing in particular). Second, by evaluating the performance of a complete system, it demonstrates the benefits of offloading a storage system's compute-intensive operations to massively multicore processors. Third, it sheds light on the impact of GPU offloading on competing applications running on the same node as the distributed storage middleware. Finally, it introduces techniques and develops open-source tools to efficiently leverage the processing power of GPUs. Section 2.2 discusses the contributions in detail.

1.2 Cross Layer Optimizations in Storage Systems

Today's large-scale computing systems (e.g., supercomputers, cloud computing infrastructures) aggregate thousands of computing nodes and offer ample computational power. These systems support large-scale scientific applications that generate an intense storage workload. For instance, data-intensive applications [9, 34] often use thousands of computing nodes to search through or analyze terabytes of stored data. For such applications, storage system throughput and scalability play a key role in overall application performance; the risk is that the I/O system becomes the bottleneck for an expensive set of compute resources.
Moreover, applications that use these compute resources are highly heterogeneous over multiple axes related to storage system usage patterns and required semantics. These include: access pattern, file sizes, read vs. write composition (e.g., read- or write-only access pattern), 4  data-lifetimes, durability requirements (e.g., some files can be recomputed), consistency requirements, data sharing levels (e.g., sometimes thousands of nodes concurrently access the same data), and security requirements (e.g., in terms of authentication, data integrity and confidentiality). Further, data is rarely shared between applications. A one-size-fits-all dedicated storage system that serves such diverse requirements while meeting the access requirements of data-intensive applications is particularly complex and costly. Moreover, this approach often introduces a performance or scalability bottleneck. An alternative approach is specialization: that is, exploiting workload characteristics to optimize the data store for application-specific usage patterns. Google file system [79] and PVFS [49] are only two examples of this approach: Google file system optimizes for large datasets and append-intensive access patterns, while PVFS optimizes for sequential read/write access to large datasets in HPC environments. This project (chapter 3) proposes an approach to specialize the storage system, at runtime, for the specific application using the system. In particular, this project proposes providing a cross layer communication channel through which applications and the storage system can exchange hints. On one side, applications can convey hints about their data usage patterns to the storage system. These hints can be used by the storage system to optimize its performance. On the other side, the storage system can provide applications with information only available at the storage layer (e.g., data placement, caching status, status of a replication process). This project proposes providing the cross layer communication channel through an elegant use of extended file system attributes. Extending the file system metadata to support user-defined custom metadata is an approach adopted by a number of storage systems [16, 18, 36, 95] mainly to improve search and navigability when dealing with large data volumes. In these systems, the user or the application (e.g., an mp3 player) defines custom attributes, not part of the standard file system file attributes, to help better define the file content (e.g., the album name for mp3 files). This project postulates that besides improving search and navigability, custom metadata can be used as a bidirectional communication channel between applications and the storage system [132]. The proposed approach falls in the category of ‘guided mechanisms’ (i.e., solutions for 5  applications to influence data placement, layout, and lifecycle), the focus of other projects as well. In effect, the wide range (and incompatibility!) of past such solutions proposed in the storage space in the past two decades (and incorporated to some degree by production systems - pNFS, PVFS, GPFS, Lustre, and other research projects [33, 37, 38, 70, 135, 142]), only highlights that adopting an unifying abstraction is an area of high potential impact. The novelty of this approach comes from the "elegant simplicity" of the solution proposed. 
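To make this channel concrete, the sketch below illustrates how an application could pass a hint through the standard extended-attribute calls already exposed by POSIX file systems, and read a storage-side hint back. It is a minimal illustration only: the attribute names, values, and mount point are hypothetical placeholders, not the tag vocabulary the prototype actually defines.

#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/xattr.h>   /* Linux extended-attribute API: setxattr()/getxattr() */

/* Illustrative only: attribute names and paths below are hypothetical. */
int main(void) {
    const char *path = "/mosastore/workflow/stage1.out";

    /* Application -> storage: declare the expected usage pattern for this file. */
    const char *hint = "write-once-read-once";
    if (setxattr(path, "user.hint.access_pattern", hint, strlen(hint), 0) != 0)
        perror("setxattr");

    /* Storage -> application: read back information only the storage layer knows,
       e.g., where the file's chunks were placed. */
    char location[256];
    ssize_t len = getxattr(path, "user.storage.chunk_location", location, sizeof(location) - 1);
    if (len >= 0) {
        location[len] = '\0';
        printf("chunks stored on: %s\n", location);
    }
    return 0;
}

Because both directions reuse calls that any POSIX application (or the standard command-line attribute tools) can already issue, no new API needs to be introduced to carry the hints.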
Unlike past work, the proposed approach maintains the existing API (predominantly POSIX compatible), and, within this API, uses the existing extended file attributes as a flexible, application-agnostic mechanism to pass information across the application/storage divide. This project demonstrates this approach by building a POSIX compatible, yet application-optimized, storage system to efficiently support the execution of scientific workflows (section 3.6). The project presents a general storage system architecture to support cross layer optimization. The prototype demonstrates that it is possible to achieve this goal without changing the application code or tasking the application developer to annotate their code to reveal the data usage patterns. The prototype evaluation (section 3.7) using synthetic benchmarks as well as three real-world workflows demonstrates that this design brings sizeable performance gains. The prototype achieves 30% to up to 2x higher performance under synthetic benchmarks and 20-30% application-level performance gain compared to a standard distributed storage system (i.e. non-optimized). 1.2.1 Contributions This project debunks the generally accepted principle of the necessity to abandon POSIX for higher performance [15, 81, 115, 139]. The project proposes an innovative approach that enables significant improvements, without abandoning POSIX. Further, unlike systems specialized for specific workloads, this work demonstrates that it is feasible to have a POSIX compatible storage system yet optimized for each application (or application mix) even if the application exhibits a heterogeneous data access pattern across files. The contribution of this project is twofold: First, this project proposes a new approach that uses custom metadata to enable cross-layer optimizations between applications and the storage system. Further, this project argues that an incremental adoption path exists for adopting this approach. Second, it designs and evaluates a storage system architecture 6  supporting the cross layer optimization approach. The design provides generic and extensible storage system building blocks that can be adopted to support a wide range of cross-layer optimizations. The resulting storage system is capable of using application hints to optimize the storage system operations at run time, and to provide specific storage system hints to the application. Finally, the project demonstrates using synthetic benchmarks as well as three real-world workflows that this design brings sizeable performance gains. Section 3.2 discusses the contributions in detail.  1.3 Synergy between the Two Projects While the two projects presented above address separate issues, a system that includes solutions to both of them enables significant synergies. In particular, the cross-layer optimizations project aims to enable runtime optimizations that better match the application or deployment platform characteristics to the storage system middleware, the StoreGPU is purely performance and efficiency focused and aims to exploit the platform hardware capabilities to enable efficient outsourcing of compute-intensive storage system optimizations. Using cross-layer communication applications can identify the optimizations and the files for which GPU accelerated optimizations should be applied to. The rest of this section presents two example scenarios demonstrating this synergy. A checkpointing optimized storage system. 
Checkpointing is an indispensable fault tolerance technique adopted by long-running applications. Consecutive checkpoint images present the application state at consecutive time steps and hence may have a high level of data similarity. Efficient similarity detection (a.k.a. data deduplication) may result in significant throughput gains and considerable savings in storage space and network effort. StoreGPU can be integrated with a cross-layer optimized storage system to provide an efficient checkpointing-specialized storage system. The application can hint which files are checkpoint files to enable checkpoint optimizations such as data deduplication, further this optimization can be offloaded to the GPU by StoreGPU, for higher deduplication throughput. Section 2.6.4 presents a prototype evaluation of exactly this scenario. Backup Storage Systems. Generic storage systems often allow the users to identify which files should be backed up. A cross-layer optimized system can allow the application/OS to both, identify which files need to be backed up, and classify the file content (e.g. media, database file, VM image/disk file, regular document, executable …etc). The storage system 7  can then exploit this information to optimize the storage of backed up files, for instance through compressing their content to reduce the space utilization. The storage system can accelerate the compression operation by GPU offloading. The GPU based compression module can implement a set of compression techniques that are optimized for certain data type (for instance, deduplication for large files, gzip compression for documents, or image compression for photos). Based on the application hint, the GPU based compression module can apply the compression technique that best fits the file content, enabling further performance gains. Finally, these synergies are further enabled by two factors: First, the solutions for the two issues my work addresses can be integrated into one system architecture. Second, the exploration of the two issues is based on a single codebase: the MosaStore storage system prototype (presented in the next section).  1.4 Research Platform: The MosaStore Storage System MosaStore (www.mosastore.net) is a storage system prototype that aggregates storage spaces from network connected nodes to provide a traditional file system abstraction. The MosaStore prototype serves two purposes: First, it serves as a platform for the explorations proposed in this proposal, and to demonstrate the synergies between the two projects. Second, The MosaStore prototype serves as a platform to support computational science research projects. So far, MosaStore has proved efficient in supporting a large scale scientific workflow application in test deployment on the BlueGene/P machine at Argonne National Lab [159] and on the Cray XT4 machine at Oak Ridge National Laboratory (section 3.5.4.6). As MosaStore is the platform for the two projects discussed in the next two chapters, I here present its architecture for ease of reference. The rest of this section briefly describes the MosaStore architecture, and lists the different projects using MosaStore. 1.4.1 MosaStore Architecture For the sake of completeness I briefly present here the MosaStore architecture. This architecture presents the base of my exploration in the two projects. The system components related to each project are discussed in the next two chapters. 
MosaStore integrates three components: a metadata manager, a number of donor nodes that contribute storage space to the system, and the client side system access interface.  8    The metadata manager maintains the entire system metadata (e.g., donor node status, file chunk distribution and dataset attributes). Similar to a number of other storage systems MosaStore decouples the metadata service from the stored data.    The donor nodes (a.k.a. benefactors) contribute storage space (memory or disk based) to the system. Donor nodes interact with the manager to publish their status via softstate registration, serve clients’ chunk store/retrieve requests, and perform garbage collection.    The system access interface (SAI) implements the mechanisms to access the storage space. From an application perspective, the SAI provides two methods to access the storage system: a mountable kernel module that supports the POSIX file system API and a storage system library that facilitates direct integration with the application.  Datasets are fragmented into smaller chunks that are striped across donor nodes for fast storage and retrieval. Data storage and retrieval operations are initiated by the client (the SAI). To retrieve a file, the SAI first contacts the metadata manager to obtain the chunk-map (i.e., the location of all chunks corresponding to the file). Then, the actual transfer of chunks occurs directly between the storage nodes and the SAI. The write operation follows a similar protocol. After the completion of the write operation, the new data chunks are replicated asynchronously in the system. 1.4.2 MosaStore-Based Projects MosaStore prototype was deployed in test deployments to support scientific research projects and was used as a research platform for a number of storage system research projects. The following is a brief list of projects using MosaStore: 1.  A large scale test deployment at ANL as an intermediate storage system to support bioinformatics applications running on 96K cores [159].  2.  A test deployment at Oak Ridge National Lab as a checkpointing storage system (section 3.5.4.6)  3.  A platform for exploring reliability tradeoffs in hierarchical storage systems [77, 78].  4.  A platform for building storage system auto-tuning mechanism [57].  5.  A platform for exploring data deduplication trade-offs from an energy perspective [55].  6.  A platform for building a secure aggregate storage system [75]. 9  7.  A backend storage system to support high throughput GridFTP deployments [20].  1.5 Dissertation Structure The rest of this dissertation presents the two projects. Chapter 2 presents the StoreGPU project, the system design, and the system evaluation. Chapter 3 motivates the cross layer optimization approach for storage systems, presents two preliminary evaluations studies, presents the system design, and the system evaluation with synthetics and real workloads. Finally, Chapter 4 summarizes the dissertation and highlights the contributions and the impact this dissertation had.  10  Chapter 2 2 StoreGPU: Harnessing the Computational Power of Multicore Platforms Massively multicore processors, such as Graphics Processing Units (GPUs), provide, at a comparable price, a one order of magnitude higher peak performance than traditional CPUs. 
This drop in the cost of computation, as any order-of-magnitude drop in the cost per unit of performance for a class of system components, triggers the opportunity to redesign systems and to explore new ways to engineer them to recalibrate the cost-to-performance relation. This project explores the feasibility of harnessing GPUs’ computational power to improve the performance, reliability, and security of distributed storage systems. In this context, this project presents the design of a storage system prototype that uses GPU offloading to accelerate a number of computationally intensive primitives based on hashing, and introduce techniques to efficiently leverage the processing power of GPUs. The project evaluates the performance of this prototype under two configurations: as a content addressable storage system that facilitates online similarity detection between successive versions of the same file and as a traditional system that uses hashing to preserve data integrity. Further, the evaluation evaluates the impact of offloading to the GPU on competing applications’ performance. The results show that this technique can bring tangible performance gains without negatively impacting the performance of concurrently running applications.  2.1 Context The development of massively multicore processors has led to a rapid increase in the amount of computational power available on a single die. There are two potential approaches to make the best use of this ample computational capacity available through massive parallelism: one is to design applications that are inherently parallel, while the other is to enhance the functionality of existing applications via ancillary tasks that improve an application’s behavior along dimensions such as reliability and security.  11  While it is possible to expose the parallelism of existing applications, a significant investment is needed to refactor them. Moreover, not all applications offer sufficient scope for parallelism. Therefore, in the context of this project, I look into the second approach and investigate techniques to enhance applications along non-functional dimensions. Specifically, this project starts from the observation that a number of techniques that enhance the reliability, scalability and/or performance of distributed storage systems (e.g., erasure coding, content addressability [120, 124], online data similarity detection [110], integrity checks, digital signatures) generate computational overheads that often hinder their use on today’s commodity hardware. I consider the use of Graphics Processing Units (GPUs) to accelerate these tasks, in effect using a heterogeneous massively multicore system that integrates different execution models (MIMD and SIMD) and memory management techniques (hardware and application managed caches) as the experimental platform. This project advocates the feasibility and evaluates the performance gains of building a GPU-accelerated storage system. This project starts by building a programming library, HashGPU, to accelerate hashing-based primitives which, although computationally demanding, are often used in storage systems (Section 2.5). Then quantifies the end-to-end benefits of integrating GPU-offloading in a complete storage system (Section 2.5). To this end, I have prototyped a distributed storage system which integrates the HashGPU library with the MosaStore content-addressable storage system (Section 2.5). 
Most of the integration challenges are addressed by CrystalGPU, a generic runtime layer that optimizes task execution on the GPUs (Section 2.5.2.3). The experimental evaluation (Section 2.6) demonstrates that the proposed architecture enables significant performance improvements compared to a traditional architecture that does not offload compute-intensive primitives. The design approach this project has advocated, that is, exploiting GPUs as storage system accelerators, has recently been adopted by others in the storage systems community, including a GPU accelerated encrypted storage system (PTask based system [127], GPUStore [143]), a deduplicated storage system (Shredder [44] P-Dedupe [156], and GHOST [94]), low cost RAID storage [93], and file matching service [106], erasure coding based reliable storage [149].  12  2.1.1 Chapter Structure The rest of this chapter first highlights this work contributions (section 2.2), presents background related to use of hashing in storage systems (section 2.3) and GPU technology related background (section 2.4). Then the chapter presents the StoreGPU system design (section 2.5) and evaluation (section 2.6), and surveys the related work (section 2.7). The chapter discusses related issues in (section 2.8) and concludes in (section 2.9).  2.2 Contributions Although this work focuses on hash-based primitives, I argue that the proposed approach can be extended to other computationally intensive routines that support today’s distributed systems like erasure coding [52], compact dataset representation using Bloom filters [56] or Merkel trees, as well as data compression [3], deduplication [31, 160], and encryption/decryption. To this end, parallel algorithms for these primitives exist, and can benefit from GPU support (e.g., parallel Reed-Solomon coding [61], parallel compression algorithm, and parallel security checking [111]). Additionally, offloading can support an active-storage design for specialized storage systems that focus on, for example, content-based image retrieval [63]. In this context, data parallel processing can be offloaded to the GPU to enable significant performance improvements, provided that the computation performed per byte of input data is sufficiently high to amortize the additional GPU overheads. It is worth noting that multiple computational cores have been used to improve the performance of some security checks such as on-access virus scanning, sensitive data analysis, and taint propagation [111]. In fact, without parallel execution, the overhead of executing some of these tasks is prohibitive. The spirit of this work is similar: without support for parallel execution, many storage system optimizations cannot be deployed due to their associated overheads. Recent commodity GPUs like the NVIDIA GeForce GTX 480 GPU (480 cores at 1.4 GHz) offer a ten-fold higher peak computational rate than Intel processors in the same price range. This drop in the cost of computation, as any order-of-magnitude drop in the cost per unit of performance for a class of system components, triggers the opportunity to redesign systems and to explore new ways to engineer them to recalibrate the cost-to-performance  13  relation. The evaluation study of a GPU-accelerated storage system shows that GPUs can be a cost-effective enhancement to high-end server systems. 
Further, using GPUs aligns with current technological trends: First, while multi-core CPUs currently do not offer this level of parallelism, they are expected to provide similar levels of parallelism over the next decade and support some of the techniques prototyped in a GPU context. Second, future massively multicore processors will likely extend the scope of heterogeneity already present in existing processors (e.g., AMD Fusion, Intel MIC, and IBM’s Cell BE); they will likely integrate heterogeneous cores, multiple execution models (e.g., SIMD/MIMD core blocks), and non-uniform memory architectures [89, 138]. An alternative is the use of application-specific integrated circuits (ASICs) as hardware accelerators. ASICs are an acceptable solution if every storage system were to use exactly the same schemes for verifying data integrity, compression, etc., and if these schemes never change. Improvements in algorithms and feature additions would require new ASICs that would, in turn, make other ASICs obsolete. The use of GPUs promotes flexibility and limits hardware waste. The contribution of this stream of work is six fold: Research contributions:   First, it demonstrates the viability of employing massively multicore processors, GPUs in particular, to support storage system services. To this end, the project evaluates, in the context of a content addressable distributed storage system, the throughput gains enabled by offloading hash-based primitives to GPUs. The evaluation provides a set of data points that inform the storage system designers’ decision whether exploiting massively multi-core processors to accelerate the storage system operations is a viable approach for particular workloads and deployment environment characteristics.    Second, this project empirically demonstrates, under a wide set of workloads and configurations, that exploiting the GPU computational power enables close to optimal system performance; that is, employing additional compute power to accelerate hashing computation will only result in minimal additional performance gains.    Third, this project sheds light on the impact of GPU offloading on competing applications running on the same node as the distributed storage middleware. The evaluation focuses on two issues in particular: First, while employing a GPU holds the 14  potential to accelerate computationally intensive operations, the need to transfer the data back and forth to the device adds a significant load on a shared critical system resource, the I/O subsystem. The evaluation experiments demonstrate that this added load does not introduce a new bottleneck in the system. Second, the evaluation quantitatively evaluates the performance impact on concurrently running compute- and IO-intensive applications. Technical contributions:   Fourth, this project builds and makes the HashGPU library available to the community. This library can be used to harness the computational power of GPUs with minimal modifications to current systems. Depending on its configuration and the target application’s data usage patterns, HashGPU enables significant performance gains. When comparing the performance enabled by a commodity GPU (the NVIDIA 8800GTX) and one core on a commodity CPU (Intel Core2 Duo 6600), HashGPU achieves up to 25-fold performance gains on not only synthetic benchmarks but also when supporting a high-level application.  
  Fifth, this work presents a minimal performance model that allows the estimation of a data-processing application’s performance on a given GPU model. The performance model can be used to evaluate whether modifying an application to exploit GPUs is worth the effort.  This work also presents a detailed analysis of the factors that  influence performance for a subset of applications and quantitatively evaluate their effect.   Finally, this project introduces techniques to efficiently leverage the processing power of GPUs. GPUs’ support for general-purpose programming (e.g., NVIDIA’s CUDA [14]) reduces the effort needed to develop applications that benefit from the massive parallelism GPUs offer. However, significant challenges to efficiently use GPUs remain. To address these challenges, this project designs and integrates CrystalGPU, an independent runtime layer that efficiently manages the interaction between the hosted application and the GPU. CrystalGPU transparently enables the reuse of GPU buffers to avoid memory allocation overheads, overlaps the data transfer with computation on the GPU, and enables the transparent use of multiple GPU devices.  15  2.3 Background: Use of Hashing in Storage Systems A number of primitives commonly used in distributed storage systems generate computational overheads that set them apart as potential bottlenecks in today’s systems, which increasingly employ multi Gbps links and/or are built on top of high throughput storage devices (e.g., SSDs [118]). For instance, on-the-fly data compression offers the possibility to significantly reduce the storage footprint and network effort at the expense of performing additional computations. However, the compression throughput can be prohibitively low (4MBps for bzip2 and 16MBps for gzip in my experiments on a 2.4Ghz Intel Core2 quad core processor) not only for high-performance computing scenarios but also for desktop deployments. Similarly, while encryption enables clients to securely store their data at an untrusted server, its throughput (between 38MBps and 157MBps in my experiments depending on the encryption algorithm used) can be lower than the client’s network/disk throughput. Similarly, erasure codes’ encoding throughput is limited by their computational complexity (for example, RAID6 systems using Reed Solomon codes achieve lower than 60MBps coding throughput using all cores of an Intel Core2 Quad processor [59]). This project aims to explore the viability of offloading these and other data processing intensive operations to a GPU to dramatically reduce the load on the source CPU and enhance overall system performance. In this context, the focus is on hashing-based primitives which are widely used in storage systems. This section highlights their use. Section 2.4 presents a brief overview of NVIDIA GPU’s. Hash-based primitives are commonly used in storage systems as building blocks to provide content addressability, data integrity checks, data similarity detection, compact set representation, and support for incremental computation (i.e. avoid processing regions of input that has already been processed). In this context, there are two main uses of hashing: Direct Hashing. This technique computes the hash of an entire data block. In systems that support content addressability [120, 124, 160], data blocks are identified based on their content: data-block identifiers are simply the hash value of the data. This approach provides a number of attractive features. 
First, it provides a flat namespace for data block identifiers and a naming system which simplifies the separation between file- and data block metadata. Second, it enables probabilistic detection of identical data-blocks by comparing their hash 16  value [30, 110, 120]. Finally, it provides implicit integrity checks. The hash computation, however, imposes overheads that may limit performance, especially in two common scenarios. First, for workloads with frequent updates, changing a few bytes in the data may require rehashing multiple data blocks. Second, in systems that use large blocks (e.g., GoogleFS [79] uses 64MB blocks), computational overhead of hashing becomes a nonnegligible component (possibly the largest) of a write/update operation. Content-based chunking. The techniques mentioned above require dividing large files into multiple blocks. To this end, two approaches are possible: fixed-size blocks, and detecting block boundaries (i.e., the markers for blocks’ start and end) based on content. This is done by hashing a large number of overlapping data windows inside the data-block and declaring a block boundary when the hash value matches a predefined value. LowBandwidth File System (LBFS) [110] and JumboStore [68] both adopt this latter technique, which is dubbed content-based chunking. When used for similarity detection, these two techniques offer a tradeoff between computational overheads and the similarity ratio detected: unlike fixed-size blocks, detecting block boundaries based on content is stable to data deletions/insertions, yet significantly more computationally intensive, hence leading to lower throughput. In fact, the limited throughput of content-based chunking is the main reason its original proponents [110] recommend its use in storage systems supported by low-bandwidth networks. On an Intel Core2 Duo 6600 2.40GHz processor, and using data from a checkpointing application, the content based chunking implementation offered 7 to 51MBps throughput, and similarity detection rates between 82% and 37% respectively, depending on the configuration. Throughput in this range is clearly a bottleneck when using Gbps network links or fast disks.  17  Figure 2. GPU Schematic Architecture  2.4 Background: GPU-Related Background This section presents a brief overview of GPU architecture, programming model, and typical application structure. While this work focuses on NVIDIA’s architecture and programming environment (the Compute Unified Device Architecture (CUDA) [14]) similar issues emerge with other GPU vendors or programming environments. At a high-level view (Figure 2), NVIDIA’s GPUs are composed of a number of SIMD multiprocessors. Each multiprocessor incorporates a small but fast shared memory (16 to 48KB). All processors in the multiprocessor have direct access to this memory. Additionally, all multiprocessors have access to three other device level memory modules: the global, texture, and constant memory modules, which are also accessible from the host. The global memory supports read and write operations and it is the largest (with size ranging from 256MB to 4GB). The texture and constant memories are much smaller and offer only read access to GPU threads. Apart from size, the critical characteristic differentiating the various memory modules is their access latency. While accessing the shared memory takes up to four cycles, it takes 400 to 800 cycles to access global memory. 
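To make this memory hierarchy concrete, the following minimal CUDA sketch, which is illustrative and not part of HashGPU, stages a tile of the input from slow global memory into the fast per-multiprocessor shared memory before operating on it; the tile size and the placeholder computation are assumptions made only for the example.

// Illustrative CUDA kernel: each thread block copies a tile of the input from
// global memory into on-chip shared memory, synchronizes, and then operates on
// the shared copy. TILE is sized to fit within the 16-48KB of shared memory
// available per multiprocessor.
#define TILE 4096   // bytes per block; an assumption for illustration

__global__ void process_tiles(const unsigned char *in, unsigned char *out, size_t n) {
    __shared__ unsigned char tile[TILE];            // per-multiprocessor shared memory
    size_t base = (size_t)blockIdx.x * TILE;

    // Cooperative copy: the threads of the block stride through the tile.
    for (int i = threadIdx.x; i < TILE && base + i < n; i += blockDim.x)
        tile[i] = in[base + i];
    __syncthreads();                                // tile fully staged before use

    // Placeholder computation on the shared copy (e.g., one hashing round).
    for (int i = threadIdx.x; i < TILE && base + i < n; i += blockDim.x)
        out[base + i] = tile[i] ^ 0x5a;
}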
Given this latency gap, to achieve maximum performance applications should maximize the use of shared memory and processor registers. Typically, a general-purpose application will first transfer its data from host (CPU) memory to the GPU global memory and then try to maximize the use of shared memory throughout the computation.
Programming efficient applications that exploit GPUs implies extracting the target application's parallelism and employing efficient memory and thread management techniques. Improper task decomposition, memory allocation, or memory transfers can lead to dramatic performance degradation. In particular, efficient use of the shared memory is challenging for three reasons. First, the shared memory is often small compared to the volume of data being processed. Second, the shared memory is divided into 16 banks, and all read and write operations that access the same bank are serialized, reducing concurrency; consequently, to maximize performance, an application should schedule concurrent threads to access data on different banks. The fact that a single bank does not represent a contiguous memory space increases the complexity of efficient memory utilization. Finally, increasing the number of threads per multiprocessor does not directly lead to a linear performance gain, although it may help hide global memory access latency; one reason is that increasing the number of threads decreases the amount of shared memory available per thread. The optimal resource usage configuration is tightly coupled with the application characteristics (e.g., the data access patterns) and the GPU hardware specifications (e.g., the number of registers per multiprocessor or the size of the available shared memory).
2.4.1 GPU Performance Factors
When using the GPU, an application passes through five main stages: preprocessing, host-to-GPU data transfer, processing, GPU-to-host result transfer, and post-processing. Table 1 describes these stages, identifies the main performance factors for each stage, and introduces the notation used throughout the rest of this chapter to model performance. (Note that not all applications will have the preprocessing or post-processing stages.) As Table 1 indicates, for a typical data-parallel application the processing step is repeated multiple times until all input data is processed. During each iteration, parts of the data are copied from global memory to each multiprocessor's shared memory and processed by the application 'kernel' before the results are transferred back to global memory. Thus, the runtime of a typical data-parallel application can be modeled as:

T_{Total} = T_{Preprocessing} + T_{DataHtoG} + T_{Processing} + T_{DataGtoH} + T_{PostProc}
          = T_{GPUInit} + T_{MemAlloc} + T_{PreProc} + T_{DataHtoG} + \frac{DataSize}{N \times SMSize} \times (T_{DataGtoS} + T_{Proc} + T_{DataStoG}) + T_{DataGtoH} + T_{PostProc}    (1)

where DataSize is the size of the application's data set, N is the number of multiprocessors, and SMSize is the size of the multiprocessor's shared memory. The parameters that influence the formula above (e.g., host-to-device memory transfer throughput, device global-to-shared memory throughput, initialization overheads) can either be benchmarked or found in the GPU data sheets. Equation 1 allows system designers to estimate the benefits of offloading processing to the GPU and to identify the parts of the application that need optimization.
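To illustrate how Equation 1 can be applied, the hedged sketch below estimates the end-to-end offloading time from per-stage parameters; the structure follows the equation, but all parameter values are assumed to come from benchmarks or data sheets for a specific GPU, and none of the names or numbers below are measurements reported in this dissertation.

/* Illustrative cost model following Equation 1. */
struct gpu_cost_params {
    double t_gpu_init;      /* device initialization (s) */
    double t_mem_alloc;     /* host/device buffer allocation (s) */
    double t_preproc;       /* task-specific host preprocessing (s) */
    double bw_host_to_dev;  /* host -> device global memory bandwidth (B/s) */
    double bw_dev_to_host;  /* device -> host bandwidth (B/s) */
    double t_gtos_per_iter; /* global -> shared memory copy per iteration (s) */
    double t_proc_per_iter; /* kernel processing per iteration (s) */
    double t_stog_per_iter; /* shared -> global memory copy per iteration (s) */
    double t_postproc;      /* task-specific host post-processing (s) */
    double n_multiproc;     /* N: number of multiprocessors */
    double sm_size;         /* SMSize: shared memory per multiprocessor (B) */
};

/* Estimated end-to-end time (seconds) to offload in_size bytes and bring
 * out_size bytes of results back (for hashing, out_size is much smaller). */
double estimate_offload_time(const struct gpu_cost_params *p,
                             double in_size, double out_size) {
    double iterations   = in_size / (p->n_multiproc * p->sm_size);
    double t_processing = iterations *
        (p->t_gtos_per_iter + p->t_proc_per_iter + p->t_stog_per_iter);

    return p->t_gpu_init + p->t_mem_alloc + p->t_preproc
         + in_size / p->bw_host_to_dev     /* T_DataHtoG */
         + t_processing                    /* T_Processing */
         + out_size / p->bw_dev_to_host    /* T_DataGtoH */
         + p->t_postproc;
}

Such an estimate can be compared against measured breakdowns, for example the 96MB hashing breakdown discussed below, to decide whether offloading is worthwhile and where to optimize.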
GPUs are known for their ability to accelerate number-crunching applications, but are less efficient when hashing large volumes of data. This is due not only to the overheads incurred when transferring large amounts of data to and from the device, but also to the fact that the various floating point units are not used. In fact, trivial data processing, such as a simple XOR between two data blocks, is faster on the CPU than on the GPU even for large amounts of data. While the GPU can perform computations at a huge theoretical sustained peak rate (46.4 giga-instructions per second, GIPS, for the NVIDIA 8600 GTS card), the data transfer from the host to the GPU is limited to 4GB/s, the theoretical maximum bandwidth of the PCIe 16x interface. To give the reader an intuition of how the various overheads interplay, the following is the time breakdown to hash a 96MB data block: transferring the data to the GPU takes 37.4ms (for an achieved throughput of 2.5GBps), hashing takes 41.8ms (using the four GT8600 multiprocessors), and copying the results back takes 1.0ms. Overall, in this configuration, the memory transfers represent over 48% of the execution time.
Table 1. The processing stages of a GPU task.
(1) Pre-processing: device initialization; memory allocation on the host and device; task-specific data preprocessing on the host.
(2) Data Transfer In: data transfer from the host's memory to the device global memory.
(3) Processing: loop through (3.1) data transfer from global GPU memory to shared memory (optional); (3.2) the task's 'kernel' processing; and (3.3) result transfer back to global memory.
(4) Data Transfer Out: output transfer from the device global memory to the host memory.
(5) Post-processing: task-specific post-processing on the host; resource release (optional).
2.5 System Design
This section first discusses the design challenges of a GPU-accelerated storage system (section 2.5.1), and then details the storage system design (section 2.5.2).
2.5.1 Design and Integration Challenges
Efficient offloading to an accelerator (e.g., a GPU) generally implies executing the accelerated primitive in a different memory space. Support for efficient data transfers and thread management techniques is crucial to preserve the benefits of offloading. Additionally, accelerator-specific concerns emerge: in the case of a GPU, extracting the target primitive's parallelism and, equally important, employing efficient memory and thread management techniques. Improper task decomposition, memory allocation and management, or data transfer scheduling can lead to dramatic performance degradation [129]. Apart from these performance-related issues, there are two additional areas to consider: minimizing the integration effort with existing code (in this case a file system) and preserving the separation of concerns between application-related issues, accelerator logic, and resource allocation issues related to the accelerator. Overall, seven areas of concern exist; they are presented below starting with global, system-level concerns that impact all layers of the proposed design, continuing with issues specific to the GPU runtime management system, and finalizing with issues specific to the accelerated function.
• Minimizing the integration effort. The volume of coding required to enable GPU offloading should be proportional to the potential performance gains.
Ideally, the primitives offloaded are stateless, which simplifies offloading as there is no operation state to be maintained across calls, and allows preserving the API and the semantics of the original CPU implementation. Even in this case, however, two factors add complexity: first, memory allocation, which must be done at the device level, and, second, the asynchronous nature of task execution management on the GPU as well as of other device management operations. The proposed design hides these complexities under simple, intuitive interfaces (section 2.5.2.4 presents the proposed solution, based on layering).
• Separation of concerns. The integrated design should preserve the separation between the logic of the accelerated primitive(s) and the runtime management layer required for efficient resource management on the accelerating device. The main driver for this requirement is facilitating the addition of new primitives that benefit from offloading through the same runtime layer. The proposed layered design (Figure 6, section 2.5.2.4) and the generic task-management functionality offered by the CrystalGPU layer (section 2.5.2.3) address this concern.
• Batch-oriented computation. A system adopting GPU offloading will deliver higher performance gains if it can support batch-oriented computation for the offloaded functions, as GPUs best support batch computations. This generally involves decoupling modules and making the calls to the accelerated primitives asynchronous. This change represents a significant challenge as it is often hard to obtain without aggravating the first two concerns: complicating integration or breaking the system's design.
• Hiding data transfer overheads. For streaming applications, that is, applications that send multiple GPU tasks back to back, overlapping the transfer of input data to the GPU with the computation step of a previous task can hide the data transfer overhead. The project explores these optimizations through the CrystalGPU layer (section 2.5.2.3).
• One-copy host-to-device data transfers. Data transfers to and from the device require DMA transfers from (respectively, to) non-pageable host memory. If an application presents data residing in pageable memory for transfer to the GPU, the CUDA runtime library first allocates a new buffer in non-pageable memory, copies the data to this new buffer, and finally launches the DMA transfer. Thus, to avoid these additional overheads (data copy and new buffer allocation), the application should present data in buffers allocated in non-pageable memory.
• Hiding memory allocation overheads. Since allocating non-pageable memory is an expensive operation, the application should reuse buffers to the extent possible. To this end, the CrystalGPU layer (section 2.5.2.3) incorporates a memory management module. This module offers a CrystalGPU-specific implementation of the malloc and free function calls and manages the non-pageable memory buffers used throughout the application's run (these buffers are allocated at application initialization).
• Simplifying the use of multiple GPUs in multi-GPU systems. CUDA provides only a low-level API for using multiple GPUs from the same application. Managing and balancing the load between multiple GPUs significantly increases application complexity. To circumvent this limitation, and to benefit from multi-GPU systems, CrystalGPU provides an API that enables transparent and balanced use of multiple GPUs (section 2.5.2.3).
• Efficient use of shared memory, the fastest, yet smallest, GPU memory module. Three reasons make the efficient use of shared memory challenging. First, shared memory is often small compared to the volume of data being processed. Second, shared memory is divided into banks, and all read and write operations that access the same bank are serialized (i.e., they generate bank conflicts), reducing concurrency; consequently, to maximize performance, concurrent threads should access data on different banks. The fact that a single bank does not represent a contiguous memory space increases the complexity of efficient memory utilization. Finally, while increasing the number of threads per multiprocessor helps hide global memory access latency, it does not directly lead to a linear performance gain; the main reason is that increasing the number of threads decreases the amount of shared memory available per thread. To use the shared memory efficiently without increasing programming complexity, HashGPU implements a shared memory management mechanism (section 2.5.2.2).
2.5.2 A GPU Accelerated Storage System Prototype
The prototyped distributed storage system (StoreGPU) integrates three main components: the MosaStore storage system, the HashGPU library, and CrystalGPU, the accelerator runtime management system. The integrated system is presented in Figure 3 and Figure 6. StoreGPU integrates MosaStore with HashGPU and CrystalGPU such that the compute-intensive hash-based processing required to support content addressability is outsourced to a GPU. The rest of this section details the design of each of these three components (sections 2.5.2.1 to 2.5.2.3) and the role each plays in the final system prototype (section 2.5.2.4).
2.5.2.1 MosaStore
The MosaStore [13] design is presented in section 1.4. This section summarizes the MosaStore design decisions related to the StoreGPU design. MosaStore can be configured to work as a content addressable storage system. MosaStore employs an object-based distributed storage system architecture (similar to GoogleFS). Its three main components are (Figure 3): a centralized metadata manager, the storage nodes, and the client's system access interface (SAI), which uses the FUSE [2] kernel module to provide a POSIX file system interface.
Figure 3. System architecture. At the client node, the FUSE Linux module is used to implement a user-level file system (the MosaStore SAI). The Linux virtual file system (VFS) calls, through FUSE, user-defined callbacks that implement the MosaStore file system functionality. Note that the GPU is needed only on the client machines. Figure 6 presents in detail the structure of the MosaStore SAI that integrates the GPU.
Each file is divided into blocks that are stored on the storage nodes. The metadata manager maintains a block map for each file, which contains the file's block information including the hash value of every block. The SAI implements the client-side content-based addressability mechanisms (which can be configured to use either fixed-size or content-based chunking). This generates the hashing workload that is traditionally executed on a CPU and can now be offloaded to the GPU via the HashGPU module. To write to MosaStore, the SAI first retrieves the file's previous-version block-map from the manager, divides the new version of the file into blocks, computes the hash value of each block, and searches the file's previous-version block-map for equivalent block hash values.
The SAI stores only the blocks with no match at the storage nodes, saving considerable storage space and network effort. Once the write operation is completed (as indicated by the release POSIX call) the SAI commits the file’s block map including the blocks’ hash values to the metadata manager. The architecture of the MosaStore SAI and its use of GPU offloading are presented in detail in Figure 6 and its legend. 2.5.2.2 HashGPU The design of HashGPU is driven by storage systems’ use of hashing as presented in section 2.3. This section presents HashGPU’s application programming interface (API) and a highlevel design overview. The evaluation section presents a number of performance-oriented design improvements. SHA1 (RFC 3174) and MD5 (RFC 1321), as well as most widely used hash functions, follow the sequential Merkle-Damgård construction approach [62, 104]. In this sequential approach, at each stage, one chunk of data is processed to produce a fixed size output. The output of each stage is used as an input to the following stage together with a new data chunk. This sequential construction does not allow multiple threads to operate concurrently in hashing the data. To exploit GPUs’ parallelism, HashGPU employs the parallel MerkleDamgård construction: the sequential hash function is used as a building block to process multiple segments of data in parallel on the GPU. The discussion section presents evidence that the hash function built is as strong as the original, sequentially built, hash function. 2.5.2.2.1 HashGPU API The HashGPU API is designed to correspond to the two main use cases presented in Section 2.3. Direct Hashing, i.e., hashing large blocks of data, with size ranging from kilobytes to megabytes or more. To address this scenario, the library provides the following interface:  25  char* SHA(char* DataBuffer,int DataBufferSize) char* MD5(char* DataBuffer,int DataBufferSize)  Sliding Window Hashing. As opposed to the first case, content-based detection of block boundaries requires hashing numerous small data blocks (sized from tens to hundreds of bytes). To address this usage pattern, the library provides the following interface: char* SHA(char* DataBuffer, int DataBufferSize,int WinSize,int Offset) char* MD5(char* DataBuffer, int DataBufferSize,int WinSize,int Offset)  This API returns an array of hashes, where each entry of this array is the result of hashing a window of data of size WinSize at shifting offset Offset. The rest of this section presents the two main modules of HashGPU with a focus on parallelizing hash computations. 2.5.2.2.2 Design of the Direct Hashing Module Figure 4 presents HashGPU’s direct hashing module design. Once input data is transferred from the CPU, it is divided into smaller blocks and, every small block is hashed. The result is placed in a single output buffer and, finally, the output buffer is hashed to produce the final hash value. Two aspects are worth mentioning. First, there are no dependencies between the intermediate hashing computations in Step 3.2 (Figure 4). Consequently, each computation can be executed in a separate thread. Second, this design uses the CPU to aggregate the intermediary hashes (Step 5). The reason is that synchronization of GPU threads across the blocks inside the GPU is not efficient.  26  Figure 4. Direct hashing module architecture. The blocks with circular arrows represent the standard hashing kernel. Stages numbers correspond to Table 1. 
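To clarify the structure sketched in Figure 4, the following hedged CUDA sketch, which is not the HashGPU implementation itself, hashes fixed-size chunks in parallel on the device and aggregates the intermediate digests on the host. The routines md5_chunk and md5_host are placeholders for a standard sequential MD5 implementation assumed to be available, and the chunk size is an assumption for the example.

#include <stdlib.h>
#include <cuda_runtime.h>

#define CHUNK_SIZE  (64 * 1024)   /* illustrative chunk size */
#define DIGEST_SIZE 16            /* MD5 digest length in bytes */

/* Placeholders for a standard sequential MD5 over one buffer. */
__device__ void md5_chunk(const unsigned char *data, size_t len, unsigned char *digest);
void md5_host(const unsigned char *data, size_t len, unsigned char *digest);

/* Step 3: one thread hashes one chunk; intermediate digests land in one buffer. */
__global__ void direct_hash_kernel(const unsigned char *data, size_t data_size,
                                   unsigned char *digests, size_t n_chunks) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_chunks) return;
    size_t offset = i * CHUNK_SIZE;
    size_t len = (offset + CHUNK_SIZE <= data_size) ? CHUNK_SIZE : data_size - offset;
    md5_chunk(data + offset, len, digests + i * DIGEST_SIZE);
}

/* Step 5 (host side): hash the concatenated intermediate digests; GPU-wide
 * synchronization across blocks is inefficient, hence the host-side step. */
void direct_hash(const unsigned char *dev_data, size_t data_size,
                 unsigned char *dev_digests, unsigned char *final_digest) {
    size_t n_chunks = (data_size + CHUNK_SIZE - 1) / CHUNK_SIZE;
    int threads = 128;
    int blocks = (int)((n_chunks + threads - 1) / threads);
    direct_hash_kernel<<<blocks, threads>>>(dev_data, data_size, dev_digests, n_chunks);

    unsigned char *host_digests = (unsigned char *)malloc(n_chunks * DIGEST_SIZE);
    cudaMemcpy(host_digests, dev_digests, n_chunks * DIGEST_SIZE, cudaMemcpyDeviceToHost);
    md5_host(host_digests, n_chunks * DIGEST_SIZE, final_digest);
    free(host_digests);
}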
2.5.2.2.3 The Sliding Window Hashing Module To parallelize the computation of a large number of small hashes drawn from a large data block, the module hashes in parallel all the small blocks and aggregate the result in a buffer. This module’s architecture is presented in Figure 5.  Figure 5. Sliding window hashing module architecture. The blocks with circular arrows represent the standard hashing kernel. Stage numbers correspond to Table 1. Each of the hash functions in Figure 5 can be executed in a separate thread since there are no dependencies between computations. The challenge in implementing this module lies in the memory management to extract maximum performance. Note that the input data is not  27  divided into smaller blocks as the previous case. The reason is that the input data for each thread may overlap with the neighboring threads. 2.5.2.2.4 Optimized Memory Management Although the design of the hashing module is relatively simple, optimizing for performance is a challenging task. For example, one aspect that induces additional complexity is maximizing the number of threads to extract the highest parallelism (around 100K threads are created for large blocks) while avoiding bank conflicts and maximizing the use of each processors’ registers. To address this issue, HashGPU includes a shared memory management subsystem that:   Reduces memory access latency by allocating to each thread a fixed-size workspace located in shared memory. Additionally, to avoid bank conflicts, the workspaces of threads that are co-scheduled are allocated on separate shared memory banks. When a thread starts, it copies its data from the global memory to its shared memory workspace, hence avoiding subsequent accesses to the slower global memory.    Reduces the complexity of the programming task. The workspace allocation technique just described makes programming more complex; since a shared memory bank is not mapped to a contiguous memory address space. To reduce programing complexity the memory management subsystem abstracts the shared memory to allow the thread to access its workspace as a contiguous address space.  In effect, the shared memory management subsystem reduces memory access latency by avoiding bank conflicts while reducing the programming effort by providing a contiguous memory address abstraction. Further, the shared memory management subsystem can bring benefits even to the latest GPUs as it performs application aware data placement to avoid bank conflicts. 2.5.2.2.5 Other Optimizations In addition to optimizing the shared memory usage, HashGPU considered two other optimizations: the use of pinned memory, and reducing the size of the output hash. Allocating and initializing the input data in host’s pinned memory (i.e., non-pageable memory) saves the GPU driver from an extra memory copy to an internal pinned memory buffer. In fact, the GPU driver always uses DMA (Direct Memory Access) from its internal pinned memory buffer when copying data from the host memory to the GPU global memory 28  [54]. Therefore, if the application allocates the input data in pinned memory from the beginning, it saves the driver from performing the extra copy to its internal buffer. However, allocating pinned memory adds some overhead since the kernel is involved in finding and preparing a contiguous set of pages before locking it. 
The system performance numbers do not show a pronounced effect for this overhead, since pinned memory buffers can be reused over subsequent library calls and thus this overhead is amortized. Additionally, HashGPU allows users to specify the size of the desired output hash. The rationale behind this feature is that, some applications such as block boundary for similarity detection only need the first few bytes of the hash value. 2.5.2.3 CrystalGPU Although HashGPU optimizes the processing of a single data block, there is still room for improvement. Let us consider a workload that is a stream of hashing computations. In this case, three opportunities for performance improvement exist: input/output buffer reuse to amortize buffer allocation overhead, ability to harness multiple GPUs on multi-GPU systems, and exploiting the CUDA2 [14] capabilities to overlap the data transfer of one block with the computation of a previous block. To exploit these opportunities, this work developed CrystalGPU, a standalone abstraction layer that exploits these three opportunities. CrystalGPU runs entirely on the host machine as a management layer between the application and the GPU native runtime system. CrystalGPU manages the execution of GPU operations (e.g., transferring data from/to the GPU and starting computations on the GPU) and includes a memory management layer that enables the reuse of non-pageable memory. At a high level, the metaphor CrystalGPU offers is that of a task-management environment for GPU tasks. The proposed design aims for (i) generality, to facilitate the support of a wide range of GPGPU applications, (ii) flexibility, to enable deployment on various GPU models while hiding the heterogeneity in task management across models, and (iii) efficiency, to maximize the utilization of GPU resources (processing units and I/O channels). These goals are achieved via an interface that is agnostic to the upper level application, and an internal design that avoids extra data copies and shared data structure bottlenecks. The detailed CrystalGPU design and API is presented in a technical report [76].  29  Figure 6. MosaStore SAI architecture and operations flow. The data written by the application is stored in a buffer. When the buffer is full, the content addressability module submits to CrystalGPU a request to execute the HashGPU sliding window (SW) hashing kernel on this data. The resulting sliding window hashes are used to detect the blocks boundaries. Upon detecting these boundaries, the content addressability module submits a second computation to CrystalGPU that uses, this time, the HashGPU direct hashing kernel to compute block hashes. Once these are received the content addressability module uses the blocks’ hashes to detect similarity and decide if any of the blocks are new and need to be transferred to the storage nodes. All buffer management and HashGPU kernel calls are managed by CrystalGPU. The dashed ovals and arrows represent operations that are only executed in content based chunking content addressability, and the double lined arrow represents the flow of operation in fixed size blocks content addressability. CrystalGPU design comprises a single driving module named the master. The master module employs a number of host-side manager threads, each assigned to manage one GPU device. The rationale behind this assignment is twofold. 
First, each manager thread is responsible for querying its device for job completion status, and asynchronously notifying the application, using the callback function, once the job is done. This allows the client to make progress on the CPU in parallel; further, it relieves the application from job execution state bookkeeping. Second, having a dedicated control thread for each GPU facilitates transparent multi-GPU systems support. As detailed below, the application submits a job to a shared outstanding queue, later the job is transparently handled by one of the manager threads in such a way that balances the load across GPUs. Note that this multithreaded design requires a multi-core CPU to enable maximum parallelism across the manager threads as well as the application’s host-side threads. The application requests services from the 30  framework by submitting jobs and waiting for callbacks. The status of a job is maintained by the master using three queues. First, the idle queue maintains empty job instances (with preallocated buffers). Second, the outstanding queue contains ready-to-run jobs submitted by the application, but not dispatched to the GPUs yet. Third, the running queue contains the jobs that are currently being executed on the GPUs. A manager thread is chosen in a round robin scheme to execute the next job in the outstanding queue. 2.5.2.4 Integration and System Prototype The system prototype (Figure 3 and Figure 6) integrates the three aforementioned components into a storage system able to exploit the GPU computational power to support content addressability. The integration entailed changes to the storage system design to efficiently exploit the GPU. Two GPU characteristics make harnessing their computational power challenging: the high data transfer overheads to/from the GPU, and the highly parallel GPU architecture. To match with these characteristics, workloads that best fit GPUs are ideally highly parallel (in the SIMD model) and involve a high computation-per-transferred-byte rate. Data-intensive stream-oriented workloads such as those generated by storage systems using content-based chunking are not an ideal match. Batching the streamed computations, however, can be used to mold these workloads to better exploit the GPU. The rest of this section discusses the main design and implementation changes for integrating the two GPU modules: fixed size blocks and content-based chunking. General Changes. In addition to the module-specific changes detailed below I modified the original implementation of HashGPU library [25] to use the GPU as virtualized by CrystalGPU. To this end, I applied two changes: allocated the data buffers through CrystalGPU, and retrofitted a hash computation as a CrystalGPU ‘task’ (A task is CrystalGPU’s abstraction for a unit of GPU computation and the associated data transfers). The HashGPU resulting library is able to harness optimizations available for a single hash operation as well as across a stream of hash operations. Fixed-Size Blocks Hashing Integration. As fixed-size block hashing operate at a relatively larger granularity (and internally HashGPU treats one block as one batch), changing MosaStore to efficiently exploit the GPU for fixed blocks is a relatively easy integration task. This modification required two straightforward changes: First, all hash 31  function calls needed to be changed to the new HashGPU-provided hash function (Since APIs are similar this effort is minor). 
Second, to facilitate optimized data transfers and buffer reuse during GPU calls, memory buffers needed to be allocated using the HashGPU library call instead of the standard memory allocation functions (e.g, malloc). The resulting storage system is able to offload the hash computation to the GPU. These changes required changing only 22 lines of the original MosaStore implementation (over 18K lines of code). I estimate that integrating GPU offloading with other storage systems that make use of fixed size blocks hashing is equally straightforward. Content-based Chunking Integration. Storage systems employ content-based chunking to detect block boundaries in newly written data; hence often sliding window hashes are computed on the data as it is written (often in chunk of 4KB in size). This design approach offers little opportunity for exploiting GPU power since the communication overhead is high and there is little parallelism to exploit. To better support GPU offloading I applied major changes to the MosaStore SAI to use content-based chunking in a batch oriented fashion. As data is written through the SAI to the file system the data is buffered. Once the buffer is full, it is submitted to HashGPU which computes the block boundaries of the buffered data. The newly discovered blocks are further processed by HashGPU direct hashing module for computing each block’s hash value. When block boundaries do not align with buffer sizes care must be taken to transfer the leftovers to first block that is detected in the next buffer. 2.6 Experimental Evaluation This section evaluates the performance benefits of using GPUs as storage system accelerators. The evaluation evaluates the components and integrated system performance using synthetic as well as real application workloads. This section starts by evaluating the HashGPU library (section 2.6.1) and highlighting the benefits brought by CrystalGPU when used in conjunction with HashGPU (section 2.6.2). Next, the evaluation investigates the following question: Given a fixed budget, which configuration can bring higher performance: adding an extra CPU or GPU to the system? (section 2.6.3). Finally, the evaluation focuses on the integrated system, and analyzes the performance improvements GPU support brings to the whole storage system (section 2.6.4); demonstrates that with the proposed configuration StoreGPU reaps almost all possible potential gains 32  additional computation can provide (section 2.6.5); and studies the impact of offloading to the GPU on concurrently running applications (section 2.6.6). 2.6.1 HashGPU Evaluation The evaluation evaluates HashGPU both with synthetic benchmarks (Section 2.6.1.1) and an application driven benchmark: similarity detection between multiple versions of the same file (Section 2.6.1.2). 2.6.1.1 Synthetic Benchmarks This section presents the performance and speedup delivered by HashGPU under a synthetic workload: it first compares GPU-supported performance with the performance for the same workload running on a commodity CPU. Next, this section investigates the factors that determine the observed performance. 2.6.1.1.1 Experiment Design The experiments are divided into two parts, each corresponding to the evaluation of one of the two usage scenarios of hashing described in Section 2.3 (i.e., Direct Hashing and Sliding Window Hashing). Table 2 summarizes the factors considered in the performance evaluation. Currently, HashGPU provides the implementation of two hashing algorithms: MD5 and SHA1. 
The data size variation is intended to expose the impact of memory copy overhead between the host and the GPU. Additionally, the sliding-window hashing technique introduces two specific parameters: the window and offset sizes.

Table 2. The factors considered in the experiments and their respective levels. Note that the sliding-window hashing module has extra parameters.

  Direct and Sliding-Window Hashing
    Factor              Levels
    Algorithm           MD5 & SHA1
    Data Size           4KB to 96MB
    Shared Memory       Enabled or Disabled
    Pinned Memory       Enabled or Disabled
  Sliding-Window Hashing only
    Window Size         20 or 52 bytes
    Offset              4, 20 or 52 bytes
    Reduced Hash Size   Enabled or Disabled

In particular, the evaluation explores the impact of the three performance optimizations presented in section 2.5.2.2: i) the optimized use of shared memory; ii) memory pinning; and iii) reduced output size. The experiments follow a factorial experimental design and evaluate the impact of each combination of factors presented in Table 2. For all performance data points, the results report the speedup computed from the average execution time collected over 40 runs. I confirmed that this number of runs is sufficient to guarantee an average speedup estimate with a confidence level of 95%. The following sections present a summary of these experiments.

The devices used in the performance analysis are: an Intel Core2 Duo 6600 2.40GHz processor (released late 2006), an Intel Core2 Quad Q6700 2.66GHz processor (released mid 2007), an NVIDIA GeForce 8600 GTS GPU (released mid 2007), and an NVIDIA GeForce 8800 GTX (released late 2006). The GeForce 8600 GTS GPU was installed on a machine with an Intel Core2 Duo 6600 2.40GHz processor, running Windows XP and the CUDA 1.0 driver and runtime library, while the GeForce 8800 GTX was installed on a machine with an Intel Core2 Duo E6850 3GHz processor, running Linux 2.6.24 and the CUDA 2.0 driver and runtime library. Note that, in all cases, the HashGPU implementation uses out-of-the-box hash function implementations. These original implementations are single-threaded and use only one core of the Intel processor; when used on a multi-core architecture, the experiments execute multiple instances of the original hash function implementation in parallel. For this reason, the rest of this section uses the performance of the original single-threaded implementation running on a single core of the Intel Core2 Duo 6600 2.40GHz processor as the baseline to compute and compare the speedups achieved by the HashGPU configurations. To offer a more comprehensive perspective on the achieved performance, the evaluation also reports speedups when using multiple traditional cores. To avoid clutter, this section does not report performance on all the platforms used for all experiments. Figure 7, however, offers a first indication that HashGPU provides better performance than an optimized multi-threaded CPU implementation that harnesses all cores of a traditional (e.g., Intel) multicore architecture and is tuned for maximum parallelism. Although the multithreaded CPU implementation performs better for small data sizes, as the data size increases (i.e., for data larger than 1MB) HashGPU achieves much higher speedups. It is worth noting that, even though not presented here, the sliding window module and the SHA1 algorithm show similar behavior. This comparison highlights that, for a comparable (or even lower) price, a GPU offers much higher performance than a high-end CPU.
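To make the multicore baseline concrete, the sketch below shows one way such a multithreaded CPU implementation can be structured: the input is split into per-core chunks and each chunk is hashed by an independent instance of the single-threaded hash routine. This is a minimal illustration, not the exact benchmark code; toy_hash() is a placeholder standing in for the real MD5/SHA1 implementation.

/* Minimal sketch of a multithreaded CPU hashing baseline. */
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

#define NTHREADS 4

struct chunk { const uint8_t *data; size_t len; uint32_t digest; };

static uint32_t toy_hash(const uint8_t *p, size_t n) {    /* stand-in for MD5/SHA1 */
    uint32_t h = 2166136261u;                             /* FNV-1a, illustration only */
    for (size_t i = 0; i < n; i++) { h ^= p[i]; h *= 16777619u; }
    return h;
}

static void *worker(void *arg) {
    struct chunk *c = (struct chunk *)arg;
    c->digest = toy_hash(c->data, c->len);
    return NULL;
}

/* Hash 'len' bytes of 'data' using NTHREADS parallel hash instances. */
void parallel_hash(const uint8_t *data, size_t len, uint32_t digests[NTHREADS]) {
    pthread_t tid[NTHREADS];
    struct chunk c[NTHREADS];
    size_t per = len / NTHREADS;
    for (int i = 0; i < NTHREADS; i++) {
        c[i].data = data + (size_t)i * per;
        c[i].len  = (i == NTHREADS - 1) ? len - (size_t)i * per : per;
        pthread_create(&tid[i], NULL, worker, &c[i]);
    }
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(tid[i], NULL);
        digests[i] = c[i].digest;
    }
}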
Note that the lower-end GPU (GeForce 8600) achieved a better performance for smaller data sizes than the high-end GPU model (GeForce 8800). This is due to two main reasons; first, a careful examination of the results break down reveals that CUDA 2.0 run time library takes considerably longer time to allocate pinned memory buffers (around 14x slower) than CUDA 1.0. Second, the GeForce 8600 GPU cores (a.k.a. shaders) are clocked at 1450MHz while the newer GeForce 8800 GTX GPU shaders are clocked at 1350MHz; consequently, for small data sizes that do not utilize the 16 multiprocessors of the GeForce 8800, the GeForce 8600 will complete the computation faster.  35  Figure 7. Speedups for MD5 direct hashing module for fully optimized GPU implementations running on GeForce 8600 and 8800, and multithreaded CPU implementations harnessing all available cores. The value x=1 separates the speedup (right) from the slowdown values (left).  36  Figure 8. HashGPU speedup for the MD5 Figure 9. HashGPU speedup for the SHA1 implementation of the direct hashing module. implementation of the direct hashing The shaded bars use GeForce 8800 GTX GPU. module. The shaded bars use GeForce 8800 GTX GPU. 2.6.1.1.2 Experimental Results The first question addressed by the evaluation is: What is the speedup offered by HashGPU compared to the original single-threaded CPU implementation? To answer this question, the experiment determines the ratio between the execution time on the GPU and the CPU for both MD5 and SHA1 hashing algorithms. Figure 8 and Figure 9 show the speedups achieved by HashGPU for MD5 and SHA1 respectively for the Direct Hashing module. Values larger than one indicate performance improvements, while values lower than one indicate a slowdown (this is indicated by a line at x=1). The results show that the fully optimized (i.e., pinned and shared memory  37  optimizations enabled) HashGPU starts to offer speedups for blocks larger than 700KB and offer up to 6x speedup for large data blocks. Note that as the data size increases, the performance improvement reaches a saturation point. Further, optimizing for shared memory accesses has the biggest impact on the achieved speedup. This highlights the fact that efficient memory management is paramount to gain maximum performance when considering data-intensive applications. It is also important to observe that for small blocks the GPU performs much worse than its CPU counterpart ( e.g., when memory accesses are not optimized, the performance can decrease up to 37x slow down for 4KB and MD5). This is mainly due to the overhead of host to device data transfers compared to the processing cost. I discuss the latter point in more detail in the next section. Figure 10 to Figure 13 present the results of experiments for the sliding-window hashing module. Qualitatively, the observed behavior is similar to the direct hashing module. Quantitatively, however, the speedup delivered by HashGPU is significantly higher (up to 25x speedup). The sliding window hashing introduces two extra parameters that influence performance: the window size and the offset. The window size determines how much data is hashed while the offset determines by how many bytes the window is advanced after each hash operation. The experiments explore four combinations for these two factors with values chosen to match those used by storage systems like LBFS [110], Jumbostore [68], and stdchk [30]. 
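For readers unfamiliar with content-based chunking, the following minimal sketch illustrates how these two parameters drive the computation on the CPU: a window-sized region is hashed at every offset step, and a boundary is declared whenever the hash matches a chosen bit pattern (the LBFS-style rule). The boundary mask and toy_hash() are illustrative placeholders, not the actual parameters or hash function used by the systems cited above.

/* Sketch of CPU-side sliding-window boundary detection (LBFS-style rule). */
#include <stddef.h>
#include <stdint.h>

#define BOUNDARY_MASK 0x1FFF   /* illustrative 13-bit mask: expected chunk size
                                  is roughly (mask + 1) * offset bytes */

static uint32_t toy_hash(const uint8_t *p, size_t n) {    /* stand-in for MD5/SHA1 */
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < n; i++) { h ^= p[i]; h *= 16777619u; }
    return h;
}

/* Record up to 'max' boundary positions in 'bounds'; returns how many were found. */
size_t find_boundaries(const uint8_t *data, size_t len,
                       size_t window, size_t offset,
                       size_t *bounds, size_t max) {
    size_t found = 0;
    for (size_t pos = 0; pos + window <= len && found < max; pos += offset) {
        /* One full hash per window position: this is the work HashGPU
           parallelizes by assigning one GPU thread per window. */
        if ((toy_hash(data + pos, window) & BOUNDARY_MASK) == 0)
            bounds[found++] = pos + window;
    }
    return found;
}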
To avoid cluttering the evaluation I present the results of window size of 20 bytes and offset of 4 bytes, and window size of 52 bytes and offset of 52 bytes. Although not reported here the other combinations of window sizes and offset present the same characteristics.  38  Figure 11. HashGPU sliding window hashing module speedup for SHA1. Window = Figure 10. HashGPU sliding-window 20bytes, Offset = 4bytes, Reduced hash. hashing module speedup for MD5. Window = 20 bytes, offset = 4 bytes, reduced hash. Figure 10 and Figure 11 show the results for a configuration that leads to intense computational overheads: a window size of 20 bytes and an offset of 4 bytes. In this configuration (in fact used by stdchk), HashGPU hashes the input data approximately 20x faster for MD5 and up to 25x faster for SHA1. Figure 12 and Figure 13 present the results for larger chunks (52 bytes) and offset (52 bytes). It is interesting to note that HashGPU performs a little slower when compared to the previous, more computing intensive, scenario. The sliding window hashing achieves higher speedup compared to direct hashing module for two reasons: First, the CPU implementation of the sliding window hashing will pay an additional overhead of a function call to hash each window, while HashGPU spawns one thread per window that can execute in parallel. Second, since the window size is usually 39  less than 64 bytes (the minimum input size for SHA or MD5), every window is padded to complete the 64 bytes. This translates to hashing considerably larger amounts of data for the same given input data, making this module more computationally intensive and thus a better fit for GPU processing due to its intrinsic parallelism. This is also the reason larger speedups are observed with smaller window sizes and offsets. Finally, the speedup achieved for SHA1 is better than MD5. This is attributed to SHA1 being more computationally demanding than MD5 algorithm and hence fitting better GPU's application domain.  Figure 12. HashGPU sliding-window Figure 13. HashGPU sliding-window hashing module speedup for MD5. Window hashing module speedup for SHA1. Window size=52 bytes. Offset=52 bytes. size=52 bytes. Offset=52 bytes.  40  2.6.1.1.3 Dissecting the Overheads The execution time of a particular computation on the GPU can be divided into the five stages outlined in Section 2.4.1: preprocessing, host-to-GPU data transfer, code execution, GPU-to-host result transfer, and finally post-processing operations on CPU (e.g., result aggregation, release of resources). This section analyzes how the execution time of each of these stages is affected by the three optimization features available in HashGPU: use of pinned memory, optimized use of shared memory, and reduced output size. I limit the analysis to the direct hashing module and MD5 algorithm implementation. Although not reported here, the sliding-window module and the SHA1 implementations present the same characteristics. Stage 1: Preprocessing. The application does not have a special data preprocessing operation. Therefore, this stage consists of memory allocation and GPU initialization only. The allocation of memory buffers (host and GPU) and the allocation of the buffer for returned results on the host main memory take between 3ms and 30ms depending on the data size and whether the pinned memory optimization is enabled (Figure 14). The initialization takes longer with pinned memory and larger data sizes as it is costly to find contiguous pages to accommodate larger data sizes. 
At first glance, the proportional overhead due to the initialization time may seem significant for the overall HashGPU performance (Stage 1 in Figure 19 and Figure 20). However, the impact of pinned memory allocation can be reduced by reusing buffers once they are allocated.

Figure 14. Stage 1 (preprocessing and initialization) execution time for the MD5 direct hashing module of HashGPU.

Stage 2: Data Transfer In. The host-to-device transfer time varies depending on the data size and on whether the pinned memory optimization is used. As expected, although using pinned memory slows down Stage 1, it improves transfer performance (Figure 15). Compared to the theoretical 4GB/s peak throughput of the PCIe x16 bus, the experiments obtained, for large blocks, 2.5GB/s with pinning and 1.7GB/s without.

Figure 15. Stage 2 (input data transfer) execution time of the MD5 direct hashing module with and without the pinned memory feature.

Stage 3: Data Processing. The performance of kernel execution is highly dependent on the utilization of shared GPU memory and its optimized use (i.e., avoiding bank conflicts; Figure 16). For large data volumes, without the optimized memory management, the kernel contributes up to approximately 65% of the overall operation time (Figure 19). When all optimizations are enabled, efficient use of shared memory reduces the kernel execution impact to about 20% of the total execution time (Figure 19 and Figure 20).

Figure 16. Kernel execution time for the MD5 direct hashing module with/without the shared memory optimization enabled.

Stage 4: Data Transfer Out. Transferring the output (Figure 17) has proportionally less impact on the overall execution than transferring the input (Figure 19 and Figure 20). The reason is that, for direct hashing, the output size is several orders of magnitude smaller than the input. Moreover, the output buffers are always pinned; therefore, this step always benefits from the high throughput achieved by using pinned memory pages. As a result, no major differences are observed in the impact caused by the output transfer across the tested configurations.

Figure 17. Execution times for the MD5 direct hashing module data transfer operation from the GPU global memory to the host memory with and without the pinned memory feature.

Stage 5: Post-processing. Finally, the aggregation of the kernel output into one hash value takes only up to a few milliseconds and has a minor impact on the overall execution time (Figure 18). Enabling the GPU optimizations does not influence the performance of this last stage (hash aggregation), since it executes on the CPU.

Figure 18. Execution times for the last stage – the hash aggregation.

Figure 19 and Figure 20 illustrate the proportion of the total execution time that corresponds to each execution stage. These results show the major impact of the pinned and shared memory optimizations on the contribution of each stage to the total runtime. Using pinned memory reduces the impact of the data transfer (compare Stage 2 in Figure 19 and Figure 20), while using shared memory reduces the kernel execution impact (compare Stage 3 in Figure 19 and Figure 20). Finally, enabling both optimizations increases the relative impact of the copy operation, since pinning memory demands a higher overhead during the allocation stage (Stage 1 in Figure 20).

Figure 19. Percentage of total execution time spent on each stage when none of the optimizations is enabled.

Figure 20. Percentage of total execution time spent on each stage with the pinned and shared memory optimizations enabled.
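The pinned-memory effect measured in Stages 1 and 2 can be reproduced with a handful of CUDA runtime calls. The sketch below (error handling omitted, sizes illustrative) contrasts pageable and pinned host buffers; it is a minimal illustration rather than HashGPU's actual allocation code.

/* Pinned vs. pageable host buffers: costly allocation, faster PCIe transfer. */
#include <cuda_runtime.h>
#include <stdlib.h>

#define N (64 * 1024 * 1024)   /* 64MB input, illustrative */

void transfer_demo(void) {
    void *d_buf, *pageable, *pinned;
    cudaMalloc(&d_buf, N);

    pageable = malloc(N);                 /* cheap to allocate, slower to copy */
    cudaMemcpy(d_buf, pageable, N, cudaMemcpyHostToDevice);

    cudaMallocHost(&pinned, N);           /* costly allocation (Stage 1) ... */
    cudaMemcpy(d_buf, pinned, N, cudaMemcpyHostToDevice);   /* ... faster copy (Stage 2) */

    /* Reusing 'pinned' across subsequent hashing calls amortizes its
       allocation cost, which is why buffer reuse matters so much here. */
    cudaFreeHost(pinned);
    free(pageable);
    cudaFree(d_buf);
}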
2.6.1.2 Application Level Performance

This section complements the synthetic benchmarks presented so far. The following experiment evaluates the application-level gains achieved by using HashGPU. Concretely, it evaluates the speedup offered by using HashGPU to detect similarities between successive checkpoint images of the same application. Checkpointing is an indispensable fault tolerance technique adopted by long-running applications. These applications periodically write large volumes of snapshot data to persistent storage in an attempt to capture their current state. In the event of a failure, applications recover by rolling back their execution state to a previously saved checkpoint. Consecutive checkpoint images often have a high degree of similarity (for instance, for the checkpointing application in section 3.5, up to 82% similarity is detected). I collected the checkpoint images using the BLCR checkpointing library [86] from 24 hour-long runs of BLAST, a popular bioinformatics application [34]. The interval between checkpoints is 5 minutes and the average image size is 279MB.

Table 3 and Table 4 compare the throughput of online similarity detection using standard hashing functions running on a single core of the Intel Core2 Duo 6600 2.40GHz processor and using HashGPU running on the NVIDIA GeForce 8800 GTX. The fixed block size similarity detection uses a block size of 20MB, and the variable block size similarity detection uses a window size of 20 bytes and an offset of 4 bytes. These results show a dramatic improvement in the throughput of online similarity detection with both fixed and variable size blocks. The results indicate that fixed-block similarity detection can be used even on 10Gbps systems, while the variable block size technique can be used on systems connected with 1Gbps links without introducing a performance bottleneck. Note that this experiment considers the similarity detection mechanism at the application level only, without integrating it with a file system.

Table 3. Online similarity detection throughput (in MBps) and speedup using SHA1.

                                            Throughput (MBps)             Similarity
                                            HashGPU   Standard  Speedup   ratio detected
  Fixed block size (direct hashing)         1015.9    155.76    6.5x      23%
  Variable block size (LBFS technique,
  sliding window hashing)                   194.24    8.06      24.1x     82.0%

Table 4. Online similarity detection throughput (in MBps) and speedup using MD5.

                                            Throughput (MBps)             Similarity
                                            HashGPU   Standard  Speedup   ratio detected
  Fixed block size (direct hashing)         1101.6    234.17    4.7x      23%
  Variable block size (LBFS technique,
  sliding window hashing)                   255.21    11.05     23.1x     80%

2.6.2 Evaluating the Performance Gains Enabled by CrystalGPU

To explain the performance gains enabled by the CrystalGPU layer, this section starts by summarizing the overheads of the various GPU processing steps within the HashGPU library when used independently. It then evaluates the performance gains brought by each of the optimizations CrystalGPU enables: buffer reuse, overlap of data transfers and computations, and transparent utilization of multiple GPUs. The evaluation uses a machine with an Intel Xeon Quad-core 2.33 GHz processor, 16GB of memory, a PCI Express 2.0 x16 bus, an NVIDIA GeForce GTX 480 GPU with 480 1.4 GHz cores and 1.5 GB of memory, and a GeForce 8800 GTX card.

The Overheads.
Figure 21 presents the proportion of the total execution time that corresponds to each of the stages outlined in Section 2.4.1 for the sliding window hashing module on the GTX 480 GPU card. Figure 22 presents the proportion of the total execution time that corresponds to each stage (outlined in Section 2.4.1) for the direct hashing module on the GeForce 8800 GTX card. These results show the major impact of the memory allocation and copy-in stages on the total execution time: up to 80-96% of the total execution time in the sliding window hashing module (Figure 21) and 70-90% in the direct hashing module (Figure 22). Two opportunities exist to reduce the overhead of these two stages: first, the overhead of the memory allocation stage can be avoided by reusing memory buffers; second, the copy-in overheads can be hidden by overlapping data transfers and computation. Given the generic nature of these optimizations, CrystalGPU offers them in an application-agnostic layer. In more detail, the CrystalGPU memory and task management APIs implicitly apply these two optimizations for every task, regardless of the application kernel. The rest of this section presents an evaluation of the benefits of integrating HashGPU on top of CrystalGPU.

Figure 21. Percentage of total HashGPU sliding window hashing execution time spent on each stage without any optimization.

Figure 22. Percentage of total HashGPU direct hashing module execution time spent on each stage without any optimization.

The Performance Gains. Figure 23 and Figure 24 show the average speedup obtained when using HashGPU on top of CrystalGPU for the sliding window hashing and direct hashing modules, respectively, to hash a stream of 10 data blocks. The baseline for the speedup calculation is single-core performance. In addition to the speedup, the figures also present the processing throughput for the key plot lines. The dual GPU configuration uses a second GPU card in the system: an NVIDIA Tesla C2050 GPU with 448 1.1GHz cores and 3GB of memory. While the experiment uses batches of 10 blocks, the evaluation presented in a technical report [76] shows that a batch of at least 3 blocks is needed to obtain close to maximal performance gains. Note that, to avoid cluttering the figures, the plots do not report the standard deviation, as it is negligibly small for all experiments.

The experiment shows the effect of block size on the achieved speedup. For smaller block sizes, and irrespective of which optimization is enabled, the memory allocation and data transfer overheads overshadow the performance improvement in the computation (i.e., calculating the hash); conversely, for larger block sizes, these overheads are amortized over the longer computation time, hence achieving better speedups compared to the CPU. This experiment demonstrates that using all the optimizations enabled by CrystalGPU leads to large performance gains: up to 190x in sliding window hashing and 45x in direct hashing (for large data blocks), compared to up to 27x and 7x, respectively, provided by HashGPU alone. Further, while the original sliding window hashing performance lags behind the CPU performance for data blocks smaller than 64KB, and while the original HashGPU direct hashing performance lags behind the two-socket CPU performance (i.e., two CPUs, each on a separate socket) for all data sizes, adding CrystalGPU enables much higher performance gains.
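Before examining the individual optimizations in the figures, the transfer/compute overlap that CrystalGPU exploits can be pictured with a minimal double-buffering sketch using CUDA streams. The kernel and buffer names are hypothetical placeholders; CrystalGPU's actual task manager generalizes this pattern across arbitrary kernels and multiple GPUs.

/* Double-buffered pipeline: the copy-in of block i+1 overlaps with the
   hashing kernel of block i, using two CUDA streams and pinned staging buffers. */
#include <cuda_runtime.h>
#include <string.h>

#define NSTREAMS 2
#define BLK (4 * 1024 * 1024)   /* 4MB blocks, illustrative */

__global__ void hash_kernel(const unsigned char *in, unsigned int *out, size_t n) {
    /* placeholder body: each thread would hash its assigned window or chunk */
}

void hash_stream(const unsigned char *host_blocks[], int nblocks, unsigned int *d_out) {
    cudaStream_t s[NSTREAMS];
    unsigned char *h_pin[NSTREAMS], *d_in[NSTREAMS];
    for (int i = 0; i < NSTREAMS; i++) {
        cudaStreamCreate(&s[i]);
        cudaMallocHost(&h_pin[i], BLK);   /* pinned buffers, allocated once and reused */
        cudaMalloc(&d_in[i], BLK);
    }
    for (int b = 0; b < nblocks; b++) {
        int i = b % NSTREAMS;
        cudaStreamSynchronize(s[i]);      /* buffer i is free again: reuse it */
        memcpy(h_pin[i], host_blocks[b], BLK);
        cudaMemcpyAsync(d_in[i], h_pin[i], BLK, cudaMemcpyHostToDevice, s[i]);
        hash_kernel<<<256, 256, 0, s[i]>>>(d_in[i], d_out, BLK);
    }
    for (int i = 0; i < NSTREAMS; i++) cudaStreamSynchronize(s[i]);
}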
The figures demonstrate that the buffer reuse optimization is able to amortize memory management costs; enabling, as a result, up to 100x sliding window hashing speedup for large data blocks. The overlap feature enables an additional speedup increase that corresponds to the portion of time spent on data transfer operations (to up to 125x). Using both GPUs available on the machine increases the speedup further (to up to 190x). Direct hashing results (Figure 24) show a similar pattern. Two contention points prevent dual GPU cards to achieve linear speedup in the studied case: First, the limited percentage of time spent in computation (compared to the overheads of memory allocation and I/O). Second, in the direct hashing case, post processing stage is performed sequentially on the CPU.  49  Figure 23. Achieved speedup for sliding window hashing for a stream of 10 jobs. Small data sizes in the left figure (logarithmic scale on Y axis), and large data size in the right figure (linear scale on Y axis). Values below 1 indicate slowdown. The baseline is the performance on a single core on Intel Xeon Quad-core 2.33 GHz. The numbers over the “Dual socket CPU” and “Overlap, Buffer reuse” data points indicate the achieved processing throughput in MBps.  Figure 24. Achieved speedup for direct-hashing for a stream of 10 jobs. Small data sizes in the left figure (logarithmic scale on Y axis), and large data size in the right figure (linear scale on Y axis). Values below 1 indicate slowdown. The baseline is the performance on a single core on Intel Xeon Quad-core 2.33 GHz. The numbers over the “Dual socket CPU” and “Overlap Buffer reuse” data points indicate the achieved data processing throughput in MBps.  50  2.6.3 Add a CPU or a GPU? To inform the system builders’ decision, this section focuses on comparing the performance gain when extending the system with a second CPU versus a GPU card. The evaluation compares the performance of a system originally configured with an Intel Xeon Quad-core 2.33 GHz CPU and 16GB of memory when extended either by a second similar processor, or with an NVIDIA GeForce GTX 480 GPU with 480 1.4 GHz cores and 1.5 GB of memory. The CPU evaluation uses a multithreaded implementation of the content-based chunking module. The evaluation (not presented here) shows that using 16 threads leads to the highest performance on the dual CPU system. The GPU evaluation uses HashGPU on top of CrystalGPU as presented in the previous section. Figure 23 and Figure 24 shows the speedup obtained when using an additional CPU (the solid line labeled “Dual Socket CPU”) compared to using a single GPU card (the dotted line labeled “Overlap, Buffer Reuse”). The figure shows that the single GPU configuration achieves up to 125x sliding window hashing performance gain (relative to the baseline) and 15x performance gain compared to the dual CPU configuration (which achieves only 8x performance gain against the baseline). Further, the single GPU configuration achieves up to 28x direct hashing performance gains compared to only 8x performance gain by the dual CPU configuration (a 3.5x relative performance gain). Further, note that despite the 8x performance improvement of the dual CPU configuration, this configuration only achieves 129MBps sliding window hashing throughput, a throughput close to the 1Gbps network throughput and significantly lower than the throughput of high performance networks (often 10Gbps). 
Consequently, regardless of the level of similarity detected sliding window hashing on two CPUs will not bring performance benefits for 1Gbps system and will introduce a performance bottleneck for 10Gbps systems (the next section presents an end-to-end evaluation). These results highlight that adding a GPU is a better fit for the hashing-based workloads due to its highly parallel architecture and high throughput memory system. 2.6.4 Integrated System Evaluation This section evaluates the gains enabled by integrating the HashGPU/CrystalGPU modules with MosaStore. The evaluation uses a 22 nodes cluster of 2.33GHz Intel Xeon Quad-core CPU, 4GB memory nodes connected at 1Gbps. The storage system client machine has two 51  CPUs of Intel Xeon Quad-core 2.33 GHz (8 cores in total), 16GB of memory, a PCI Express 2.0 x16 bus, and NVIDIA GeForce GTX 480 GPU. I present here averages over 10 runs. To avoid cluttering the figures, the plots do not report the standard deviation as it is significantly small for all experiments. The experiments evaluate the system using two configurations: Fixed size blocks that uses direct hashing for similarity detection. The block size is 1MB, the default block size in MosaStore, and Content based chunking configuration, which uses sliding window hashing to detect block boundaries. The block size was 1.2MB on average (with minimum block size of 256KB and maximum block size of 4MB). The client SAI is configured to stripe the write operations to four storage nodes in parallel. To explore the gains enabled by GPU offloading the performance of the following three configurations of the storage system are compared:   non-CA, in which the content addressability module is disabled and all data is written directly to the storage nodes (i.e., without any hashing and similarity detection overheads).    CA-CPU, in which all processing related to content addressability is done by the CPU (i.e., hashing of data blocks and hash comparisons). Compared to the previous configuration, this configuration exposes the performance impact of hashing data to support content addressability.    CA-GPU, which uses the integrated HashGPU/CrystalGPU stack to offload the computation of hashes to the GPU. Compared to the previous configuration, this configuration exposes the gains brought by offloading computationally intensive operations to a GPU.  Note that the performance of the system when using content addressability varies depending on the degree of data similarity present in the workload. To evaluate the entire performance spectrum the experiments use the following three workloads:   Different: The first workload consists of writing a set of completely different files. This workload exposes all overheads as all data need to be hashed and transferred across the network to the storage nodes. Moreover, no similarity can be detected between writes, which implies no opportunity to reduce space or bandwidth usage. Exploring this workload has a second advantage: the performance comparison holds 52  for systems that use hashing for other goals than content addressability (e.g., for data integrity checks only).   Similar: The second workload represents the other end of the spectrum: it exposes an upper bound for the performance gains that can be obtained using content addressability and maximizes the hash-computation overheads in relation with other storage overheads. When the files are identical, data is transferred only once across the network yet similarity detection overheads still exist.  
  Checkpoint: Finally the third workload represents a real application data. This third workload involves 100 successive data checkpoint images, taken at 5 minute intervals for the BLAST [34] application using BLCR [86] checkpointing library (the average checkpoint size is 264.7MB). Fixed size blocks similarity detection detects 21-23% similarity on average between successive checkpoint images, depending on the block size, while content based chunking detects 76-90% similarity depending on the block size.  Results for the “Different” Workload. Figure 25 presents the write throughput with the different workload with fixed size blocks configuration. The figure shows that under workloads with completely non-similar data, the non-CA configuration achieves the highest throughput while the CA-CPU and CA-GPU configurations lag behind for small files. The reason is that with completely non-similar files, the similarity detection mechanism increases the write operation’s overhead without bringing any performance gains (expected by reducing the amount of data that need to be transferred). Figure 26 presents the write throughput under the content based chunking configuration. The figure shows that the CA-CPU and CA-GPU configuration lag behind non-CA in almost all file sizes. Note that content based chunking configuration using dual CPUs is caped at 46Mbps (significantly lower than the network throughput of a 1Gbps NIC card) highlighting that content based chunking introduces a new performance bottleneck for the write operation (this is also the case with the similar workload presented next).  53  Figure 25. Average throughput while writing 40 different files back-to-back in the fixed block configuration. Note the logarithmic scale on the y-axis.  Figure 26. Average throughput while writing 40 different files back-to-back in the content based chunking configuration. Note the logarithmic scale on the y-axis. Results for the “Similar” Workload. Figure 27 presents the write throughput with the similar workload with fixed size blocks configuration. The figure shows that under workloads with completely similar data, MosaStore with HashGPU significantly outperforms the CPU version and achieves over two times higher throughput for files larger than 64MB. This is because the similar workload is compute bound: only the first file needs to be transferred over the network to the storage nodes, while the computationally intensive hash computations will detect that the other files are similar. Under this workload HashGPU enables doubling the system throughput by accelerating the hash computations.  54  Figure 28 presents the write throughput for the similar workload with content based chunking configuration. Note that the CPU version achieves lower throughput than non-CA configuration. This is because the content based chunking approach adds significant computation overhead that introduces a new bottleneck in the system. On the other hand, MosaStore with HashGPU significantly outperforms the CPU and non-CA versions and achieves over 4.4x and 2.1x, respectively, higher throughput for files larger than 64MB. This result validates the hypothesis that while content based chunking can reduce the transferred data size, its computation overhead on the CPU hinders its use in high performance computing system. Offloading this computational step to GPU not only eliminates this bottleneck but brings significant performance gains.  Figure 27. 
Average throughput while writing the same file 40 times back-to-back in the fixed block configuration. Next section discusses the CA-Infinite configuration. Note the logarithmic scale on the y-axis.  55  Figure 28. Average throughput while writing the same file 40 times back-to-back in the content based chunking configuration. Next section discusses the CA-Infinite configuration. Note the logarithmic scale on the y-axis.  Figure 29. Average throughput while writing 100 BLAST checkpoints back-to-back while varying the block size. “Fixed” denoted evaluation with the fixed block size configuration while “CB” denotes the content based chunking configuration. The content based chunking configuration was tuned to produce average chunks sizes close to the sizes indicated on xaxis. The numbers on top of the bars denote the average similarity percentage detected using the configuration. Results for the “Checkpointing” Workload. Figure 29 presents the write throughput with the checkpoint workload for all system configurations while varying the block size. These results lead to the following observations:   First, offloading to GPU always offer sizeable benefits compared to a system with dual CPU configuration. Offloading to the GPU enables about 1.3x write throughput improvement with fixed block sizes and up to 5x write throughput improvement with 56  content based chunking compared to a content addressable system that does not offload (i.e. uses the CPU).   Second, content based chunking configuration achieves the highest throughput when offloaded to GPU; up to 5x higher than CPU configuration and 2.3x than a system without content addressability. Although content based chunking adds higher overhead in detecting blocks boundaries, it detects 3-4x more similarity across checkpoint images, significantly reducing the amount of data that needs to be transferred, hence offering higher throughput in the GPU offloading case.    Third, content based chunking using dual CPUs achieves the lowest throughput (49MBps) despite the high similarity ratio detected. This is because the system is bottlenecked by the content based chunking mechanism on the CPUs.    Fourth, the speedup offered by offloading to the GPU increases as the block size increases (as evaluated in the previous sections). This is the reason the system that offloads to the GPU performs increasingly better for 1MB, and 4MB blocks.    Finally, the block size offers a performance tuning knob: small block sizes enable detecting higher similarity, but increase the system overhead (for transferring larger number of data blocks, and similarity detection). This tradeoff is clearer with content based chunking configuration when offloaded to the GPU. The configuration achieves the highest throughput with block sizes close to 1MB although higher similarity is detected using 256KB blocks, and although larger blocks (i.e 4MB) are expected to reduce data transfer overheads.  In summary, for workloads that have data similarity, exploiting GPU’s computational power can bring tangible performance benefits to content addressable storage systems. Further, in systems that use hashing only for data integrity checks (as indirectly evaluated by the different workload), hashing can be efficiently offloaded to the GPU. 2.6.5 What Would the System Performance be if we had Infinite Compute Power? 
While the results (Figure 27 and Figure 28) show that significant performance gains can be achieved by offloading the hash computation to the GPU, it is important to estimate the practicality of further investing in CPU/GPU hardware or software optimizations to obtain higher throughput. To answer this question, I experimented with a hypothetical storage system configuration (CA-Infinite in Figure 27 and Figure 28) that represents a system with 57  infinite compute power for computing hash functions (sliding window hashes or direct hashes). This configuration uses an oracle that ‘computes’ the hash function instantly (thus emulating infinite compute power to compute hashes). The results in Figure 27, (which presents the throughput of the system when using the similar workload with fixed block configuration) show that CA-GPU performance is almost equivalent to optimal (as represented by CA-Infinite plot line). Further, the results in Figure 28, (which presents the throughput of the system using content based chunking) show that CA-GPU performance is close to optimal: the throughput loss is lower than 50% for files smaller than 16MB, and lower than 25% for larger files. Thus, one can argue that offloading to the GPU, which in this configuration and workload combination enables up to 2.3x higher system throughput, will offer close to the maximal benefits that could be obtained by accelerating the hash computation on standard commodity clusters. 2.6.6 The Impact on Competing Applications While the previous section has demonstrated that offloading computationally intensive primitives to the GPU can improve the system’s throughput, the impact of this approach on the overall client system performance still needs to be evaluated. On the one hand, offloading frees CPU cycles on the client system. On the other hand, offloading might add a significant load on the client’s kernel and the I/O subsystems to handle data transfers to and from the device. Consequently, this technique may negatively impact applications that are running concurrently, especially those with high I/O load. This section evaluates the impact of the proposed approach on concurrently running applications. The experiment uses two workloads to reproduce the diversity of applications’ possible resource usage pattern: a compute bound application (the experiments use a multithreaded prime number search application), and an I/O bound application (the experiments use the compilation of the Apache web server v2.2.11 as a representative I/O bound application, which particularly stresses the disk I/O channel). Further the evaluation evaluates the system using MosaStore with fixed size block configuration, as content based chunking approach using CPUs clearly adds significant computational overhead to the system and is a less viable option in performance centric HPC systems (Section 2.6.4). The evaluation uses a client machine with an Intel quad-core 2.66 GHz processor, PCI Express 2.0 x16 bus, and NVIDIA GeForce 9800GX2 GPU. 58  To measure the performance impact, the experiment times the application run while the client system is also serving an intense write workload: writing 1GB files back to back. The performance baseline for all results presented in the figures below is the execution time of the application on an unloaded system (i.e., neither the MosaStore SAI client nor other applications are running on the system at the time). The results present averages over 20 runs. The impact on a compute bound application. 
Figure 30, Figure 31, and Figure 32 present the MosaStore achieved throughput (left), and the compute bound application execution time increase (i.e., application slowdown – the lower, the better) (right) under the three workloads presented in Section 2.6.4 These results confirm that:   First, outsourcing to the GPU frees CPU cycles that can be effectively used by a compute intensive competing application. In all three experiments, with different workloads, the competing application performs faster on a client system that offloads to the GPU than on a client system that computes hash functions on the CPU. The difference can vary from as high as reducing the slowdown by half (for the ‘different’ workload) to reducing the slowdown by 10-20% (for the other two workloads).    Second, outsourcing to the GPU enables better storage system throughput (around 2.5x better under the ‘similar’ workload, and 2x better under the checkpoint workload) even when a competing compute intensive application is present on the client node.    Third, the throughput of the GPU-enabled storage system is only slightly affected by the competing application: less than 18% throughput loss compared to the case when the client system runs on a dedicated node.  Note that the non-CA system imposes a significant burden on the competing application (between 225% and 80% slowdown depending on workload). This is due to high TCP processing overheads (I have verified this by running the competing application while continuously generating TCP traffic using iperf [4]. In this case iperf caused 185% application slowdown). A second, surprising, observation is that under the different workload, CA-GPU has lower impact on the competing compute bound application than the non CA system configuration (that does not consume CPU cycles for hashing). While I do not have a precise 59  understanding for this performance disparity, my intuition is that it is related to the blocking of the SAI threads (introduced by GPU calls) that yield more frequently to the application.  Figure 30. (Left) MosaStore average achieved throughput under the different workload while running a competing compute intensive application. (Right) Competing application slowdown (the lower the better).  Figure 31. (Left) MosaStore average achieved throughput under the ‘similar’ workload while running a competing compute intensive application. (Right) Competing application slowdown (the lower, the better).  60  Figure 32. (Left) MosaStore average achieved throughput under the ‘checkpoint’ workload while running a competing compute intensive application. (Right) Competing application slowdown (the lower, the better). The Impact on an I/O bound application. Figure 33, Figure 34, and Figure 35 present MosaStore’s achieved throughput (left) and the disk IO bound application slow down (right), under the three workloads presented in Section 2.6.4. There are three main takeaways from these experiments:   First, offloading to the GPU does not introduce a bottleneck for a competing I/Ointensive application. The competing application slowdown is marginally (5-15%) lower (thus better) than when hashing on CPU.    Second, offloading to the GPU enables better storage system throughput (around 2x better under the ‘similar’ and 1.7x better under the checkpoint workload) even when a competing I/O intensive application is present on the client node.  
 Third, the throughput of the GPU-enabled storage system is only slightly affected by the competing application: less than 6% throughput loss compared to the case when the client system runs on a dedicated node.

Figure 33. (Left) MosaStore's average achieved throughput while running an I/O intensive application with the different workload. (Right) Competing application slowdown (lower is better).

Figure 34. (Left) MosaStore's average achieved throughput while running an I/O intensive application with the similar workload. (Right) Competing application slowdown (lower is better).

Figure 35. (Left) MosaStore's average achieved throughput while running an I/O intensive application with the checkpoint workload. (Right) Competing application slowdown (lower is better).

2.6.7 Evaluation Summary

The evaluation highlights five findings:

 First, the evaluation shows that offloading the content addressability mechanism to the GPU not only brings tangible performance gains but, importantly, opens the door for using content addressability in high performance computing (HPC) systems, an area where this mechanism has often been avoided, not only due to its raw computational overheads but mostly because it adds a performance bottleneck where high-speed networks have been deployed.

 Second, the evaluation shows that a task management engine (i.e., the CrystalGPU layer) that orchestrates GPU offloading and provides application-agnostic optimizations is essential not only from a software architecture standpoint but also from a performance standpoint.

 Third, the evaluation shows that GPU offloading provides close to optimal performance: the system throughput with GPU offloading is lower than that of an infinite-compute-power configuration by less than 10% in the fixed block configuration and by 25-50% in the content based chunking configuration.

 Fourth, GPU offloading does not introduce new system bottlenecks: the impact on competing CPU-bound and I/O-bound applications is negligible.

 Finally, the evaluation provides an important data point for system builders: for hashing-based workloads, extending the system with a GPU achieves significantly higher performance than extending the system with a second CPU (of comparable market price).

2.7 Related Work

The background and the introduction have already positioned this project relative to important aspects of related work. This section adds two new dimensions.

Hashing. Hashing is commonly used by storage systems to support content addressability [120, 124], data integrity checking [58, 96, 158], load balancing [60, 64], data similarity detection [10, 110], deduplication for backup operations [160], and version control (e.g., rsync [10]). The use of hashing in these systems differs along multiple axes, including block size (from fixed-size blocks [120, 124], which use hashing similar to the HashGPU direct-hashing module, to block boundaries detected based on content [68, 110, 160], which use the sliding-window hashing described in Section 2.3) and targeted deployment environment (from personal use [124] and peer-to-peer [58] to the data center [160]). Other uses of hashing include the detection of copyright violations [48, 136], compact set representation [45], and various security mechanisms [158].

GPU harnessing. Exploiting GPUs for general purpose computing has recently gained popularity, particularly as a means to increase the performance of scientific applications. I refer the reader to Owens et al. [113] for a comprehensive survey.
More related to the infrastructural focus of this work, Curry et al. [59] explore the feasibility of using GPUs to enhance RAID system reliability. Their preliminary results show that GPUs can be used to accelerate Reed Solomon codes [123], used in RAID6 systems. Along the same lines Falcao et al. [69] show that GPUs can be used to accelerate Low Density Parity Checks (LDPC) error correcting codes. Harrison and Waldron [87], study the feasibility of GPUs as a cryptographic accelerator. To this end they implement the AES encryption algorithm using CUDA and report 4x speedup. Finally, Moss et al. [108] study the feasibility of accelerating the mathematical primitives used in RSA encryption. They use OpenGL and render the application into a graphics application and report up to 3x speedup. Moreover, recently exploiting GPUs to accelerate security operations was adopted in few security products, including Kaspersky antivirus [5] and Elcomsoft password recovery software [1]. This study is different from the above studies in two ways. First, unlike the previous studies that focus on accelerating standalone primitives, this study evaluates the viability of exploiting the GPU computational power in the context of a complete system design. Second, 64  the hashing primitives addressed by work are data-intensive with a ratio of computation to input data size of at least one order of magnitude lower than in previous studies. Finally, the storage systems research community has, recently, adopted the design approach proposed by this stream of work: to exploit the GPUs as storage system accelerators. A number of recent studies adopt this approach, including a GPU accelerated encrypted storage system (PTask based system [127], GPUStore [143]), a deduplicated storage system (Shredder [44] P-Dedupe [156], and GHOST [94]), low cost RAID storage [93], and file matching service [106], erasure coding based reliable storage [149].  2.8 Discussion This section focuses on a number of interrelated questions: 1.) Are HashGPU hash function implementations strong? Is the system backward compatible? Most of today’s hash functions are designed using the Merkle-Damgard construction [62, 104]. Merkle and Damgard show that collision-free hash functions can divide data into fixedsized blocks to be processed either sequentially, using the output of each stage as the input to the next, or in parallel and then concatenating and hashing the intermediate hash results to produce a single final hash value. Most hash functions such as MD5 and SHA adopt the iterative model because it does not require extra memory to store the intermediate hash results. The proposed approach for the direct-hashing module is based on the parallel construction. This choice has two implications. First, as a direct implication of the Merkle and Damgard argument, the resulting hash function will still have the same strength as the original sequential construction. Second, while the HashGPU sliding-window module is still backward-compatible, hashing only small data windows, the direct-hashing technique produces different hash values compared to the sequential MD5 or SHA versions. This does not have an impact on the HashGPU usability as long as all entities in the storage system use the same library. One way to reduce the migration burden is to provide CPU implementations of HashGPU that implement the same algorithm. 2.) Can other middleware benefit from this idea?  
65  I believe that a host of other popular primitives used in distributed systems can benefit from GPU support, such as erasure coding, compressed set representation using Bloom filters, and data compression among others. For example, different parallel algorithms for ReedSolomon coding exist [61] and can be deployed on GPUs; on the other hand, Gilchrist [80] proposes a parallel implementation of the bzip2 loss-less data compression algorithm that may benefit from GPU support. Further, in my previous work I collaborated on experimenting with a GPU-optimized Bloom filter implementation [56]. In general, I believe that GPUs can be used by any data-parallel application to provide significant performance improvements, provided that the number of operations performed per byte being processed is sufficiently high to amortize the additional overheads due to host-device memory transfers. 3) How does HashGPU perform against the theoretical peak? Since HashGPU is a data-intensive application as opposed to a compute-intensive one, I first consider memory access throughput. While the memory bandwidth listed in NVIDIA’s specification [12] is as a high 32GB/s for GeForce 8600GTS GPU and 86.4GB/s for GeForce 8800 GTX GPU, the real memory access bottleneck is the PCIe bus, listed at 4GB/s in each direction. This is congruent to the evaluation experiments, which show that pinned memory transfers achieve up to 2.48GB/s. Furthermore, NVIDIA’s GeForce 8600 theoretical non floating point instruction exaction peak rate is estimated at to 46.4 GIPS (Giga Instruction Per Second). The HashGPU kernel performs at up to 19.54 GIPS, a slowdown compared to the peak rate mainly due to internal memory copy operations inside the GPU. As a matter of fact, recently, Bakhoda et. el. [39] built a detailed micro-architecture performance simulator for NVIDIA cards and evaluated the performance of a dozen of nongraphical application on the GPU including the HashGPU code. The study concludes with insightful guidelines for optimizing general purpose programming on GPUs. Based on microarchitecture level information, Bakhoda et al. [39] estimate that HashGPU uses the shared memory efficiently: over 90% of the memory accesses are to shared memory and less than 10% to global memory. Moreover the study estimates that HashGPU is able to maximally use all the running threads and avoid stalls. 4) What if chip multiprocessors had many cores? Would a GPU-driven solution be necessary?  66  Processor manufacturers expect to develop many-core architectures over the next decade [138]. While this project proposes HashGPU as a solution that addresses germane problems for storage systems with current technology, the high-level principles embodied in this work will extend to the many-core arena as well. Some of the challenges that this work encountered in the development of HashGPU relate to the efficient handling of memory transfers and shared memory. Such concerns are likely to exist even with many cores on a single chip; it is expected that many core architectures will utilize heterogeneous cores and non-uniform memory accesses (NUMA) and such architectures will continue to require careful memory management to achieve significant speedups.  2.9 Conclusions At a high level of abstraction, computing system design is a multi-dimensional constraint optimization problem: designers have at hand various components that offer different cost-toperformance characteristics as well as the techniques to put these components together. 
These techniques have their own overheads that lead to performance tradeoffs and new limitations. In this context, a system designer optimizes for key success metrics (such as latency, throughput, storage footprint, and data durability) within specific constraints (such as cost and availability). Historically, the unit costs for raw storage, computation, or network transfers have evolved largely in sync. Recently, however, massively multi core commodity hardware (such as GPUs) holds the promise of a sharp, one order of magnitude, drop in computation costs. This drop in the cost of computation, as any order-of-magnitude drop in the cost per unit of performance for a class of system components, triggers the opportunity to redesign systems and to explore new ways to engineer them to recalibrate the cost-to-performance relation. This study demonstrates the feasibility of exploiting GPU computational power to support distributed storage systems, by accelerating popular compute-intensive primitives. This project presents a storage system prototype capable of harvesting GPU power, and evaluated it under two system configurations: as a content addressable storage system and as a traditional system that uses hashing to preserve data integrity. Further, the evaluation evaluated the impact of offloading to the GPU on competing applications’ performance, and the impact of competing applications on the storage system performance. The evaluation results show that this technique can bring tangible performance gains enabling close to 67  optimal system performance under the set of workloads and configurations considered without negatively impacting the performance of concurrently running applications. Whereas this work has effectively demonstrated the feasibility of tapping into GPUs computational power to accelerate compute-intensive storage system primitives, I view this effort as a first step towards the larger goal of understanding how one can use heterogeneous multicore processors for independent system-level enhancements rather than applicationlevel speedups. Note that future single-chip massively multicore processors will likely be similar to the used experimental platform along two directions: they will be heterogeneous (e.g., include different cores with different execution models, SIMD vs. MIMD) and will offer complex application manageable memory hierarchies. In this context performance gains can only be obtained by careful orchestration of data transfers, data placement, and task allocation. My experience with a combination of parallel hashing on a GPU, optimized memory handling, and task scheduling leads us to believe that these goals can be effectively be mapped to independent abstraction layers. Processor manufacturers expect to develop many-core architectures over the next decade [138]. While this project proposes a solution that addresses germane problems for storage systems with current technology, the high-level principles embodied in this work will extend to the many-core arena as well. Some of the challenges encountered in this project relate to the efficient handling of memory transfers and shared memory. Such concerns are likely to exist even with many cores on a single chip; it is expected that many core architectures will utilize heterogeneous cores and non-uniform memory accesses (NUMA) and such architectures will continue to require careful memory management to achieve significant speedups.  
Chapter 3
3 Cross-Layer Optimizations in Storage Systems
Today's large-scale computing systems (e.g., supercomputers, cloud computing infrastructures) aggregate thousands of computing nodes and offer ample computational power. These systems support large-scale scientific applications that generate an intense storage workload. For instance, data-intensive applications [9, 34] often use thousands of computing nodes to search through or analyze terabytes of stored data. For such applications, the storage system's throughput and scalability play a key role in the overall application performance. The risk is that the I/O system becomes the bottleneck for an expensive set of compute resources.

Moreover, applications that use these compute resources are highly heterogeneous over multiple axes related to storage system usage patterns and required semantics. These include: file granularity, read vs. write composition (e.g., read- or write-only access patterns), data lifetimes, durability requirements (e.g., some files can be recomputed), consistency requirements, data sharing levels (e.g., sometimes thousands of nodes concurrently access the same data), and security requirements (e.g., in terms of authentication, data integrity, and confidentiality). Further, data is rarely shared between applications.

A one-size-fits-all dedicated storage system that serves such diverse requirements while meeting the access requirements of data-intensive applications is particularly complex and costly. Moreover, this approach often introduces a performance or scalability bottleneck. An alternative approach is specialization: that is, exploiting workload characteristics to optimize the data store for application-specific usage patterns. The Google file system [79] and PVFS [49] are only two examples of this approach: the Google file system optimizes for large datasets and append-intensive access patterns, while PVFS optimizes for sequential read/write access to large datasets in HPC environments.

This project proposes an approach to specialize the storage system, at runtime, for the specific application using the system. In particular, this project proposes using file system custom metadata as a bidirectional communication channel between applications and the storage system. This communication channel can be used to pass hints that enable cross-layer optimizations, an option hindered today by the ossified file-system interface. The project studies this approach in the context of storage system support for large-scale workflow execution systems (the deployment setup is detailed in section 3.3 and Figure 36): the proposed workflow-optimized storage system exploits application hints to provide per-file optimized operations, and exposes data location to enable location-aware scheduling.

This stream of work argues that an incremental adoption path for cross-layer optimizations in storage systems exists; it presents the system architecture for a workflow-optimized storage system (section 3.6) and its integration with a workflow runtime engine, and evaluates the proposed approach using synthetic as well as real application workloads (section 3.7). The developed prototype demonstrates the viability of the cross-layer optimization approach. Further, the evaluation using synthetic benchmarks and real applications shows that the proposed approach brings tangible performance gains.
3.1 Context Custom metadata features (a.k.a., ‘tagging’) have seen increased adoption in systems that support the storage, management, and analysis of ‘big-data’ [35, 36, 91, 114]. However, the benefits expected are all essentially realized at the application level either by using metadata to present richer or differently organized information to users (e.g., better search and navigability [95, 98]) or by implicitly communicating among applications that use the same data items (e.g., support for provenance [109]). My thesis is that, besides the above uses, custom metadata can be used as a bidirectional communication channel between applications and the storage system and thus become the key enabler for cross-layer optimizations that, today, are hindered by an ossified file-system interface. This communication channel is bidirectional as the cross-layer optimizations enabled are based on information passed in both directions across the storage system interface (i.e., application to storage and storage to application). Possible cross-layer optimizations (surveyed in detail in section 3.8) include:   (top-down) Applications can use metadata to provide hints to the storage system about their future behavior, such as, per-file access patterns, ideal data placement (e.g. cousage), predicted file lifetime (i.e., temporary files vs. persistent results), access locality 70  in distributed setting, desired file replication level, or desired QoS. These hints can be used to optimize the storage layer.   (bottom-up) The storage system can use metadata as a mechanism to expose key attributes of the data items stored. For example, a distributed storage system can provide information about data location, thus enabling location-aware scheduling. The proposed approach has three interrelated advantages: it uses an application-agnostic  mechanism, it is incremental, and it offers a low cost for experimentation.  First, the  proposed communication mechanism: simply annotating files with arbitrary <key, value> pairs, is application-agnostic as there are no application-specific provisions for cross-layer information passing. Second, the proposed approach enables evolving applications and storage-systems independently while maintaining the current interface (e.g., POSIX) and offers an incremental transition path for legacy applications and storage-systems: A legacy application will still work without changes (yet will not see performance gains) when deployed over a new storage system that supports cross-layer optimizations. Similarly a legacy storage will still support applications that attempt to convey optimization hints through extended attributes (as suggested), yet without offering performance benefits. As storage and applications incrementally add support for passing and reacting to hints the overall system will see increasing gains. Finally, exposing information between different system layers implies tradeoffs between performance and encapsulation. To date, these tradeoffs have not been widely explored. I posit that a flexible encoding (key-value pairs) as the information passing mechanism offers the flexibility to enable low-cost experimentation within this tradeoff space. Enabling cross layer optimization through standard cross layer communication mechanism is a major departure from the current approaches for designing storage systems (dominantly proposing new API or designing application specialized storage systems). 
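To make the mechanism concrete, the following C sketch shows what both directions of the channel look like from a workflow runtime: a hint passed top-down and a data-location query answered bottom-up, both expressed as ordinary extended-attribute calls. The mount point and the attribute names (user.woss.pattern, user.woss.location) are hypothetical stand-ins for the hint vocabulary detailed later in this chapter (Table 12); this is a minimal sketch, not the system's actual interface.

#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>

int main(void) {
    const char *path = "/woss/workdir/stage1.out";   /* hypothetical file on the shared FS */

    /* Top-down: the workflow runtime hints that this file feeds the next stage
     * of a pipeline, so the storage system should keep it on the local node. */
    const char *hint = "pipeline";
    if (setxattr(path, "user.woss.pattern", hint, strlen(hint), 0) != 0)
        perror("setxattr");   /* a storage system without hint support may ignore or reject the key */

    /* Bottom-up: the storage system exposes where the data actually lives so
     * the scheduler can place the consuming task on the same node. */
    char location[256] = {0};
    ssize_t n = getxattr(path, "user.woss.location", location, sizeof(location) - 1);
    if (n > 0)
        printf("schedule next task near: %.*s\n", (int)n, location);

    return 0;
}

Because a storage system that does not understand these keys simply stores or rejects them, the same binary runs unmodified against a legacy file system; it merely forgoes the optimization, which is exactly the incremental adoption path argued for above.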
This approach will bear its benefits only if two assumptions hold true: first, it is feasible to build a standard cross layer communication channel that does not introduce significant overhead and can enable sizable optimizations, second, specializing the storage system brings significant performance gains. As a first step to explore the potential benefits of cross layer optimization in storage system, this work conducts two preliminary evaluation studies to evaluate these two assumptions: 71    Study the feasibility of building a cross layer communication channel, and evaluate its overhead and benefits (section 3.4). This work studies the feasibility and evaluates the potential benefits building a workflow aware storage system (using cross layer optimizations) can bring to workflow applications.    Evaluate the benefits of storage system specialization for a specific workload while preserving the POSIX API (section 3.5). This study evaluates the potential benefits storage system specialization brings to checkpointing. To this end, I built stdchk [30], a storage system specialized for checkpointing workloads, and evaluate the potential performance and scalability gains of the cross layer optimizations design approach.  3.1.1 Chapter Structure The rest of this chapter first highlights this work contributions (section 3.2), presents related background (section 3.3), presents the two preliminary evaluation studies: evaluating the feasibility of building a workflow optimized storage system and evaluating its performance limits (section 3.4), and evaluating the performance gains of specializing the storage system for checkpointing applications (section 3.5). Then the chapter presents the workflow optimized storage system design (section 3.6) and evaluation (section 3.7), and surveys the related work (section 3.8). The chapter discusses related issues and concludes in (section 3.8).  3.2 Contributions The proposed approach falls in the category of ‘guided mechanisms’ (i.e., solutions for applications to influence data placement, layout, and lifecycle -- i.e., re-use, longevity and importance), the focus of other projects as well. In effect, the wide range (and incompatibility!) of past such solutions proposed in the storage space in the past two decades (and incorporated to some degree by production systems - pNFS, PVFS, GPFS, Lustre, and other research projects [33, 37, 38, 70, 135, 142]), only highlights that adopting a unifying abstraction is an area of high potential impact. The novelty of this approach comes from the "elegant simplicity" of the proposed solution. Unlike past work, this approach maintains the existing API (predominantly POSIX compatible), and, within this API, it proposes using the existing extended file attributes (defined in attr/xattr.h library) as a flexible, applicationagnostic mechanism to pass information across the application/storage divide. This work demonstrates that significant improvements are possible, without abandoning 72  POSIX and that it is feasible to build a POSIX compliant storage system optimized for each application (or application mix) even if the application exhibits a heterogeneous data access pattern. This project demonstrates this approach by building a POSIX-compatible storage system to efficiently support the execution of scientific workflows. I chose this application domain as this community has to support a large set of legacy applications (developed using the POSIX API). 
The storage system aggregates the resources of the computing nodes allocated to a batch application (e.g., disks, SSDs, and memory) and offers a shared filesystem abstraction with two key features. First, it is able to efficiently support the data access patterns generated by workflows through file-level optimizations. To this end, the storage system takes hints that offer information about the expected access pattern on a specific data item or collection of items and guides the data layout (e.g., file and block placement, file coplacement). Second, the storage system uses custom metadata to expose data location information so that the workflow runtime engine can make location-aware scheduling decisions. These two features are key to efficiently support workflow applications as their generated data access patterns are irregular and application dependent. The key contributions of this work are:   This project proposes a new approach that uses custom metadata to enable cross-layer optimizations between applications and the storage system. Further, this work argues that an incremental adoption path exists for adopting this approach. This suggests an evolution path for co-designing POSIX-compatible file-systems together with the middleware ecosystem they coexist within such that performance efficiencies are not lost and flexibility is preserved, a key concern when aiming to support legacy applications.    To demonstrate the viability of this approach, this project present the design of a workflow-optimized storage system (WOSS) based on this approach. The proposed design provides generic storage system building blocks that can be adopted to support a wider range of cross-layer optimizations. Based on these building blocks, the design supports data access patterns frequently generated by workflows by enabling the workflow runtime engine to pass per-file/collection access hints and the storage to expose data location and thus enable location-aware task scheduling. Importantly, the project demonstrates that it is possible to achieve the sought goals without changing the 73  application code or tasking the application developer to annotate their code to reveal the data usage patterns.   The project offers an open-source implementation of the system. The system prototype was integrated with two workflow runtime engines (pyFlow, developed by ourselves, and Swift [154]). On the storage side, I started from MosaStore [13] and added the ability to offer and react to hints. On the workflow runtime side, this project adds data-location aware scheduling.    The evaluation demonstrates, using synthetic benchmarks as well as three real-world workflows that this design brings sizeable performance gains. On a commodity cluster, the synthetic benchmarks reveal that, compared to a traditionally designed distributed storage system that uses the same hardware resources, WOSS achieves 30% to up to 2x higher performance depending on the access pattern. Further, compared to a NFS server deployed on a well provisioned server-class machine (with multiple disks, and large memory), WOSS achieves up to 10x performance gains. (NFS only provided competitive performance under cache friendly workloads due to its well provisioned hardware). Further, under real application, WOSS enabled 20-30% application-level performance gain, and 30-50% gain compared to NFS. 
Finally, the evaluation on the BG/P machine at ANL shows that WOSS can scale to support larger workloads and enables sizable gains compared to the deployed backend storage (GPFS).  3.3 Background This section starts by briefly setting up the context: the target application domain and the usage scenario. It then continues with a summary of data access patterns of workflow applications. Section 3.8 surveys the related work on alleviating the storage bottleneck. The application domain: workflow applications. Meta-applications that assemble complex processing workflows using existing applications as their building blocks are increasingly popular in the science domain (e.g., modFTDock [8], Montage [97], PTMap [50]). While there are multiple ways to support the execution of these workflows on largescale machines, in the science area — where a significant legacy codebase exists — one approach has gained widespread popularity: a many-task approach [121] in which metaapplications are assembled as workflows of independent, standalone processes that communicate through intermediary files. 74  There are three main advantages that make most workflow runtime engines adopt this approach and use a shared file-system to store the intermediary files: simplicity, direct support for legacy applications, and support for fault-tolerance. First, a shared file-system approach simplifies workflow development, deployment, and debugging: essentially workflows can be developed on a workstation then deployed on a large machine without changing the environment. Moreover, a shared file-system system simplifies workflow debugging as intermediate computation state can be easily inspected at runtime and, if needed, collected for debugging or performance profiling. Second, a shared file-system will support the legacy applications that form the individual workflow stages as these generally use the POSIX API. Finally, compared to approaches based on message passing, communicating between workflow stages through a storage system that offers persistency makes support for fault-tolerance much simpler: a failed execution step can simply be restarted on a different compute node as long as all its input data is available in the shared file-system. Although these are important advantages, the main drawback of this approach is decreased performance: a shared file-system abstraction (reached through the POSIX API) constrains the ability to harness various performance-oriented optimizations that are only enabled by passing hints from application to storage or from storage to application. Usage scenario: batch applications. Since, on large machines, the back-end file-system becomes a bottleneck when supporting I/O intensive workflows [159], today’s common way to run them is to harness some of the resources allocated by the batch-scheduler to the application and assemble a scratch shared file-system that will store the intermediary files used to communicate among workflow tasks. This usage scenario is similar to the one explored by BAD-FS [41]: the file system acts as a scratch space and offers persistence only for the duration of the application; input data and results are staged-in/out. It is this batch-oriented scenario, described in more detail in Figure 36 and its legend, that is assumed for the rest of the chapter. 
Note that the shared file-system offered facilitates integration and has become popular in other scenarios as well (e.g., checkpointing [99], stage-in/out [107], analytics [46], in-memory analysis of intermediate results, visualization).

Figure 36. Usage scenario and high-level architecture. The workflow-optimized storage system (WOSS) aggregates the storage space of the compute nodes and is used as an intermediate file-system. Input/output data is staged in/out from the backend storage. The workflow scheduler queries WOSS for data location to perform location-aware scheduling. The scheduler submits tasks to individual compute nodes and includes hints on the data usage patterns.

3.3.1 Data Access Patterns of Workflows
Several studies explore the data access patterns of scientific workflows [43, 92, 140, 155, 157]. To make this chapter self-contained, this subsection briefly presents the common patterns identified by these studies, the opportunities for optimization they generate (Table 5), and the support required from a storage system:
• Pipeline: A set of compute tasks are chained in a sequence such that the output of one task is the input of the next task in the chain. An optimized system will store an intermediate output file on a storage node on the same machine as the one that executes the task producing it (if space is available) to increase access locality and efficiently use local caches. Ideally, the location of the data is exposed to the workflow scheduler so that the task that consumes this data is scheduled on the same node.
• Broadcast: A single file is processed by a number of compute nodes at the same time. An optimized storage system can create enough replicas of the shared file to eliminate the possibility that the node(s) storing the file become overloaded, resulting in a performance bottleneck.
• Reduce: A single compute task uses as inputs files produced by multiple computations. Examples include a task that checks the results of previous tasks for a convergence criterion, or a task that calculates summary statistics from the output of many tasks. An optimized storage system can intelligently place all these input files on one node and expose their location, thus creating an opportunity to schedule the reduce task on that node to increase data access locality.
• Scatter/Gather: A single file is written to/read from by multiple compute nodes, where each node accesses a disjoint region in the file. An optimized object storage system can optimize its operation by setting the storage system block size to the application file-region size, using intelligent placement of the file's blocks, and by exposing block-level data location.
• Reuse: A single file is read multiple times by one or multiple tasks scheduled on the same compute node. An optimized storage system will use informed caching to pin the file and make it available for reuse.
• Distribute: A single task produces a collection of files, each consumed by one subsequent task. An optimized storage system can spread the files across storage nodes, effectively distributing the storage and compute workload.

Table 5. Popular workflow data access patterns. Circles represent computations. An outgoing arrow indicates that data is produced (to a temporary file) while an incoming arrow indicates that data is consumed (from a temporary file). There may be multiple inputs and outputs via multiple files. (Notation similar to that used by Wozniak et al. [155]).
Arrows are labeled with the extended attribute API calls used to pass the hints that enable the optimizations. (The corresponding hints are presented in detail in Table 12.) The pattern diagrams themselves are omitted here; the table lists, for each pattern, the corresponding optimizations and hints.

Pattern      Optimizations / Hints
Pipeline     Node-local data placement (if possible). Caching. Data location-aware scheduling.
Broadcast    Optimized replication taking into account the data size, the fan-out, and the topology of the interconnect.
Reduce       Reduce-aware data placement: co-placement of all output files on a single node. Data location-aware scheduling.
Scatter      Application-informed block size for the file. Application-aware block placement. Data location-aware scheduling.
Gather       Application-informed block size for the file. Application-aware block placement.
Reuse        Application-informed replication. Application-informed caching.
Distribute   Application-informed file and chunk placement. Application-informed replication.

3.4 The First Preliminary Evaluation Study: Workflow-Optimized Storage System: a Limit Study
(Emalayan Vairavanathan was the leader for this preliminary study. The results of this study are published in two papers [150, 151]. In this project I helped implement the MosaStore hardcoded optimizations, design the evaluation experiments, and lead the effort of writing the papers.)

This study [150, 151] investigates the feasibility and the performance benefits of a workflow-optimized storage system: a storage system that aggregates the resources of the computing nodes (e.g., disks, SSDs, and memory) and is able to efficiently support the data access patterns generated by workflows through optimizations at the file or directory level. To support specific data access patterns, the storage system uses hints [131] that indicate the expected access pattern and drive the data layout (e.g., co-placement, replication levels, chunk placement). The goal of this opportunity study is to evaluate the performance benefits of a workflow-optimized storage system before paying the full cost of prototyping it. To this end, the following methodology is used to understand the performance impact of per-file optimizations: identify the workflow access pattern characteristics (combining patterns described by Wozniak et al. [155] and adding patterns discovered in this study) and derive the file-level optimizations the storage system needs to support. Once these are identified, MosaStore was changed to support these patterns (section 3.4.1). However, these changes do not, at this point, amount to a new system design – they are generally hardcoded paths in the storage system with the single purpose of supporting the evaluation experiments (section 3.4.2).

Study results. This exploration shows that building a workflow-aware storage system is indeed feasible and can bring significant performance gains. It is feasible for two reasons: first, previous studies showed that workflows have a small set of common data access patterns, so a small set of storage optimizations is enough to serve these patterns; second, this study shows that these optimizations can be incorporated in a high-performance storage system without significantly increasing its complexity. Additionally, the evaluation using synthetic benchmarks shows that a workflow-optimized storage system can bring significant performance gains. Compared to a general distributed system that uses the same hardware resources, per-file optimizations and exposing data location enable 0.5x to 3x performance gains depending on the access pattern. Further, compared to a central NFS server deployed on a well-provisioned server-class machine (with multiple disks and large memory), a workflow-optimized storage system achieves up to 16x performance gains. (NFS only provided competitive performance under cache-friendly workloads due to its well-provisioned hardware.)

3.4.1 Hacks: Customizing MosaStore
A workflow-optimized storage system should provide per-file configuration at runtime to support high configurability for diverse applications' access patterns. Further, the workflow-optimized storage system should be workflow-engine friendly. That is, it should expose internal per-file/directory information (e.g., data location) that helps the workflow engine optimize its runtime decisions (e.g., data-location-aware scheduling). To mimic a workflow-optimized storage system and to evaluate its performance, MosaStore was customized for each pattern evaluated. This section briefly presents these hardcoded customizations (described in more detail in section 3.4.2 in the context of the evaluation experiments). Some of the customizations are incompatible with each other in the current MosaStore implementation (e.g., the different data placement schemes, as they are system-wide policies rather than per-file policies in the original MosaStore). For each experiment, the evaluation therefore enables/disables some of these changes in the code, recompiles the code, and redeploys the storage system. All optimizations described below use the fact that MosaStore exposes data placement through standard extended file attributes and assume that the workflow runtime engine can optimize its decisions using this information (i.e., the runtime engine can schedule computations close to where data is located).
• Optimized data placement for the pipeline pattern. The MosaStore data placement module was changed to prioritize storing output files produced by a pipeline stage at the node where the task corresponding to that stage runs. If the data does not fit on the local node, then the file's chunks are shipped remotely through the normal MosaStore mechanisms.
• Optimized data placement for the reduce pattern. MosaStore was changed to co-locate all the output files of a workflow stage followed by a reduce stage on a single pre-specified storage node. If the data does not fit on that node, file chunks are shipped remotely through the normal MosaStore mechanisms.
• Replication mechanism optimized for the broadcast pattern. To avoid the storage nodes holding a file used in a broadcast pattern becoming a bottleneck, the replication factor was increased for these files. The default MosaStore lazy replication mechanism was changed to eager parallel replication: replicas are created eagerly while each chunk is written to storage.
• Optimized data chunk placement for the scatter and gather patterns. Unlike the other patterns described above, which require optimizations at the file level, scatter and gather require chunk-level optimizations, as a single file's chunks are accessed by a number of compute tasks in parallel. Consequently, the experiment sets the MosaStore chunk size to match the application's per-node access region size and constrains the MosaStore data placement such that the location of each chunk of a file can be determined. Further, the hacks optimize the scheduling decision to run the compute task on the node that holds the specific file chunk accessed by that task.
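All of the customizations above reduce to a small decision made when a file is created: where to place its chunks and whether to replicate eagerly. The C fragment below sketches how such a per-file policy switch might look if it were driven by a hint at runtime; the hint strings, node identifiers, replica heuristic, and helper type are hypothetical placeholders, not MosaStore's actual internals.

#include <string.h>

/* Hypothetical placement decision made by the metadata manager at file-create
 * time.  'hint' would arrive over the extended-attribute channel; the node ids
 * are placeholders. */
typedef struct { int node_id; int replicas; int eager; } placement_t;

placement_t choose_placement(const char *hint, int writer_node,
                             int reduce_collector, int fanout) {
    placement_t p = { .node_id = -1, .replicas = 1, .eager = 0 }; /* -1: default striping */

    if (hint && strcmp(hint, "pipeline") == 0) {
        p.node_id = writer_node;          /* keep the output on the producing node */
    } else if (hint && strcmp(hint, "reduce") == 0) {
        p.node_id = reduce_collector;     /* co-locate all inputs of the reduce task */
    } else if (hint && strcmp(hint, "broadcast") == 0) {
        p.replicas = fanout / 2;          /* illustrative replica count, not a tuned value */
        p.eager = 1;                      /* replicate while the file is being written */
    }
    return p;
}

In the hacks described above the policy is instead compiled in and the system redeployed per experiment; section 3.6 turns the same decision into a runtime, per-file mechanism driven by the extended-attribute hints.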
3.4.2 Experimental Results The evaluation uses both micro benchmarks and application-level synthetic benchmarks to evaluate the customized MosaStore storage system. Data locality plays a key role to improve the storage performance. To quantitatively evaluate its impact in realistic settings a micro benchmark was designed to evaluate the cost of accessing a local vs. remote storage node to serve application’s data requests (section 3.4.2.1). Since the real scientific workflows are complex and often have multiple I/O patterns with several stages [43, 65, 92, 140, 155, 157], the application-level synthetic benchmarks were designed (section 3.4.2.2) to mimic the data access pattern of the scientific workflows. 3.4.2.1 Micro Benchmark: The Impact of Locality Experiment setup: MosaStore was deployed with one storage node and one SAI in two different setups: First, to evaluate the cost of accessing a local file, the experiment deploys the storage node and the SAI on the same machine. Second, to evaluate the cost of accessing remote files, the experiment deploys the storage node and SAI on two different machines. In both setups the manager was deployed on a separate machine to keep the metadata cost constant across the experiments. Each machine has Intel Xeon E5345 4-core, 2.33-GHz CPU, 4-GB RAM, 1-Gbps NIC, and a 300-GB 7200-rpm SATA disks. 81  Customizations: The default MosaStore system uses regular sockets [85] to communicate between the storage nodes and the SAI. The regular socket uses the standard network stack; hence, it adds an additional overhead when the SAI and the storage node are collocated on the same physical node. MosaStore was changed to use domain sockets [85] and partially eliminate this overhead in this situation. The reason is that domain sockets use shared memory to communicate instead of the network stack, while, at the same time, support the standard socket APIs. The workload: The micro benchmark sequentially writes 30 files of 1 GB via a single SAI and then sequentially reads these files. The experiment uses large files and write/read the files back to back to reduce the effect of caching especially when data reside on disk. The benchmark reports the write/read throughput. Evaluation results: The experiment evaluates the achievable performance gain due to locality while having the data chunks stored on either RAMdisk or spinning-disk. Figure 37 presents the I/O throughput for the following configurations: the local storage node when using the domain socket (labeled ‘Domain’ in the figure), the local storage node with regular socket (‘Regular’), and the remote storage node with regular socket (‘Remote’). Additionally, for comparison, the evaluation present the results of running the same benchmark when using the native file systems (ext3 on spinning-disk and tmp-fs on RAMdisk - labeled as ‘Local’ in the figure) which represent ideal baselines, and eliminate all MosaStore overheads. Figure 37 presents the I/O throughput when the storage node is backed by spinning disk (left plot) and RAMdisk (right plot). For each plot there are two sets of columns presenting the write and, respectively, the read throughput. The results lead to following observations. When data chunks are stored on spinning-disk, locality does not have a pronounced impact on the read throughput; the reason is that in this case the disk itself is the bottleneck (Figure 37). Locality, however, provides significant performance gains for writes, even when data chunks are stored on disk. 
This is because the writes often hit the file system cache, hence the network becomes the bottleneck. When the storage node is backed by a RAMdisk, the network becomes the bottleneck in both cases, and both local reads and writes are much faster than remote reads and writes. Further, in most cases, accessing local data through domain sockets offers a performance advantage. Compared to accessing local data through regular TCP sockets, domain sockets offer 27%-47% (on RAMdisk) and 6%-10% (on spinning-disk) higher throughput in the four configurations studied.

Figure 37. I/O throughput when the storage node is backed by spinning disk (left plot) and RAMdisk (right plot). For each plot there are two sets of columns presenting the write and, respectively, the read performance. Note that the axes use different scales in the two plots. Figures represent average throughput, and standard deviation in error bars, over 30 reads/writes.

As expected, accessing data stored on a remote node leads to a throughput 52% to 84% lower (except in the read-from-spinning-disk case mentioned above). The performance penalty is magnified when the storage nodes are backed by RAMdisk instead of spinning disks. Finally, this experiment allows a first estimate of the overheads added by MosaStore when compared to a local storage system. While these overheads appear significant, note that the comparison is not entirely fair: the results compare a distributed file system (deployed such that some components, the manager in this case, are indeed remote, which increases the latency and decreases the throughput of I/O operations) with a local file system. As expected, when the storage node is backed by the much faster RAMdisk the throughput loss is much more pronounced than when the storage node is backed by spinning disk (up to 5.2x throughput loss for RAMdisk vs. up to 1.06x throughput loss for spinning disk).

3.4.2.2 Synthetic Benchmarks
The evaluation assesses the studied approach using a set of application-level synthetic benchmarks. The benchmarks were designed to mimic the data access patterns of scientific workflows. These benchmarks evaluate the impact of pattern-specific storage optimizations. The experiments run the synthetic benchmarks on storage nodes backed by either spinning disks or RAMdisks.

3.4.2.2.1 Experiment Setup
Current workflow processing often works as follows: workflow applications stage-in the input data from a backend storage system to an intermediate shared storage space, process the data in this shared space, and then stage-out the results, persisting them again on the back-end storage system. The intermediate shared storage is faster than the back-end storage and provides a high-performance scratch space to the application.

The experiment setup is similar to this scenario. Throughout the evaluation, it compares the performance of the following intermediate shared storage alternatives: a workflow-optimized storage system (i.e., the data-access-pattern-optimized MosaStore); a generic distributed storage system (an un-optimized MosaStore deployment); and an NFS server representing a back-end storage system as often found in large-scale computing machines. Note that an un-optimized MosaStore storage system is similar in architecture and design to a set of cluster storage systems such as Lustre.
Further, although NFS is not typically used in large-scale platforms, at this experiment's scale, with a setup of 20 machines, it fairly approximates a well-provisioned shared back-end storage system.

The evaluation uses a cluster of 20 machines. Each machine has an Intel Xeon E5345 4-core, 2.33-GHz CPU, 4-GB RAM, a 1-Gbps NIC, and a 300-GB 7200-rpm SATA disk. The system has an additional NFS server that runs on a well-provisioned machine with an Intel Xeon E5345 8-core, 2.33-GHz CPU, 8-GB RAM, a 1-Gbps NIC, and 6 SATA disks in a RAID 5 configuration. The cluster is used to run one of the shared storage systems (MosaStore with either the default code or with the changes made to mimic a workflow-optimized storage system) and the synthetic applications. One node runs the MosaStore manager and 19 run the storage nodes, the MosaStore SAI, and the application itself. With the NFS configuration, NFS runs on the above-mentioned server and the application on the other 19 nodes.

The sets of synthetic application benchmarks fit the standard workflow application model (stage-in, workflow execution, and stage-out) and are composed of read/write operations that mimic the file access patterns described earlier. The benchmarks are purely I/O bound and provide an upper bound on the achievable performance for each pattern. This opportunity study looked at several real-world workflow applications [43, 92, 140, 155, 157] and selected four workload types with different file sizes. Figure 38 summarizes these application benchmarks.

Figure 38. Summary of synthetic benchmarks for pipeline, broadcast, reduce, and scatter patterns. Nodes represent workflow stages (or stage-in/out operations) and arrows represent data transfers through files. Labels on the arrows represent file sizes for the 'small' workload. The other workload sizes are presented in Table 6.

Table 6. File sizes for different workflow patterns. Each cell lists the input, intermediate, and output file sizes.
Dataflow pattern   Small                  Medium                  Large
Pipeline           100KB, 200KB, 10KB     100MB, 200MB, 1MB       1GB, 2GB, 10MB
Broadcast          100KB, 100KB, 1KB      100MB, 100MB, 1MB       1GB, 1GB, 10MB
Reduce             10KB, 10KB, 200KB      10MB, 10MB, 200MB       100MB, 100MB, 2GB
Scatter            100KB, 190KB, 1KB      100MB, 190MB, 1MB       1GB, 1900MB, 10MB

The rest of this section presents, for each synthetic benchmark (pipeline, broadcast, reduce, and scatter), the detailed experiments executed, the MosaStore customizations that support them, and the performance evaluation results.

3.4.2.2.2 Pipeline Pattern Evaluation
Customization. To efficiently support the pipeline pattern, the workflow-optimized storage system changes the MosaStore data placement mechanism to place newly created files on the node that produces them. This change supports fast access to the temporary files used in the pipeline pattern, as the next stage of the pipeline is launched on the same node.
The workload (Figure 38 - left). The experiment runs in parallel 19 application pipelines similar to the ones described in Figure 38.
Each of these pipelines stages-in a common input file from the shared back-end storage (i.e., the NFS server), goes through three processing stages, that read input from the intermediate store and write the output to the intermediate store, then the final output is staged out back to back-end (i.e., NFS). The cluster is used to run the MosaStore storage system and the synthetic application. One node runs the MosaStore manager and 19 run the storage nodes, the MosaStore SAI, and application scripts.  86  Figure 39. Pipeline pattern – small files. Average execution time (in seconds) for small file sizes. Error bars represent standard deviation for all stages of the workflow (the entire experiment time).  Figure 40. Pipeline pattern – medium files. Average execution time (in seconds) for medium-size file. Error bars represent standard deviations for the entire experiment.  Figure 41. Pipeline pattern large files. Average execution time (in seconds) for large file sizes.  87  Evaluation Results. Figure 39, Figure 40 and Figure 41 present the performance of the systems for small, medium, and large workloads. The figures present distinctly (as stacked bars) the performance for data staging (stage-in time plus stage-out time) and the performance for the pipeline stages that touch the intermediate storage system. The experiment uses with four possible intermediate storage configurations: (1) a local file system (labeled ‘local’ in the plots) which represents the best possible performance and is presented as a baseline for comparison; (2) NFS itself used as intermediate storage(labeled ‘NFS’ in the plots); (3) MosaStore applying standard configuration and optimization techniques (labeled ‘MS RAM’ or ‘MS DISK’ depending on whether the storage nodes are backed by RAMdisk or spinning disk); and (4) a MosaStore with modifications to become workflow aware (labeled ‘WFRAM’ or ‘WFDISK’). Note that the ‘large’ workload for three configurations could not be executed: The NFS crashes (or takes unreasonably long time) under this workload and there isn’t enough space to execute this workload with RAM based storage nodes. For all scenarios, the workflow-optimized system performs faster than NFS and MosaStore un-optimized, and is close to the performance of the local file system. The larger the file sizes, the larger the difference between the workflow-optimized setup and the other two alternatives of shared intermediate storage. For medium files, the workflow aware storage is 10x faster than NFS, and almost 2x faster than vanilla MosaStore. For large files (1GB), this difference is even larger, NFS is unable to properly handle the demand generated and we stopped the experiments after 200 minutes. Further with large files, the RAMdisk experiments could not run due to the memory limitation in the cluster. The local configuration presents the optimal data placement decision for the pipeline pattern, serving as a baseline. In both experiments the workflow aware storage (‘WFRAM’ and ‘WFDISK’) lags behind the local storage due to added overhead of metadata operations and additional context switches and memory copies introduced by fuse user-level file system. 3.4.2.2.3 Broadcast Pattern Evaluation Customization. To efficiently support the broadcast pattern for the workflow-optimized system, the study adds eager replication to the MosaStore base system (the system originally supported lazy replication only). 
With eager replication replicas are created in parallel, while a file is written to the storage system (if replication is needed for that file). A broadcasted file 88  will be eagerly replicated by the storage system thus reducing the likelihood of a bottleneck when a file is consumed by multiple concurrent workflow stages. The workload (Figure 38 center left).An input file is staged-in to the intermediate storage from the back-end storage (i.e., the NFS). Then the first stage of the benchmark reads the input file and produces a broadcast-file on the intermediate storage. In the second stage, the broadcast-file is read by 19 processes running in parallel on 19 different machines. Each of these processes writes its output independently on the intermediate storage. As a last stage, the output files are staged-out to the back-end storage in parallel. Evaluation Results. Figure 42 and Figure 43 present the performance for this benchmark for ‘medium’ and ‘large’ workloads, while varying the number of replicas created. WF performs better than MStore (i.e., no replication), reaching the best performance for 8 replicas for medium files and 4 replicas for large files. This result matches the expectation of the potential benefits of WOSS approach. For more replicas than this optimal number, the overhead of replication is higher than the gains of adding more data access points. A similar pattern can be observed for small files; in this case, replication does not pay off at all. To better understand the trade-off between adding more access points and creating extra replicas, Figure 44 shows the breakdown of the benchmark phases. As the number of replicas increases, the time to process the data (the ‘workflow’ line) decreases and the time to create the replicas increases.  Figure 42. Average execution time for broadcast synthetic benchmark with medium workload. All storage systems are deployed on spinning disks.  89  Figure 43. Average execution time for broadcast synthetic benchmark with large workload. All storage systems are deployed on spinning disks.  Figure 44. Breakdown of broadcast benchmark for the ‘medium’ workload. It shows the time to create requested replicas and the actual workflow time separately. 3.4.2.2.4 Reduce Pattern Evaluation Customization. To efficiently support the reduce pattern, for workflow awareness the MosaStore data placement was changed such that all output files of one stage are co-located on a pre-specified storage node. The synthetic application using the reduce pattern runs the reduce application on the nodes storing all the files increasing file access locality. The workload (Figure 38- center right).During the stage-in phase 19 input files are stagedinto the intermediate storage from the back-end storage. In the first stage of the benchmark 19 executables, running in parallel on 19 different machines, each reads an input files and produce an intermediate file. In the next stage a single executable reads the intermediate files and produces the reduce-file (final output). The reduce-file is staged-out to the back-end store (the NFS).  90  A. Small workload  B. Medium workload  C. Large workload Figure 45: Reduce pattern. Average benchmark execution time (in seconds). Evaluation Results. 
Figure 45 shows the benchmark runtime for all three workloads and the five different configurations of the intermediate storage system (intermediate storage on NFS, MosaStore and the workflow aware system and, for the last two options, with storage nodes using RAMdisk and spinning disk) With spinning disk configuration, for medium and large files, workflow-optimized is between 3.9x (with large files) to 3.4x (with medium files) faster than NFS and 1.2x (for large files) to 2.25x (for medium files) faster than MosaStore default configuration. NFS performs relatively similarly to the other options for small files: this happens because the advantages offered by the faster intermediate system are cancelled by its additional overheads that start to dominate for small files. For the large workload, workflow time on WF DISK is longer compared to MS DISK. This happens because during the reduction phase the data is on the spinning-disk and the disk throughput becomes a bottleneck given the concurrency of several clients in parallel to write 91  the data. However, WF RAM has significantly shorter workflow time since the entire data is on RAMdisk without the throughput bottleneck of spinning disks. With WF RAM configuration, workflow aware storage system achieves the highest performance with medium and large files workload, up to 2.6x times faster than MS_RAM and up to 1.9x faster than WF DISK. This is mainly due to a significant reduction in the workflow execution time. This reduction in workflow execution time is due to the optimized data placement in workflow aware storage that increases the data locality. 3.4.2.2.5 Scatter Pattern Evaluation Customization. To efficiently support the scatter pattern, the study applies two modifications to MosaStore: first, enable configuring the storage system’s chunk size in order to match the application-level scatter ‘region’ size (i.e., the region of the file that will be read by a single application), and second, modify the MosaStore data placement to collocate all the chunks belonging to a particular file region on the same node. The workload (Figure 38 - right).Initially an input file is staged-in to the intermediate storage from the back-end storage (i.e., the NFS). The first stage of the workflow reads the input file and produces a scatter-file on intermediate storage. In the second stage, 19 processes run in parallel on different machines. Each process reads a disjoint region of the scatter-file and produces an output file. Finally, at the stage-out phase, the 19 output files are copied to the back-end storage. Evaluation Results. In experiments with MosaStore and workflow-optimized storage, the scatter benchmark spends equal amount of time in staging in the input file (12 seconds on average for the large workload) and creating the scatter file (27.5 seconds on average for the medium workload). The stage in time and file creation time are significant, amounting to 7090% of the benchmark time. Staging time and scatter file creation time on NFS was significantly slower than the other systems. For clarity of the presentation Figure 47 presents only the runtime of the scatter stage (stage-2) for the large workload. Further, for small and medium workloads, the evaluation results were inconclusive due to high variance; hence are not presented here. With spinning disk configuration workflow-optimized storage is around 8.1x faster than NFS and 1.5x faster than MosaStore default configuration. 
With RAMDisk configuration, workflow aware storage system achieves the highest performance with medium and large 92  files workload, up to 10.4x times faster than NFS and 2x faster than MosaStore default configuration. This is mainly due to the application customized data placement in the workflow aware storage system that significantly increases data access locality.  Figure 46: Scatter pattern large files. Average execution time (in seconds) and standard deviation for the scatter stage of the benchmark (large file sizes) 3.4.3 Summary This study explores the viability of a workflow-optimized storage system. To this end, it discusses the feasibility of building a workflow-optimized storage system that can provide per-file optimizations, and evaluates experimentally the performance gains such system can bring. The evaluation shows that a workflow-optimized storage system can bring for some data access patterns significant performance gains: up to 3x performance gain compared to a distributed storage system and up to 10x compared to a well provisioned NFS server. For other data access patterns the performance gains are more moderate yet still significant.  3.5 The Second Preliminary Evaluation Study: A Checkpoint Storage System for Desktop Grid Computing This study studies the feasibility of building a specialized storage system while preserving the POSIX API, and evaluates the potential benefits storage system specialization brings. The study is conducted in context of storage system supporting checkpointing workloads. To this end, I built stdchk [30], a storage system specialized for checkpointing workloads, and evaluate the potential performance and scalability gains of the cross layer optimizations design approach. Checkpointing is an indispensable technique to provide fault tolerance for long-running high-throughput applications like those running on desktop grids. This project argues that a checkpoint storage system, optimized to operate in these environments, can offer multiple 93  benefits: reduce the load on a traditional file system, offer high-performance through specialization, and, finally, optimize data management by taking into account checkpoint application semantics. Such a storage system can present a unifying abstraction to checkpoint operations, while hiding the fact that there are no dedicated resources to store the checkpoint data. stdchk is a checkpoint storage system that uses scavenged disk space from participating desktops to build a low-cost storage system, offering a traditional file system interface for easy integration with applications. This section presents the stdchk architecture (section 3.5.3), key performance optimizations, and its support for incremental checkpointing and increased data availability. The evaluation (section 3.5.4) confirms that the stdchk approach is viable in a desktop grid setting and offers a low-cost storage system with desirable performance characteristics: high write throughput as well as reduced storage space and network effort to save checkpoint images. 3.5.1 Context Checkpointing is an indispensable fault tolerance technique adopted by long-running applications. These applications periodically write large volumes of snapshot data to persistent storage in an attempt to capture their current state. In the event of a failure, applications recover by rolling-back their execution state to a previously saved checkpoint. The checkpoint operation and the associated data have unique characteristics. 
First, applications have distinct phases where they compute and checkpoint; often, these phases occur at regular intervals. Second, checkpointing is a write I/O intensive operation. Consider a job running on thousands of compute nodes: this scenario has the potential to generate thousands of files, amounting to terabytes of snapshot data at each timestep. Under these conditions, high-resolution checkpointing can easily overwhelm the I/O system. Third, checkpoint data is often written once and read only in case of failure. This suggests that checkpoint images are seldom accessed beyond the lifetime of an application run or even during the run. Finally, checkpointing, however critical it may be for reliability, is pure overhead from an application standpoint, as time is spent away from useful computation. To minimize this overhead, expensive high-throughput storage devices (SAN, parallel file system) are often used.  94  This project argues that the above characteristics of the checkpoint operation can be exploited to design a high-performance, yet low-cost, storage system, optimized to serve checkpointing needs. As a proof of concept this project focuses on a desktop grid scenario but the proposed architecture can be applied to any storage system integrating a large number of unreliable components (e.g., a cluster). Desktop grids are collections of loosely connected machines—within a single administrative domain—harnessed together to provide compute cycles for high throughput applications. Several academic and industry solutions support this scenario and current deployments aggregate thousands of nodes [100]. In a desktop grid, checkpointing stresses the storage system further as it is not only used to increase application reliability but also to support process migration when a user reclaims a machine. This section presents a checkpoint-optimized storage system that aggregates storage contributions from nodes participating in a desktop grid. A checkpoint-optimized storage system can bring significant benefits. First, such a system offloads the I/O intensive checkpoint operations from the traditional file system, thus alleviating the load on an expensive shared server. Second, this storage system can be optimized for the checkpoint operation. For example, it can reduce file system overhead associated with large writes as well as reduce data storage requirements. As a result, applications can checkpoint at a rate significantly higher than what is currently feasible with shared file systems. Third, checkpoint data is transient in nature and is often not maintained beyond the lifetime of a successful application run. Unlike a regular file system, a checkpoint storage system can be aware of this characteristic and act like a cache to purge or prune files using a combination of data usage, aging, and user specified policies. Finally, a checkpoint storage system needs to present only a unifying file system abstraction and can hide the fact that there are no dedicated resources to store the checkpoint data. Further, the storage system can even be built atop unreliable resources (storage donors), much like how a computational desktop grid itself is based on an unreliable substrate (cycle donors). The contributions of this work are as follows. The project presents stdchk, a checkpoint storage system for HPC applications in a desktop grid environment. 
Much like how stdin and stdout input/output systems are ubiquitously available to applications, I argue that checkpointing is an I/O intensive operation, requiring a special ‘data path’. This project  95  shows that this data path can be made available to HPC applications as a low-cost checkpoint-optimized storage system. While the proposed solution is based on MosaStore [13], stdchk is optimized for a different workload namely, high-speed writes of incremental versions of the same file. To this end, stdchk introduces several optimizations to render itself ‘checkpoint-friendly’ to HPC applications:   High sequential write throughput. stdchk exploits the I/O parallelism that exists inherently in a desktop grid to provide a suite of write-optimized protocols that enable checkpointing at throughputs higher than what is feasible in current desktop grid settings. The evaluation results indicate an observed application bandwidth of 110MB/s per node in a LAN connected desktop grid.    Support for incremental versioning. stdchk minimizes the size of the data stored using a novel solution to incremental checkpointing that exploits the commonality between successive checkpoint images. This work puts forth several heuristics that do not require application or operating system support to identify commonality between incremental versions of the same checkpoint image. Additionally, these heuristics are evaluated in the context of real applications. The results indicate a substantial reduction in checkpoint data size and generated network traffic. A desired side-effect of this approach is that it enables applications to checkpoint at a finer granularity.    Tunable data availability and durability. Since stdchk aggregates storage contributions from transient workstations, standard replication techniques are used to ensure data availability and durability. Further, applications can decide the level of data availability/durability they require.    Tunable write semantics. Additionally, stdchk gives applications the ability to choose between a write semantic that is pessimistic (the system call returns only after the desired level of replication is achieved) or optimistic (return immediately after data has been written safely once, while replication occurs in the background). This further gives applications control over the write throughput vs. data durability tradeoff.    Automatic pruning of checkpoint images. stdchk offers efficient space management and automatic pruning of checkpoint images. These data management strategies lay the foundation for efficient handling of transient data. 96    Easy integration with applications. stdchk provides a traditional file system API, using the FUSE (File system in user space) Linux kernel module [2], for easy integration with applications To summarize, the novelty of the proposed approach lies in recognizing that  checkpointing can benefit from specialized storage system support and in bringing to bear a combination of storage solutions to address this specific workload. stdchk builds on best practices from several domains: storage scavenging from peer-to-peer storage systems, striping from parallel I/O, incremental checkpointing from versioned storage and data archiving, replication/encoding from storage systems concerned with data durability and applies them to checkpointing. 
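The incremental-versioning support mentioned above hinges on detecting what two successive checkpoint images of the same file have in common, without help from the application or the operating system. The C sketch below shows the simplest heuristic in this spirit: split both images into fixed-size chunks, hash same-offset chunks, and count (or store only) the chunks that changed. The chunk size and the FNV-1a hash are illustrative choices rather than the ones stdchk uses; the actual heuristics are described and evaluated later in this chapter.

#include <stddef.h>
#include <stdint.h>

#define CHUNK_SIZE (256 * 1024)   /* illustrative chunk size */

/* Toy 64-bit FNV-1a hash; a real system would use a cryptographic hash. */
static uint64_t fnv1a(const unsigned char *buf, size_t len) {
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < len; i++) { h ^= buf[i]; h *= 1099511628211ULL; }
    return h;
}

/* Returns how many chunks of 'new_img' are identical (same offset, same hash)
 * to the previous image and therefore need not be stored again. */
size_t count_unchanged_chunks(const unsigned char *old_img, size_t old_len,
                              const unsigned char *new_img, size_t new_len) {
    size_t unchanged = 0;
    size_t limit = old_len < new_len ? old_len : new_len;
    for (size_t off = 0; off + CHUNK_SIZE <= limit; off += CHUNK_SIZE)
        if (fnv1a(old_img + off, CHUNK_SIZE) == fnv1a(new_img + off, CHUNK_SIZE))
            unchanged++;
    return unchanged;
}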
The rest of this chapter is organized as follows: section 3.5.2 presents the design considerations for a checkpointing-oriented storage system, section 3.5.3 details the stdchk design, while section 3.5.4 presents an extensive evaluation study. Section 3.5.5 surveys the related work, while section 3.5.6 concludes this study.

3.5.2 Design Considerations for a Checkpoint Storage System

Applications running in a desktop grid have the following options for storing checkpoint images.

Node-local storage. It is common practice for jobs running on the individual nodes in a desktop grid to checkpoint to their respective node-local storage. Local storage is dedicated and is not subject to the vagaries of network file system I/O traffic. Moreover, local I/O on even moderately endowed desktops offers around 30-50MB/s. However, local storage is bound to the volatility of the desktop itself. First, individual nodes in a desktop grid are usually relinquished when their owner returns, leaving little time to migrate checkpoint data. Second, desktops are themselves not highly reliable and failure is common. Thus, the locally stored data is lost when the node crashes.

Shared file systems. Alternatively, nodes within a desktop grid can also checkpoint to a shared, central file server. However, shared file servers are crowded with I/O requests and often have limited space. Further, the hundreds of nodes in the desktop grid—on which processes of a parallel application run—can flood the central server with simultaneous checkpointing I/O operations.

Distributed checkpointing. A desirable alternative, adopted by this work, is the construction of a distributed storage system optimized for a checkpointing workload. Such a targeted storage infrastructure can be built by aggregating storage contributions from the individual nodes within the desktop grid itself—much like how CPU cycles are aggregated to form the desktop grid.

3.5.2.1 Checkpoint I/O Workload Characteristics

This section summarizes the characteristics of a typical checkpoint workload in a desktop grid.

Applications typically create one file per process per timestep. Thus, a large parallel application, running for a few hours and checkpointing every 15 minutes, can easily create tens of thousands of files.

Applications have distinct compute and checkpoint phases. Parallel applications on thousands of nodes can simultaneously access the storage system to save their images.

Checkpoint data is transient in nature. Successive checkpoint images are produced throughout an application's lifetime. These images are accessed only in the case of a failure, process migration, or for debugging or speculative execution. Depending on the usage scenario, a checkpoint image may become obsolete at the end of a checkpoint interval when a new image is produced, after the successful execution of the application, or, in case of migration, at process restart time. Alternatively, a checkpoint image might be useful in the long term for debugging and speculative execution.

Low risk. Checkpoint image loss involves rolling back the computation to the image corresponding to the previous timestep. While, in the common case, this may affect the job turnaround time, data loss effects are dependent on the specific application execution scenario.

With these workload characteristics in mind, let's look at some design goals for a checkpoint storage system.

3.5.2.2 Design Goals

This section describes the desirable properties of a checkpoint storage system.

Performance.
Elmootazbellah et al. [67] identify the performance of the storage system as the key driver for checkpointing overheads. Consequently, the checkpoint storage should be optimized for write performance, while a reasonable read performance is necessary to support timely job restarts. 98    Easy-to-use interfaces. The storage system should provide an interface that enables easy integration with applications. Specialized libraries and interfaces, however optimized, cannot match the simplicity of file system interfaces.    Low overhead. Although file system interfaces are desirable their overhead should be minimal. For instance, the overhead involved in metadata management and synchronization, can all be minimized for checkpoint storage.    Support for incremental checkpoints and data sharing. To reduce the storage and I/O load, the storage system should be able to exploit data commonality between successive checkpoints.    Scalability. The storage system should scale to support a large number of simultaneous client requests. For instance, multiple nodes, on which a parallel application is running, will likely checkpoint all at once.    Flexible namespace. The storage system should provide a flexible naming scheme that enables easy identification of an entire set of checkpoint images as belonging to a particular application’s checkpoint operation. Additionally the namespace should support versioning.    Support for checkpoint image management. The storage system should include components to manage checkpoint image lifetime according to user specified policies: e.g., complete replacement of checkpoint images when a new image is produced, removal of all images at successful completion of application, or long-term storage of checkpoint images.  3.5.3 Stdchk: A Checkpoint-Friendly Storage System Storage space scavenging is a good base for building a low-cost storage system in desktop grids as parallelism in these environments is achieved by exploiting a deployment that is already in place rather than using expensive, dedicated devices. 3.5.3.1 System Architecture Overview. stdchk integrates two types of components: a metadata manager and a number of benefactor (or donor) nodes that contribute storage space to the system. Datasets are fragmented into smaller chunks that are striped across benefactor nodes for fast storage and retrieval. This basic model is common to other storage systems (e.g., GoogleFS, FreeLoader) as well. 99    The metadata manager maintains the entire system metadata (e.g., donor node status, file chunk distribution, and dataset attributes). Similar to a number of other storage systems stdchk adopts a centralized metadata manager implementation.    The benefactor nodes contribute storage space to the system. To facilitate integration, the design aims to decouple, to the extent possible, the metadata manager and the benefactor nodes, and in effect minimizes the set of functions benefactor nodes provide. Benefactor nodes interact with the manager to publish their status (on-/off-line) and free space using soft-state registration, serve client requests to store/retrieve data chunks, and run garbage collection algorithms. Data storage and retrieval operations are initiated by the client via the manager. To  retrieve a file, the client first contacts the metadata manager to obtain the chunk-map, (i.e., the location of all chunks corresponding to the file), then, the actual transfer of data chunks occurs directly between the storage nodes and the client. 
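To make this data path concrete, the sketch below simulates the retrieval protocol with in-memory stand-ins; the class and method names (Manager, Benefactor, client_read) are illustrative only and do not correspond to stdchk's actual interfaces or wire protocol.

# Minimal sketch of the stdchk retrieval path, simulated in memory.
# The manager serves only metadata (the chunk-map); chunk transfers go
# directly between the client and the benefactor nodes.
from concurrent.futures import ThreadPoolExecutor

class Manager:
    """Holds only metadata: for each file, an ordered chunk-map."""
    def __init__(self):
        self.chunk_maps = {}                 # path -> [(benefactor_id, chunk_id), ...]

    def get_chunk_map(self, path):
        return self.chunk_maps[path]

class Benefactor:
    """Contributes storage space and serves chunks directly to clients."""
    def __init__(self):
        self.chunks = {}                     # chunk_id -> bytes

    def get_chunk(self, chunk_id):
        return self.chunks[chunk_id]

def client_read(manager, benefactors, path):
    chunk_map = manager.get_chunk_map(path)              # 1) metadata lookup
    with ThreadPoolExecutor(max_workers=8) as pool:      # 2) fetch chunks in parallel,
        parts = list(pool.map(lambda e: benefactors[e[0]].get_chunk(e[1]), chunk_map))
    return b''.join(parts)                               #    bypassing the manager

Because the manager stays off the data path, the aggregate read (and, symmetrically, write) bandwidth can grow with the stripe width, which is what the striping design exploits.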
When a new file is written, stdchk cannot predict the file size in advance. Thus, storage space allocation is done incrementally. Clients eagerly reserve space with the manager for future writes. If this space is not used, it is asynchronously garbage collected. The manager also stores metadata regarding benefactor space contributions, file versioning and replication as described in this section. Reads and writes are performed in parallel to a stripe width of benefactors in chunks of fixed size, using a round-robin striping policy. The storage system is particularly geared for high-speed writes. To this end, stdchk offers a suite of write protocols and innovative storage system support for incremental checkpointing. In addition, stdchk offers tunable background replication and write semantics. The storage system is mounted under /dev/stdchk. Any file opened under this mounting directory is written to the aggregate storage system, thereby making stdchk easily available to client applications. The rest of this section describes stdchk’s main design choices. Session Semantics. A key decision shaping the design of a distributed storage system is the consistency model. Existing systems differ widely in terms of their write consistency semantics. Solutions range from unspecified consistency semantics (e.g., NFS [117]) to strong consistency, provided, for example through access serialization [88]. stdchk storage system provides session semantics [144]. Data commits are delegated to stdchk client proxies: when, the client application eventually performs a close() operation, the client proxy 100  will commit the chunk-map for the dataset to the manager. The fact that this operation is atomic ensures session consistency. Note that, strictly speaking, session semantic is not necessary for checkpointing operations as checkpoint images are immutable and have a single producer. However, introducing a clear and low-overhead consistency model gives a good path for future transitioning of stdchk towards a generic high-performance file system. Dealing with failures: Reliable writes. Since stdchk stores data in a distributed storage cloud, the failure of benefactor nodes needs to be addressed to ensure reliable operation. To this end, stdchk replicates data over multiple storage nodes. However, replication introduces a new question: should a write operation return immediately after the first replica of the data has been persistently stored or wait until all data reaches the desired replication level. The tradeoff is between data-loss risk and write throughput. A client can choose to be conservative (pessimistic) and wait until a desired level of replication is achieved before declaring a write operation successful. In this case, the client favors data durability over high write throughput. Alternatively, an optimistic client can return as soon as a chunk is written to the first benefactor and let the background replication process bring about the desired replication level. In this case, the client favors write throughput over data durability. The choice between optimistic and pessimistic writes is a configuration parameter. Data replication: User-defined replication targets. In the target environment, storage nodes are unreliable desktops. 
Any solution addressing data availability needs to account for the following: (1) facilitate fast writes so the application can quickly return to performing useful computation, (2) reliably store checkpoint data so that it is available if needed, and (3) provide good read performance to minimize restart delays. To this end, this work evaluated both erasure coding and replication. Erasure coding incurs significant computational overhead compared to replication. The checkpointing application has to compute the erasure code while writing the data. Alternatively, if this computation is performed in the background, after the write, it leads to significant network traffic to pull the different chunks to a single node, perform the encoding, and redistribute them. Further, data reads involve equivalent computational and network traffic overheads. Replication, on the other hand, incurs no computational overhead, but involves larger space overhead for the same degree of reliability. Replication can be implemented as a background task, thereby imposing minimally on the application. Further, replication is easier to implement as it involves less complex data management. Finally, since checkpoint data is mostly transient in nature, the space overhead is transient. In some cases, the application might choose to keep the images for a prolonged duration, in which case the data can be offloaded to more stable storage. For these reasons, stdchk uses replication to improve data durability. Replication is implemented as a background task initiated by the manager. The manager builds a shadow chunk-map for the checkpoint dataset that comprises a list of benefactors to host the replicas of the original chunks. The process of building a shadow-map is similar to the process of selecting a set of benefactors for a new write operation. The shadow-map is then sent to the source benefactors to initiate a copy to the new set of benefactors. Once the copy succeeds, the shadow-map is committed to the manager. Creation of new files has priority over replication to expedite applications' writes.

Garbage collection. To decouple, to the extent possible, benefactors from metadata management, the deletion operation is performed only at the manager, which results in orphaned chunks at benefactors. To reclaim space, benefactors periodically send the list of chunks they store and the manager replies with the set of chunks that can be garbage collected.

3.5.3.2 Write Optimizations for High Throughput

The stdchk implementation optimizes large, sequential writes, the most frequent operation in a checkpointing storage system. Depending on whether local I/O, remote I/O, or a combination of the two is used, and based on the overlap of these operations, the following designs are possible:

Complete local write. When enough local space is available, the write operation can temporarily dump the application data to the local node. Then, it can asynchronously push it out to stdchk when the application closes the file. The main advantage of this approach is its simple implementation. The drawback, however, is that it makes intense use of the local disk to store data that is eventually pushed out of the local node. Further, checkpointing to local storage does not protect against local node failures.

Incremental write. The node-local storage may not always have enough free space to temporarily store the entire checkpoint image. Even when space is available, it is not reliable.
Incremental writes limit the size of the temporary file; when the temporary file 102  size reaches a predefined limit, writes are redirected to a new temporary file, and the previous one is pushed to stdchk. Once all incremental files have been pushed out and after the application has issued a close(), the chunk-map for the complete file is pushed to the metadata manager. While this solution still uses local I/O, it overlaps data creation and remote propagation, leading to faster overall transfer to remote storage.   Sliding window write. To minimize the application perceived write response time, stdchk exploits the fact that, for most modern desktops, disk write bandwidth is lower than the network throughput achievable with standard Gbps NICs. The sliding window technique pushes data out of a memory buffer directly to stdchk storage. This method completely eliminates the use of local disk, and enables complete utilization of the stdchk storage throughput.  3.5.3.3 Support for Incremental Checkpointing A checkpoint image typically involves a dump of the application’s memory, comprising of data structures and other state variables. Incremental versions of the same application image may produce (partially) similar files. This property can be used to improve the write throughput and/or reduce storage and network requirements, ultimately providing support for checkpointing at a higher frequency. The challenge, however, is to detect similarities at runtime without operating system or application support. To investigate whether similarity between checkpoint images can be exploited in real settings this work addresses the following three interrelated issues. First, it evaluates the potential gains from detecting similarity between successive checkpoint images. Second, it evaluates heuristics to understand the degree to which the task of detecting file similarity can be efficiently implemented by the storage system without application or operating system support. Third, it designs the architecture to efficiently support these heuristics. This section presents the similarity detection heuristics explored and the required architecture while Section 3.5.4 presents a detailed performance evaluation using real-world application data. Heuristics to detect similarities. The generic problem of identifying the maximum common substring between two strings has a computational overhead O(n.m), where n and m are the lengths of the two strings. This is unacceptable in the context of file systems. We, therefore, evaluate two heuristics that offer lower overheads.   Fixed-size Compare-by-Hash (FsCH). This approach divides a file into equal-sized 103  chunks, hashes them and uses the hashes to detect similar chunks. The main weakness of this approach is that it is not resilient to file insertions and deletions. An insertion of only one byte at the beginning of a file prevents this technique from detecting any similarity.   Content-based Compare-by-Hash (CbCH). Instead of dividing the file into equal-sized blocks, CbCH detects block boundaries based on content (as suggested by Brin et.al. [48] and used by LBFS [110] and JumboStore [68] storage systems). CbCH scans the file using a ‘window’ of m bytes and, for each position of the window, computes a hash of the corresponding string. A chunk boundary is declared if the lowest k bits of the hash are all zero. Then, identification of chunk similarity proceeds as above, based on chunk hashes. 
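The sketch below illustrates the CbCH boundary detection just described; it is illustrative only (the hash function and bookkeeping are not stdchk's), with the window size m, the advance step p, and the number of zero bits k exposed as parameters, whose effect on chunk sizes is discussed next.

# Illustrative content-based chunking (CbCH): declare a chunk boundary where
# the hash of the current m-byte window has its lowest k bits equal to zero.
# p = 1 corresponds to the 'overlap' variant, p = m to the cheaper 'no-overlap' variant.
import hashlib

def cbch_chunks(data: bytes, m: int = 20, p: int = 1, k: int = 14):
    chunks, start, pos = [], 0, 0
    mask = (1 << k) - 1
    while pos + m <= len(data):
        window = data[pos:pos + m]
        h = int.from_bytes(hashlib.sha1(window).digest()[-8:], 'big')
        if h & mask == 0:                  # lowest k bits are zero -> boundary
            chunks.append(data[start:pos + m])
            start = pos = pos + m          # the next chunk starts after this window
        else:
            pos += p                       # slide the window by p bytes
    if start < len(data):
        chunks.append(data[start:])        # trailing data forms the last chunk
    return chunks

# Similarity between two checkpoint images is then detected by hashing each
# chunk and comparing the resulting chunk hashes (compare-by-hash).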
Statistically, k, the number of bits of the hash compared to zero allows controlling the average chunk size, while m, the window size, and p, the number of bytes the window is advanced every time, allow controlling the variation in chunk sizes and, additionally, influence the chunk size. Unlike FsCH, CbCH is resilient to data insertion/deletion, since inserting/deleting some bytes will only affect one block (two blocks if the changes are at a block boundary). The drawback is that CbCH requires hashing more data and, hence, results in larger computational overhead. Section 3.5.4.5 includes an extensive performance evaluation of these heuristics using  two real-world applications. The evaluation evaluated the rate of similarity detected and the computational overhead for application-/library-/VM-level checkpointing and different checkpoint intervals. The evaluation results suggest that FsCH is the best approach for stdchk due to the balance it offers between throughput and reduced space consumption as a result of similarity detection. Architectural support. To support these heuristics and to manage incremental checkpoint images efficiently, stdchk provides the following:   Content based addressability. stdchk provides content-based naming of data chunks, i.e., chunks names are based on a hash of their content. An additional advantage of using content-based naming is that it enables data integrity checks, a feature that can be used to prevent faulty or malicious storage nodes from tampering with the chunks they store.    Support for copy-on-write and versioning. Additionally, stdchk supports versioning and 104  copy-on-write, so that chunks that have been identified as similar can be shared between different file versions. When a new version of a checkpoint image is produced, only the new chunks need to be propagated to persistent storage. The new chunk-map will integrate the newly produced chunks and the chunks that have already been stored. 3.5.3.4 Support for Automated, Time-Sensitive Data Management The burden of managing large volumes of data (checkpoint or output data) HPC applications produce can become onerous. This project aims to add to the storage system, the intelligence to automatically manage files based on user-specified policies concerning their lifetimes. To this end, stdchk exploits the fact that checkpoint images are often used in a few standard scenarios. Most of the checkpoint data is time sensitive. For example, in a normal application scenario, checkpoint images are made obsolete by newer ones; while in a debugging scenario, all checkpoint images may need to be saved to enable debugging. stdchk supports this functionality through versioning, the use of a simple naming convention that helps recognize successive files from the same application, and the integration of user-specified metadata. By convention, files in stdchk are named as follows: A.Ni.Tj stands for an application A, running on node, Ni and checkpointing at timestep Tj. stdchk treats all images—from the many processes of application A running on nodes, N—as versions of the same file. Files from an application are organized within a folder for that application. The folder has special metadata concerning the time-related management of the files it contains. Currently stdchk supports the following scenarios:   No intervention. All versions (from multiple timesteps) are persistently stored indefinitely.    Automated replace. New checkpoint images make older ones obsolete.    Automated purge. 
Checkpoint images are automatically purged after a predefined time interval.  3.5.3.5 Providing a Traditional File System Interface The strong requirement for a file system-like API is motivated by two observations. First, a traditional API is crucial for adoption and increased usability of the storage system. Second, in the specific context of checkpointing systems, the libraries that support checkpointing are complex pieces of code that, in some situations, are executed in kernel mode. Modification or  105  even recompilation to integrate them with a custom storage system would be a high barrier to adoption and may be considered a security risk. A number of implementation alternatives are possible to provide a traditional file-system API. One approach is to build a Virtual File System (VFS) module [101]. The Linux kernel provides hooks for adding new file systems via loadable modules without requiring kernel recompilation. NFS [117], for example, is implemented using this approach. While this approach provides good performance, custom kernel modules are considered a security risk in some of the target deployment environments. Further, system administrators of the target platforms are reluctant to deploy non-standard kernel modules, as it increases the maintenance overhead, and often decrease the system reliability. stdchk adopts a user–space file system implementation for three additional reasons. First, to avoid the complexity and maintenance overhead of kernel-level development. Second, using a module that handles all the system call details allows the developer to focus on storage system logic rather on system calls and VFS details. Finally, the performance impact of the extra context switch can be mitigated by overlapping local I/O operations and the actual data transfer. Several projects have adopted this approach with reasonable performance results (e.g., Ceph [130]). This is also confirmed by the evaluation results. To implement a user-space file system, one option is to use an interception technique to catch file-system calls, filter those related to checkpointing, and redirect them to a user-level implementation. Parrot [148] and early versions of PVFS [49] adopted this approach. However, this approach results in high overhead to maintain the interception code along with the evolution of system-calls [49].  106  Figure 47. File system call path through FUSE. stdchk uses FUSE, a Linux kernel module, similar to the other VFS modules (e.g. NFS, ext3). Once a FUSE volume is mounted, all system calls targeting the mount point are forwarded to the FUSE kernel module, which preprocesses and forwards them to user-level file system callbacks (see Figure 47). When the callback function finishes processing the system call, FUSE post-processes the call and returns the results to VFS. FUSE is officially merged into the Linux kernel starting with 2.6.14 version, further simplifying adoption of stdchk user-space file system. stdchk user-space file system implementation maps the system calls to stdchk operations. Additionally, it handles granularity differences. For example, applications usually write in small blocks, while remote storage is more efficiently accessed in data chunks of the order of a megabyte. Further, the implementation is performance optimized for the target deployment scenario. 
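As a concrete illustration of this callback flow, the following toy file system uses the fusepy Python bindings; it is not stdchk's SAI (which is a native FUSE client), and the comments only indicate where stdchk would interact with the metadata manager and the benefactor nodes.

# A minimal in-memory user-space file system built with fusepy (pip install fusepy).
# Comments mark where stdchk's SAI would forward work to the manager/benefactors.
import errno, stat, sys, time
from fuse import FUSE, FuseOSError, Operations

class ToyCheckpointFS(Operations):
    def __init__(self):
        self.files = {}   # path -> bytearray; stdchk keeps a chunk buffer instead

    def getattr(self, path, fh=None):
        now = time.time()
        if path == '/':
            return dict(st_mode=stat.S_IFDIR | 0o755, st_nlink=2,
                        st_ctime=now, st_mtime=now, st_atime=now)
        if path not in self.files:
            raise FuseOSError(errno.ENOENT)
        # stdchk answers this from cached metadata, without contacting the manager
        return dict(st_mode=stat.S_IFREG | 0o644, st_nlink=1,
                    st_size=len(self.files[path]),
                    st_ctime=now, st_mtime=now, st_atime=now)

    def readdir(self, path, fh):
        return ['.', '..'] + [p.lstrip('/') for p in self.files]

    def create(self, path, mode, fi=None):
        self.files[path] = bytearray()       # stdchk: reserve space with the manager
        return 0

    def write(self, path, data, offset, fh):
        buf = self.files[path]
        buf[offset:offset + len(data)] = data  # stdchk: buffer into megabyte chunks and
        return len(data)                       # push full chunks to benefactor nodes

    def read(self, path, size, offset, fh):
        return bytes(self.files[path][offset:offset + size])

    def release(self, path, fh):
        return 0   # stdchk: commit the chunk-map atomically (session semantics)

if __name__ == '__main__':
    FUSE(ToyCheckpointFS(), sys.argv[1], foreground=True)   # argv[1]: a scratch mount point

Every callback above runs in user space; the extra user/kernel crossing this implies is precisely the overhead quantified in Section 3.5.4.1 (Table 7).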
The stdchk user-space implementation provides high-performance writes, improves read performance through readahead and high-volume caching, and caches metadata information so that most readdir and getattr system calls can be answered without contacting the manager.

3.5.4 Evaluation

The evaluation uses a range of micro- and macro-benchmarks. Except where specifically mentioned, a testbed composed of 28 machines is used. Each machine has two 3.0GHz Xeon processors, 1 GB RAM, two 36.5GB SCSI disks, and Gbps Ethernet cards. For all configurations, the results report averages and standard deviations over 20 runs.

3.5.4.1 Platform Characterization

The evaluation starts by characterizing the performance and the overhead of each individual component. The sustained write throughput on a local disk with write caches enabled was 86.2MB/s, while accessing a dedicated NFS server deployed on the same node achieved 24.8MB/s. Micro-benchmarks are also used to estimate the overhead due to the additional context switch any user-level file system entails. Thus, to evaluate FUSE module overheads I have built two simple file systems. The first one ('FUSE to local I/O' in Table 7) simply redirects all write requests back to the local file system. The second (/stdchk/null) ignores the write operation and returns control immediately. Table 7 presents the time to write a 1 GB file to the local disk and to these two file systems. The results show that FUSE overhead is very low, about 2%, on top of local I/O operations. The /stdchk/null performance indirectly indicates that the cost of the additional context switch using FUSE is about 32μs.

Table 7. Time to write a 1 GB file.
                      Local I/O   FUSE to local I/O   /stdchk/null
Average time (s)      11.80       12.00               1.04
Standard deviation    0.16        0.24                0.03

3.5.4.2 Write Throughput

The write implementation decouples the application write I/O from the actual network file transfer to benefactor nodes. Therefore, the evaluation defines two performance metrics to compare the various alternatives for write-optimized operations described in Section 3.5.3.2. First, the observed application bandwidth (OAB) is the write bandwidth observed by the application: the file size divided by the time interval between the application-level open() and close() system calls. Second, the achieved storage bandwidth (ASB) uses the time interval between the file open() and the moment the file is stored safely in stdchk storage (i.e., all remote I/O operations have completed). Figure 48 presents the OAB when the number of remote nodes to save data on (the stripe width) varies from one to eight benefactors. The evaluation experiments show that two contributing nodes with 1 Gbps NICs can saturate a client. However, when benefactors are connected by a lower link bandwidth (100Mbps), a larger stripe width is required to saturate a client (for a detailed discussion of these results please refer to this technical report [29]). The complete-local-write OAB is similar to that of FUSE local writes. This is not surprising since all data is written to a local temporary file and then transferred to storage nodes after the file close operation. Higher concurrency allows sliding window and incremental writes to perform better in terms of OAB (at around 110 MB/s). This high bandwidth translates to a shorter checkpoint operation as observed by the application. Further, the sliding window interface completely avoids local I/O and, hence, its performance is mainly influenced by the size of the memory buffers allocated.
Section 3.5.4.3 explores this effect.

Figure 48. The average observed application bandwidth (OAB) for three write-optimized techniques: complete local writes (CLW), incremental writes (IW), and sliding window (SW). For comparison, the figure also shows the throughput of writing to local I/O (Local I/O), to local I/O through the FUSE module (FUSE), and to a dedicated NFS server (NFS) running on the same node.

Figure 49. The average ASB for complete local writes (CLW), incremental writes (IW), and sliding window (SW).

Figure 49 presents the achieved storage bandwidth (ASB). Since complete-local-write serializes the local write and network transmission, it performs worst. Incremental and sliding-window write interfaces achieve better concurrency. The performance of the complete local write improves only slightly when adding more benefactors since the bottleneck is the local I/O. Sliding-window performs best and, in these experiments, saturates the Gigabit network card with only two benefactors. Figure 48 and Figure 49 also show that sliding-window write performance is slightly better than either local I/O or network file system based checkpointing. Even though the local I/O in this case is comparable to the write interfaces, data stored on local nodes is volatile due to the transient nature of contributed storage.

3.5.4.3 Effect of Configuration Parameters

This section investigates the impact of the size of the temporary file and memory buffers used on the write performance. Figure 50 and Figure 51 present the observed and achieved throughput for the sliding window write with different allocated memory buffers. As shown in the figures, the sliding window OAB increases with larger memory buffers as applications return quickly after writing. However, the memory buffer size variation does not impact ASB significantly as the application waits to write to stdchk.

Figure 50. The OAB for the sliding window write for different numbers of benefactors and allocated buffer sizes (in MB).

Figure 51. The ASB for sliding window writes for different numbers of benefactors and allocated buffer sizes (in MB).

With the incremental write interface, smaller temporary files result in higher throughput (OAB and ASB) due to higher concurrency in the write operation.

3.5.4.4 Write Performance on a 10Gbps Testbed

This experiment further tests stdchk's sliding window write performance on a testbed endowed with higher I/O, network, and processing capabilities. The testbed is composed of a single stdchk client and four benefactor nodes. The client has a 2GHz Xeon processor, 8GB of memory, a SATA disk, and a 10Gbps NIC. The benefactors have a 1.6GHz Xeon processor, 8GB of memory, SATA disks, and 1Gbps NICs. Figure 52 presents the observed and achieved throughput of the sliding window write interface. The buffer size is set to 512MB. With four benefactor nodes, stdchk is able to successfully aggregate the I/O bandwidth of the contributing benefactors and achieves up to 325 MB/s of OAB and 225 MB/s of ASB. (Note that the scale of this experiment is limited by the size of the testbed I had access to.)
This experiment shows that stdchk can efficiently integrate multiple benefactors to provide a high write throughput as the number of benefactor nodes grows.

Figure 52. The OAB and ASB of the sliding-window interface varying the stripe width for a testbed with a 10Gbps client and manager and four 1Gbps benefactors.

3.5.4.5 Incremental Checkpointing: Feasibility and Performance

This section presents evidence that supports the decision to include support for incremental checkpointing in stdchk. The experiments evaluated the potential gains from detecting similarity between successive checkpoint images and the performance of heuristics that detect similarity at the file system level, without application or operating system support. Additionally, the performance of the entire storage system is evaluated. The two heuristics compared (described in section 3.5.3.3), fixed-size compare-by-hash (FsCH) and content-based compare-by-hash (CbCH), differ in their efficiency of detecting similarities and in the imposed computational overhead. To quantitatively evaluate these heuristics along these two axes and to ground the comparison in the real world, checkpoint images from two popular scientific applications are used: a protein-substrate complex biomolecular simulation (called BMS [19] for brevity), and BLAST [34], a bioinformatics protein/nucleic acid sequence searching tool. BMS uses application-level checkpointing, and the BLAST experiments use library-based (using BLCR [86]) and virtual machine-based checkpointing (using Xen [40]). Table 8 presents the trace details.

Table 8. Characteristics of the collected checkpoints.
App    | Checkpoint type  | Interval (min) | # of checkpoints | Average size (MB)
BMS    | Application      | 1              | 100              | 2.7
BLAST  | Library (BLCR)   | 5              | 902              | 279.6
BLAST  | Library (BLCR)   | 15             | 654              | 308.1
BLAST  | VM (Xen)         | 5              | 100              | 1024.8
BLAST  | VM (Xen)         | 15             | 300              | 1024.8

Table 9 presents the average ratio of the detected similarity and the achieved throughput (in MB/s) for the two techniques. For each technique, the table presents the performance for key parameterization points (the technical report presents a more detailed study [29]). The results show that, in general: There is little similarity between checkpoint images collected using application-level techniques. This is due to the user-controlled, ideally-compressed format used to create these checkpoint images. The level of similarity for library-level checkpointing techniques is extremely high. For example, BLAST, using library-based checkpointing (BLCR), generates checkpoints with up to 84% average similarity between successive images. A surprising result is the near-zero similarity observed using virtual machine-based checkpointing. I have verified that this is due to the particular way in which Xen checkpoints. Xen optimizes for speed, and when creating checkpoints it saves memory pages in essentially random order. Further, to preserve the ability to recreate correct VM images, Xen adds additional information to each saved memory page. In the future, I plan to explore solutions to create Xen checkpoint images that preserve the similarity between incremental checkpoint images.

Table 9. Comparison of similarity detection heuristics. The table presents the average rate of detected similarity and the throughput in MB/s (in brackets) for each heuristic.
Technique                        | BMS, App (1 min) | BLAST, BLCR (5 min) | BLAST, BLCR (15 min)
FsCH, 1KB chunks                 | 0.0% (96)        | 25.0% (99)          | 9.0% (100)
FsCH, 256KB chunks               | 0.0% (102)       | 24.3% (110)         | 7.1% (112)
FsCH, 1MB chunks                 | 0.0% (108)       | 23.4% (109)         | 6.3% (113)
CbCH, overlap (m=20B, k=14b)     | 0.0% (1.5)       | 84.0% (1.1)         | 70.9% (1.1)
CbCH, no-overlap (m=20B, k=14b)  | 0.0% (28.4)      | 82% (26.6)          | 70% (26.4)
(BLAST, Xen, 5 or 15 min: low similarity for both the FsCH and CbCH techniques.)

From Table 9 the following are observed: FsCH has higher throughput (over 100MB/s) but pays in terms of similarity detection (when compared with CbCH) between successive checkpoints (similarity of up to 25%). With CbCH, aggressively configured to detect block boundaries, the similarity rate is extremely high. However, this significantly reduces the achievable throughput. When the window to detect block boundaries is advanced by one byte every time (labeled 'overlap' in Table 9), throughput degrades to as low as 1MB/s. Advancing the window with its size every time (labeled 'no-overlap' in Table 9) can improve throughput to about 26MB/s, which is still four times slower than FsCH. The CbCH results thus far present only an upper-bound for similarity detection but do not explore the tradeoff between similarity detection, throughput, and block size. Table 10 explores this tradeoff and presents the effect of varying m (the window size) and k (the number of bits compared to zero to detect a block boundary) on the CbCH no-overlap performance. The experiment uses the BLAST/BLCR trace with 5-minute checkpoint intervals. In general, as the window size m increases, the ratio of detected similarity decreases, mainly due to the reduced opportunity to detect block boundaries, leading to larger blocks. On the other hand, one can control the block size by varying the number of zero bits required to detect a boundary: lower k leads to smaller blocks. However, increasing k increases the variation in the block size (the table presents averages for the minimum and maximum detected block for each checkpoint image).

Table 10. The effect of m and k on CbCH no-overlap performance. The table presents the ratio of detected similarity (in percentage), the heuristic's throughput in MB/s, the average resulting checkpoint size in KB, and the average minimum and maximum chunk sizes (values for m in bytes and for k in bits).
k  | Metric              | m=20   | m=32   | m=64   | m=128  | m=256
8  | Similarity (%)      | 30.0   | 62.8   | 62.4   | 64.3   | 64.5
8  | Throughput (MB/s)   | 85.7   | 86.8   | 86.3   | 86.0   | 84.2
8  | Avg. size (KB)      | 519.2  | 522.4  | 530.7  | 547.3  | 579.5
8  | Avg. min size (KB)  | 325.1  | 275.6  | 210.1  | 350.2  | 257.1
8  | Avg. max size (KB)  | 614.3  | 627.3  | 668.9  | 787.3  | 967.9
10 | Similarity (%)      | 38.6   | 72.4   | 66.3   | 65.0   | 64.7
10 | Throughput (MB/s)   | 75.6   | 78.2   | 77.5   | 74.6   | 69.5
10 | Avg. size (KB)      | 539.3  | 552.5  | 584.7  | 654.8  | 778.9
10 | Avg. min size (KB)  | 265.9  | 283.9  | 294.7  | 409.2  | 380.8
10 | Avg. max size (KB)  | 893.9  | 890.0  | 1095.0 | 1491.2 | 2251.7
12 | Similarity (%)      | 77.3   | 73.4   | 65.6   | 63.0   | 60.7
12 | Throughput (MB/s)   | 47.0   | 53.6   | 50.2   | 52.3   | 53.6
12 | Avg. size (KB)      | 626.3  | 665.4  | 812.5  | 1076.3 | 1544
12 | Avg. min size (KB)  | 239.8  | 242.2  | 269.5  | 437.7  | 456.2
12 | Avg. max size (KB)  | 1683.8 | 1807.8 | 2632.5 | 3812.7 | 4510.4
14 | Similarity (%)      | 82.4   | 71.7   | 61.3   | 58.4   | 57.1
14 | Throughput (MB/s)   | 26.6   | 32.7   | 34.2   | 40.6   | 46.43
14 | Avg. size (KB)      | 930.8  | 1079.2 | 1635.6 | 2267.3 | 2908.6
14 | Avg. min size (KB)  | 514.9  | 232.0  | 449.5  | 528.8  | 506.8
14 | Avg. max size (KB)  | 3710.9 | 3639.5 | 4515.1 | 4662.2 | 4646.6

Since the stdchk write throughput is the main success metric, I have chosen to implement FsCH in stdchk. FsCH offers a good rate of similarity detection with higher data throughput while also providing a simpler implementation path.
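A minimal sketch of the adopted approach follows; the chunk size, hash function, and data structures are illustrative rather than stdchk's actual implementation. Combined with content-based chunk naming and copy-on-write versioning (Section 3.5.3.3), only the chunks not already stored need to travel to the benefactors.

# Illustrative FsCH: fixed-size chunking plus compare-by-hash against the
# hashes of chunks already stored for previous checkpoint versions.
import hashlib

CHUNK_SIZE = 1 << 20   # 1MB, the chunk size used in the evaluation

def fsch_chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE):
    return [(hashlib.sha1(data[i:i + chunk_size]).hexdigest(), data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]

def new_chunks(current_image: bytes, stored_hashes: set):
    """Return (chunk_map, to_store): the content-addressed map of the new
    version and only the chunks whose hashes are not already stored."""
    chunk_map, to_store = [], []
    for digest, chunk in fsch_chunk_hashes(current_image):
        chunk_map.append(digest)              # the new version's chunk-map
        if digest not in stored_hashes:
            to_store.append((digest, chunk))  # only these cross the network
    return chunk_map, to_store

# Example: if one megabyte of a 4MB image changes, only one chunk is stored again.
v1 = bytes(4 * CHUNK_SIZE)
v2 = v1[:CHUNK_SIZE] + b'x' * CHUNK_SIZE + v1[2 * CHUNK_SIZE:]
stored = {h for h, _ in fsch_chunk_hashes(v1)}
_, delta = new_chunks(v2, stored)
assert len(delta) == 1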
The StoreGPU project (Chapter 2) explores alternatives to provide a high-performance CbCH implementation by offloading the intensive hashing computations to the Graphics Processing Unit. The results above also shed light on one key element, namely the checkpoint interval. In both FsCH and CbCH (Table 9), the finer the checkpoint granularity, the higher the similarity between successive images. Consequently, there is a significant space and network effort saving. For instance, there is a 12% to 17% improvement in commonality detection between a 15-minute and a 5-minute checkpoint interval. Further, with FsCH, the checkpoint throughput is over 100MB/s. This suggests that integrating versioning and similarity detection with stdchk enables checkpointing at finer granularity as the throughput is high and the amount of data that needs to be stored decreases due to higher similarity.

Figure 53 presents the sliding window write's average observed (OAB) and achieved bandwidth (ASB) with and without FsCH (labeled FsCH and no-FsCH, respectively, in the figure). The test involved writing 75 successive checkpoint images of BLAST using BLCR. The sliding window write is tested with different memory buffer sizes and with four benefactors. The checkpointing interval is 5 minutes, the chunk size is 1MB, and the average checkpoint size is 280MB. The figure shows a slightly degraded write performance when similarity detection using FsCH is enabled. The main advantage, however, is the reduced storage space and network effort (by 24%). One exception is when the write interface is configured with large memory buffers (256MB). In this case, the overhead for detecting similarity leads to 25% lower observed bandwidth but similar achieved storage bandwidth. This is explained by the small checkpoint image size (280 MB on average), which allows storing nearly all the data in the write buffer and makes the observed application throughput solely dependent on memcopy performance (similarity detection is slowed down by the hashing overhead).

Figure 53. The observed application bandwidth (OAB, left plot) and the achieved storage bandwidth (ASB, right plot) for the sliding window with and without incremental checkpointing supported by fixed block compare-by-hash.

3.5.4.6 Stdchk Scalability

To assess the scalability and performance under heavy load, this section evaluates stdchk on two platforms: a commodity cluster and a Cray XT4 machine.

Cluster Evaluation. The evaluation set up stdchk with 20 benefactors and 7 clients on the 28-node testbed. Each client wrote 100 files of 100MB each, amounting to around 70 GB of data and 2800 manager transactions (four for each write operation). To ramp up the load, clients started at 10s intervals. Figure 54 presents the aggregate stdchk throughput. I observe a sustained peak throughput of about 280MB/s, limited, in fact, by the network configuration of the testbed. This demonstrates that stdchk is able to scale to match an intense workload.

Figure 54. stdchk throughput at larger scale: 7 clients generate a synthetic workload to stress a stdchk pool supported by 20 benefactor nodes.

Cray Machine Evaluation.
stdchk was deployed by a research group at Oak Ridge National Lab on the 'Smoky' large-scale cluster, a Cray XT4 machine with 80 machines (1280 compute cores), a 20Gb/s interconnect, and five 20Gb/s InfiniBand links to a Lustre storage system [6]. stdchk was deployed with 35 donor nodes, each donating 25GB of memory-based storage (875GB in total), to support a synthetic MPI application running on up to 560 cores, each core checkpointing 1.5GB of data (up to 840GB in total). Figure 55 presents the results comparing the clients' average throughput when checkpointing to stdchk and to Lustre. Two high-level patterns can be observed. First, for large-scale applications (employing more than 80 compute cores) stdchk achieves at least 3 times higher throughput than Lustre. This is mainly because stdchk efficiently aggregates memory-based storage and exploits the well-provisioned cluster interconnect. Second, stdchk throughput does not degrade as the number of clients increases; this is a testimony to stdchk's ability to scale to serve intense workloads spanning hundreds of nodes.

Figure 55. stdchk throughput at larger scale: up to 560 clients, running a synthetic MPI application, checkpoint to a stdchk pool supported by 35 donor nodes.

3.5.4.7 Putting Everything Together

To complement the synthetic performance benchmarks, this experiment evaluates stdchk integrated with a real-world application (BLAST) and compares checkpointing to local disk and to stdchk. BLAST was configured to checkpoint every 30 seconds. The stdchk testbed used four benefactors with 1Gbps network connections. Table 11 presents the execution time as well as the total amount of checkpoint data generated for both runs. The improvement in overall execution time is minor (1.3%) as the overall ratio of application execution time to checkpointing time is high. However, more importantly, the results presented in Table 11 show that stdchk speeds up the checkpointing operation itself by 27% and leads to a 69% reduction in storage space and network effort.

Table 11. The execution time and volume of generated data for the BLAST application checkpointing to local disk and to stdchk.
                          Local disk | stdchk  | Improvement
Total execution time (s)  462,141    | 455,894 | 1.3%
Checkpointing time (s)    22,733     | 16,497  | 27.0%
Data size (TB)            3.55       | 1.14    | 69.0%

3.5.5 Stdchk Related Work

To the best of my knowledge, there is no dedicated storage system for checkpointing either in a desktop grid or in a large-scale cluster environment. However, the following efforts are relevant to this work.

Support for checkpointing. Applications can use one of the following checkpointing solutions, each offering different transparency vs. performance tradeoffs. Application-level checkpointing offers no transparency, yet performance is high as the application programmer manages the process of collecting and saving the application state. Checkpointing libraries (e.g., BLCR [86], DejaVu [128]), often available on large-scale clusters, provide an increased level of transparency by checkpointing at the process level. Finally, system-level checkpointing offers complete transparency at the cost of ignoring application semantics, which can often be used to reduce storage requirements. None of these techniques entail storage system support and simply use the file system as is.

Workload-optimized storage systems.
Building storage systems geared for a particular class of I/O operations or for a specific access pattern is not uncommon. For example, the Google file system [79] optimizes for large datasets and append access; the Log-structured file system [126] optimizes for writes, arguing that most reads are served by ever increasing memory caches; BAD-FS [41] optimizes for batch job submission patterns; FreeLoader [152] optimizes for write-once read-many data access and exploits the locality of interest and sequential access patterns of scientific data analysis. Parallel file systems (Lustre [7], PVFS [49], GPFS [137]) also target large datasets and provide high I/O throughput for parallel applications. In a similar vein, a checkpoint optimized storage system can be geared towards write-intensive I/O operations. Relaxing POSIX semantics. Another related thread is the relaxation of POSIX semantics so that the file system can better cater to HPC I/O. In this vein, Lightweight File System [112] provides a small set of critical functionality that an I/O library can extend to build solutions for parallel applications. For stdchk, the design provides a POSIX interface for easy integration with applications, while at the same time building a high-performance storage system. Contributory storage. A number of storage systems [42, 47, 53, 152] aggregate space contributions from collaborating users to provide a shared data store. Their basic premise is the availability of a large amount of idle disk space on personal desktops that are online for the vast majority of the time. The specific technical solutions, however, vary widely as a result of different targeted deployment environments (local vs. wide-area networks), different workloads (e.g., unrestricted file-system workload vs. read-only workload vs. checkpointing workload), or different assumptions on user motivation to contribute storage (e.g., from systems that propose storage space bartering to motivate users to systems that assume collaborative users by default). 119  Versioned Storage. Several file systems save periodic snapshots of an entire file system to recover from accidental file deletions or changes. Examples include Plan 9 [119] or AFS [90] that use a single policy that guides file retention for the entire file system and the Elephant file system [133] that incorporates user-specified policies to determine which versions to retain. On the one side, the checkpoint scenario is more coarse-grained in that each checkpoint is written sequentially. The flip side is that copy-on-write techniques used by the aforementioned systems offer no gains when entire files are written sequentially. Low-bandwidth file systems. To reduce the amount of data sent to remote storage, LBFS [110] and JumboStore [68] detect similarity between file versions sent over the network by only transmitting the changed file ranges. The same approach was also used to detect copy right violations [48]. This is similar to utilities such as CVS and rsync that transmit deltas of files to bring server and user copies up to date. The evaluation results of this technique in the target setting show that its overhead is considerable. I have chosen to build stdchk by making use of two concepts from MosaStore design: storage aggregation using scavenging and striping. Based on these concepts, this project builds a sophisticated storage infrastructure geared towards distributed checkpointing. 
The design and implementation of the MosaStore storage system has been modified to incorporate new functionality conducive to checkpointing namely: file versioning, session semantics, and optimized write techniques that enable delegating to applications the control of the tradeoffs between data reliability and performance. 3.5.6 Summary This chapter presents the design and implementation of stdchk, a distributed checkpoint storage system targeting a desktop grid environment. This study presents empirical evaluation that supports the premise that checkpointing I/O is a write intensive operation, requiring novel storage solutions. stdchk aggregates storage space from LAN connected desktop machines to provide a traditional file system abstraction that facilitates easy integration with applications. stdchk offers several checkpoint-specific optimizations such as a suite of write protocols for high-speed I/O, support for data reliability, incremental checkpointing and lifetime management of checkpoint images. The prototype evaluation indicates that stdchk can offer an application perceived checkpoint I/O throughput as high as 110MB/s, which is significantly higher than what is feasible with current local I/O or 120  network file system based checkpointing. The proposed novel solution to exploit similarity between incremental checkpoint images results in significantly lower storage space and network effort requirements. In summary, the evaluation results indicate that a checkpoint storage based on space aggregation is viable for desktop grid applications. Further the results suggest that the proposed architecture is able to scale to support intense workloads spanning hundreds of clients. In summary, the evaluation results provide evidence that specializing storage system to checkpointing workloads can bring tangible performance gains and considerable storage space and network effort savings (25%-70% depending on the workload and checkpointing period) while still supporting POSIX API.  3.6 System Design This section discusses the workflow optimized storage system (WOSS) design requirements, presents the system design, the prototype implementation, and the WOSS integration effort with pyFlow and Swift – the workflow runtime engines. 3.6.1 Design Requirements To efficiently support the usage scenario targeted and the access patterns generated by workflow applications (section 3.3.1), WOSS needs to support the following requirements:   Extensibility. The storage system architecture should be modular and extensible. For instance, it should be easy for a developer to define a new data placement policy that associated with a new custom attribute.    Fine-grain (e.g., per-file or collection of files) runtime configurability: The storage system should provide per-file configuration at run time to support high-configurability for diverse applications access patterns. Further, the system should support defining a group of files and supporting per-group optimizations (e.g. collocation).    Deployable as an intermediate, temporary storage that aggregates (some of) the resources allocated to the batch application. This will not only avoid potential backend storage performance and scalability bottlenecks, but will also enable location-aware scheduling as computation can be collocated with data. The storage system should be easy to deploy during the application’s start-up. 
Further, ideally it should be transparently interposed between the application and the backend storage for automatic data pre-fetching or storing persistent data or results. 121    System-level configurability: The storage system should provide system-wide configuration knobs to support configurability for diverse applications. The system should be tunable for a specific application workload and deployment. This includes ability to control local resource usage, in addition to controlling application-level storage system semantics, such as data consistency and reliability.    POSIX compatible, to facilitate access to the intermediate storage space, without requiring modifications to applications.    Support for chunking. To support large files that do not fit in a single machine storage space, and to enable optimizations for scatter and gather patterns, the storage system should support dividing and storing a single file into multiple chunks.  3.6.2 Storage System Design The system prototype is based on a traditional object-based distributed storage system architecture, with three main components (Figure 56): a centralized metadata manager, the storage nodes, and the client’s system access interface (SAI) which provides the client-side POSIX file system interface. Each file is divided into fixed-size blocks that are stored on the storage nodes. The metadata manager maintains a block-map for each file. Table 12 lists the optimizations the current prototype implements and their associated hints/tags (POSIX extended attributes).  Figure 56.The main components of a distributed storage system (also used by WOSS). Design for extensibility. The distributed nature of the storage system makes providing hinttriggered optimizations challenging: while the hints (i.e., files’ extended attributes) are maintained by the manager, the functionality corresponding to the various optimizations can reside at the manager (e.g., for data placement), client SAI (e.g., for caching), or storage 122  nodes (e.g., for replication). Additionally this work aims for a flexible design that supports exploration and facilitates adding new custom metadata and their corresponding functionality. To this end, three main design decisions enable extensibility:   A generic hint propagation approach that extends every message/request with optimization hints (i.e., it enables tagging communication messages) to enable propagating hints between the components (manager, storage node, and SAI). These per-message hints enable end-to-end information passing and optimized handling of every message/request across components. In the design, file-related operations are always initiated by the client SAI (e.g. space allocation, data replication request, or checking file attributes,). The first time an application opens a file or gets the file attributes, the SAI queries the metadata manager and caches the file’s extended attributes (that carry the application hints). The SAI tags all subsequent internal inter-component communication related to that file (e.g., a space allocation, a request to store a data block) with the file’s extended attributes and the callbacks that may be deployed at each component are triggered by these tags to implement the hint-directed optimizations.    Extensible storage system components design. 
All storage system components follow a ‘dispatcher’ design pattern (Figure 57): all received requests are processed by the dispatcher and based on the requested operation and the associated hints (i.e., tags) the request/message maybe forwarded to the specific optimization module associated with the hint type, or processed using a default implementation.  To extend the system with a new optimization for a specific operation (e.g., space allocation, replication, read, write …etc), the developer needs to decide the application hint (key-value pair) that will trigger the optimization, and implement the callback function the dispatcher will call when an operation on file with the associated hint is issued. Every optimization module can access the storage component’s internal information including reading the manager metadata or system status (e.g. storage nodes status) through a well-defined API, or, accessing the blocks stored at the storage nodes.   Passing hints bottom-up: an extensible information retrieval design. To communicate a 123  storage hint to the application the metadata manager provides an extensible information retrieval module (Figure 57). This module is integrated with the dispatcher described in previous point as it is only triggered by the client POSIX ‘get extended attribute’ operation. Similar to other optimizations, to extend the system to expose specific internal state information the developer needs to decide the application hint/tag (keyvalue pair) that will trigger the optimization. The module, as all other optimization modules, has access to the manager metadata and system status information, and is able to extract and return to the client any internal information. 3.6.3 Prototype Implementation Details The prototype implementation is based on MosaStore (http://mosastore.net). The prototype changes MosaStore design and implementation to follow the extensible storage system design as described above. Similar to MosaStore, the WOSS SAI uses FUSE [2] kernel module to provide the POSIX file system interface.  Figure 57. WOSS metadata manager design. For clarity, the figure shows WOSS integration with a workflow runtime engine (Scheduler) and details WOSS metadata manager. The figure shows: (i) in solid lines, the path followed by a client chunk allocation request: the request is processed by a pattern-specific data placement ‘DP’ module based on the corresponding file tags/hints, (ii) the data path as data is produced by the clients (the solid lines going to storage nodes), and, (iii) the path of a request to retrieve file location information (dashed lines). The following is a highlight of a number of implementation details:   Replication operations are carried by the storage nodes. Their design adopts a similar dispatcher architecture to enable multiple replication policies. In the current 124  implementation the application can select the replication policy and the number of replicas. The current implementation implements two replication policies: eager parallel replication (to replicate hot spot files as used in the broadcast pattern) and lazy chained replication (to achieve data reliability without increasing system overhead).   Exposing data location: To expose files location the system defines a reserved extended attribute that has values for every file in the system (“location”). An application (in this case the workflow runtime) can ‘get’ the “location” extended attribute to obtain the set of storage nodes holding the file.  Prototype limitations. 
Prototype limitations. The prototype has three main limitations (all of them the result of implementation decisions that enabled faster development, and all can be addressed with more resources). First, the prototype uses the FUSE kernel module. While FUSE simplifies the SAI development, it adds overhead to every file system call. Second, the data placement tags are effective only at file creation; changing the data placement tag of an existing file will not change the file layout. Finally, the system prototype uses a centralized metadata manager. While this introduces a potential bottleneck at scale, my experience is that the bottleneck that limited overall system performance lay with the workflow runtime engine.

Table 12. Implemented metadata attributes (hints) and the corresponding optimizations.
Pattern and associated metadata hint | Description
Pipeline pattern: set("DP", "local") | Indicates a preference to allocate the file's blocks on the local storage node.
Reduce pattern / collocation: set("DP", "collocation | <group_name>") | Preference to allocate the blocks of all files within the same <group_name> on the same storage node.
Scatter pattern: set("DP", "scatter | <scatterSize>") | Place every group of <scatterSize> contiguous chunks on one storage node, in a round-robin fashion.
Broadcast pattern / replication: set("Replication", <repNum>) | Replicate the blocks of the file <repNum> times.
Replication semantics: set("RepSmntc", "Optimistic"/"Pessimistic") | Indicates which replication semantics to use for the file: optimistic returns to the application after the first replica is created; pessimistic returns to the application only after a chunk is fully replicated.
Location: get("location", null) | Retrieves the location information of the specified file.
Per-file cache size: set("CacheSize", <size>) | Suggests a cache size for the file (e.g., a small cache size for read-once or small files).

3.6.4 Integration with a Workflow Runtime System
To demonstrate the end-to-end benefits of the proposed approach, the project integrates the WOSS prototype with Swift, a popular workflow language and runtime system [154], and with pyFlow, a similar yet much simpler system developed at NetSysLab, UBC. In particular, the following two modifications are applied to each of these systems (note that the integration does not require any modification to the application tasks):
 Adding location-aware scheduling. The current implementations did not provide location-aware scheduling. This required modifying the schedulers to first query the metadata manager for location information and then attempt to schedule each task on the node holding its input file. I note that the scheduling heuristics used are relatively naïve; consequently, further performance gains are expected with better heuristics, and, in this respect, the experiments provide a lower bound on the achievable performance gains.
 Passing hints to indicate the data access patterns. Information on the data access patterns is crucial to enable the storage system to optimize. The evaluation assumes that the workflow runtime engine performs the task of determining the data access pattern, as this seems the most direct approach to obtain this information: the runtime engine has access to the workflow definition, maintains the data dependency graph, and uses them to schedule computations. Thus, it already has the information needed to infer the usage patterns and the lifetime of each file involved, and it can make computation placement decisions as well. Changing the workflow runtime implementation to automatically extract this information, however, is a significant development task (and not directly connected with the thesis put forward here). Thus, the evaluation takes a simpler approach: inspect the workflow definitions of the applications used in the evaluation and explicitly add the instructions that indicate the data access hints (a sketch of such tagging instructions follows).
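To illustrate what these added tagging instructions look like from the application side, the sketch below uses the POSIX extended-attribute calls available in Python. The hint keys and value formats follow Table 12; the mount point, the file names, and the "user." attribute namespace prefix are assumptions rather than details taken from the prototype.

```python
# Illustrative tagging instructions (not taken from the Swift/pyFlow integrations).
# Hint keys follow Table 12; the mount point, file names, and the "user." xattr
# namespace prefix are assumptions.
import os

MOUNT = "/woss"   # hypothetical WOSS (FUSE) mount point

def tag(path: str, key: str, value: str) -> None:
    """Attach an application hint to a file as a POSIX extended attribute."""
    os.setxattr(path, f"user.{key}", value.encode())

# Broadcast pattern: replicate a hot input file (e.g., a reference database).
tag(f"{MOUNT}/db/reference.db", "Replication", "8")

# Reduce pattern: collocate the partial results that a single merge task will read.
for i in range(19):
    tag(f"{MOUNT}/partial/part-{i}.out", "DP", "collocation | merge-group")

# Pipeline pattern: keep a stage's output on the node where it is produced.
# (In the prototype, placement tags take effect only at file creation time.)
tag(f"{MOUNT}/tmp/stage1.out", "DP", "local")

def locate(path: str) -> list:
    """Bottom-up hint: ask the storage system which nodes hold the file."""
    raw = os.getxattr(path, "user.location")
    return raw.decode().split(",")   # assumed encoding: comma-separated node names
```

Because the calls are ordinary extended-attribute operations, the same instructions can be issued from shell scripts (setfattr/getfattr), from a workflow runtime, or from the application itself, which is what makes the mechanism backward compatible.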
Implementation limitation: As one of the experiments highlights, the proposed approach for integrating location-aware scheduling with Swift adds a significant overhead. This, for some scenarios at scale, eliminates the performance gains brought by the implemented optimizations. The problem is that, to limit the changes made to the Swift code (which itself is in a major transition between Swift/K and Swift/T), every set-tag or get-location operation is implemented as a Swift task which, in turn, calls the corresponding POSIX command.

3.7 Evaluation
The evaluation uses a set of synthetic benchmarks and three real applications to evaluate the performance benefits of the proposed approach. To this end, the evaluation compares the proposed workflow-optimized storage system (labeled WOSS in the plots) with two baselines. First, MosaStore without any cross-layer optimizations, as an intermediate storage scenario (these experiments are labelled DSS, from 'distributed storage system', to highlight that this is the performance expected from a traditional object-based distributed storage system design). Since this setup is similar in terms of architecture and design to other cluster storage systems, such as Lustre and PVFS [49], this comparison gives a rough estimate of the potential performance gains enabled by WOSS compared to these more mature systems. Second, a typical backend persistent storage system deployment (e.g., GPFS or NFS), as available on clusters and supercomputers, serves as another baseline. The reason for this additional baseline is to estimate the gains brought by the intermediate storage scenario and, additionally, to show that DSS is configured for good performance. Finally, where possible, a third baseline is used to expose the optimal performance achievable on the hardware setup.
To demonstrate that the proposed approach and implementation are application-, workflow engine-, and platform-agnostic and bring performance improvements in multiple setups, the synthetic benchmarks are implemented solely using shell scripts and ssh (secure shell), while the real applications use two workflow execution engines: pyFlow or Swift. These schedule the workflow tasks, allocating ready tasks to idle nodes according to the location information exposed by the storage system. Similarly, the shell scripts also query the storage system before launching a script on a specific machine. Finally, the evaluation uses multiple platforms, including a 20-node cluster and a BG/P supercomputer.
The rest of this section presents the testbeds, the benchmarks and real applications, the optimizations they use, and the experiment results. The results are based on at least 5 executions for each scenario.

3.7.1 Testbeds
Most of the experiments use the NetSysLab cluster with 20 machines. Each machine has an Intel Xeon E5345 4-core, 2.33-GHz CPU, 4-GB RAM, a 1-Gbps NIC, and a RAID-1 array on two 300-GB 7200-rpm SATA disks.
The system has an additional NFS server as a backend storage solution; it runs on a better provisioned machine with an Intel Xeon E5345 8-core, 2.33-GHz CPU, 8-GB RAM, a 1-Gbps NIC, and six SATA disks in a RAID-5 configuration. The NFS server provides backend storage to the applications. When evaluating the DSS or WOSS systems, one node runs the metadata manager and the coordination scripts, and the other 19 run the storage nodes (deployed over the local spinning disk or over a RAM disk), the client SAI, and the application scripts.
For one experiment the evaluation uses one rack of a larger IBM BlueGene/P machine (with an 850 MHz quad-core processor and 2GB RAM per node). The machine uses GPFS [137] as a backend storage system with 24 I/O servers (each with 20Gbps network connectivity). The compute nodes have no hard disks and mount a RAM disk. Details of the architecture can be found in [147].

3.7.2 Synthetic Benchmarks
Synthetic benchmarks provide relatively simple scenarios that highlight the potential impact of cross-layer optimizations on an intermediate storage scenario for each of the patterns described (Figure 38, repeated here as Figure 58 for ease of reference).
Staging-in/out: Current workflow systems generally use an intermediate storage scenario: they stage-in input data from a backend store (e.g., GPFS) to the intermediate shared storage space, process the data in this shared space, and then stage-out the results to persist them on the backend system. Overall, WOSS and DSS perform faster than NFS for the staging time and, although this section does not target evaluating staging, it reports stage-in/out times for each benchmark separately from the workflow time. Note that adding the staging to the benchmark is a conservative approach since these patterns often appear in the middle of a workflow application and, in those scenarios, staging would not affect them.

Figure 58. Pipeline, broadcast, reduce, and scatter synthetic benchmarks. Nodes represent workflow stages (or stage-in/out operations) and arrows represent data transfers through files. Labels on the arrows represent file sizes. Horizontal dashed lines represent the boundary crossing between backend and intermediate storage (e.g., stage-in reads a file from the backend and writes to the intermediate storage).

Table 13. File sizes (input, intermediate, output) for the different workflow patterns.
Dataflow pattern | Small | Medium | Large
Pipeline | 100KB, 200KB, 10KB | 100MB, 200MB, 1MB | 1GB, 2GB, 10MB
Broadcast | 100KB, 100KB, 1KB | 100MB, 100MB, 1MB | 1GB, 1GB, 10MB
Reduce | 10KB, 10KB, 200KB | 10MB, 10MB, 200MB | 100MB, 100MB, 2GB
Scatter | 100KB, 190KB, 1KB | 100MB, 190MB, 1MB | 1GB, 1900MB, 10MB

The benchmarks are designed as follows:
Pipeline benchmark. Each pipeline stages-in a common input file from the shared backend storage (i.e., the NFS server), goes through three processing stages using the intermediate storage, and then the final output is staged out back to the backend. The script tags the output files produced by a pipeline stage with a 'local' tag. This tag informs the storage system to prioritize storing the output files produced by a task on the node where the task producing the file runs. The storage system exposes the location of the file so the benchmark script can launch the next stage of the pipeline on the machine that already stores the file.
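For concreteness, the logic of such a driver might look as follows. The actual benchmarks are plain shell scripts; this Python transliteration is only a sketch, and the host names, stage executables (stage1, stage2), mount point, and "user." attribute prefix are assumptions.

```python
# Illustrative sketch of the pipeline driver logic (the real benchmarks use shell
# scripts and ssh). Paths, host names, stage executables, and the "user." xattr
# prefix are assumptions; error handling is omitted.
import os
import subprocess

MOUNT = "/woss"                      # hypothetical WOSS mount point

def location(path: str) -> str:
    """Return one node that stores the file, as exposed by the storage system."""
    nodes = os.getxattr(path, "user.location").decode().split(",")
    return nodes[0]

def run_stage(host: str, cmd: str) -> None:
    """Launch a pipeline stage on a specific node over ssh."""
    subprocess.run(["ssh", host, cmd], check=True)

# Stage 1 runs anywhere; its output is tagged 'local' (before the stage writes its
# data) so the blocks stay on the node that produces them.
stage1_out = f"{MOUNT}/pipe/stage1.out"
run_stage("node01", f"touch {stage1_out}")
os.setxattr(stage1_out, "user.DP", b"local")
run_stage("node01", f"stage1 {MOUNT}/pipe/input.dat {stage1_out}")

# Stage 2 is scheduled on whichever node the storage reports as holding stage1_out.
stage2_out = f"{MOUNT}/pipe/stage2.out"
host = location(stage1_out)
run_stage(host, f"touch {stage2_out}")
os.setxattr(stage2_out, "user.DP", b"local")
run_stage(host, f"stage2 {stage1_out} {stage2_out}")
```

The reduce, broadcast, and scatter benchmarks described next follow the same structure, differing only in which hints from Table 12 they set and in how many tasks consume each file.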
Reduce benchmark. 19 files are staged from NFS into the intermediate storage; 19 processes run in parallel, one per machine, each consuming one input file and producing one file that is tagged with 'collocation'. These files are then consumed by a single workflow stage that writes a single output file, which is then staged out to NFS. The storage system prioritizes storing the tagged files on a single node and exposes data location, and the benchmark script uses this information to execute the reduce task on this node, avoiding the overhead of moving data.
Broadcast benchmark. A single file is staged from NFS into the intermediate storage. Then, a workflow stage produces a file in intermediate storage, which is consumed by 19 processes running in parallel, one per machine. When this file is created, the storage system eagerly creates (i.e., while each block is written) the number of replicas specified by the replication tag. When the nodes process the input file, they randomly select a replica to read from (giving preference to local blocks if available), avoiding a scenario where a single storage node becomes a bottleneck. Each task independently produces an output file and, finally, these output files are staged out to the backend storage.
Scatter benchmark. An input file is staged-in to the intermediate storage from NFS. The first stage of the workflow has one task that reads the input file and produces a scatter-file on the intermediate storage. In the second stage, 19 processes run in parallel on different machines. Each process reads a disjoint region of the scatter-file and produces an output file. A tag specifies the block size to match the size of the application's disjoint region (i.e., the region of the file that will be read by one process). Exposing fine-grained block location information enables scheduling the processes on the nodes that hold the corresponding blocks. Finally, at the stage-out phase, the 19 output files are copied to the backend storage.
Statistical analysis. The results report the average execution time and standard deviation (as error bars) of at least 12 runs (a large enough sample to guarantee a 95% confidence level for the average). When reporting the speedup of the workflow-aware solution over alternatives, the speedup is based on the average execution time, and the p-value [28] at the 95% confidence level is reported.
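The dissertation does not spell out the exact statistical procedure behind these p-values; the snippet below shows one common way such numbers are produced (Welch's t-test over the per-run execution times). It is purely illustrative and uses made-up sample values.

```python
# Illustrative only: averages, standard deviations, and a p-value at the 95%
# confidence level, computed with Welch's t-test over hypothetical per-run times.
from statistics import mean, stdev
from scipy import stats

woss_runs = [61.2, 60.8, 62.1, 61.5, 61.9]   # hypothetical execution times (s)
dss_runs  = [66.4, 65.9, 66.8, 66.0, 66.3]

speedup = mean(dss_runs) / mean(woss_runs)
t_stat, p_value = stats.ttest_ind(woss_runs, dss_runs, equal_var=False)

print(f"WOSS: {mean(woss_runs):.1f}s +/- {stdev(woss_runs):.1f}s")
print(f"DSS:  {mean(dss_runs):.1f}s +/- {stdev(dss_runs):.1f}s")
print(f"speedup = {speedup:.2f}x, p = {p_value:.3f} (significant at 95% if p < 0.05)")
```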
Figure 59. Average time for the pipeline benchmark.
Figure 60. Average times for the broadcast pattern.
Figure 61. Average runtime for the reduce benchmark.
Figure 62. Average runtime for stage 2 of the scatter benchmark.

Results. Figure 59, Figure 60, Figure 61, and Figure 62 present the average benchmark runtime and standard deviation (over 20 runs) for five different intermediate storage system setups: (1) NFS; (2) two setups for DSS, labeled 'DSS-RAM' or 'DSS-DISK' depending on whether the storage nodes are backed by RAM-disk or spinning disks; and (3) two setups for WOSS, labeled 'WOSS-RAM' or 'WOSS-DISK'. A sixth configuration, a local file system based on RAM-disk, is included in the pipeline benchmark and represents the best possible performance. For broadcast, the evaluation presents only the disk-based configurations. Overall, WOSS exhibits the highest performance among the systems for all benchmarks, showing that the overhead introduced by tagging and handling optimizations is paid off by the performance improvements. Another advantage of WOSS is lower variance in the results, since it depends less on the network. RAM-disk-based configurations also perform faster and with less variability than their spinning-disk counterparts, as expected.
Locality in the pipeline scenario was the optimization that provided the best improvements. In this case, WOSS is 10x faster than NFS, 2x faster than DSS, and similar to local (the best possible scenario). In fact, t-tests show that the performance of WOSS is close to that of local storage (within less than 3% on average, p-value 0.035).
Staging and file creation for the scatter benchmark account for a significant share (70-90%) of the benchmark time and, thus, for clarity of presentation, the plot focuses only on the workflow stage that is affected by the optimization. Following the same trend as the pipeline benchmark, scatter is 6x faster than NFS. WOSS, however, does not bring significant gains over DSS, performing similarly (with t-tests giving p-values greater than 0.05).
For the reduce benchmark, WOSS achieves a 4.5x speedup compared to NFS and 1.5x compared to DSS.
The broadcast benchmark presents a more interesting case: tagging for replication provides finer tuning (the number of replicas) than the other techniques, which rely on simply turning an optimization on or off (e.g., locality and collocation). The number of replicas enables trading off the replication overhead against higher file read throughput. Figure 60 presents the performance for this benchmark at the best-performing setting (8 replicas). This result matches the expectation of the potential benefits of the WOSS approach. For more replicas than optimal, the overhead of replication is higher than the gains. A simple heuristic (e.g., creating one replica for every 2-3 readers) can guide this optimization and provide most of the performance gains.

Table 14. Generated network traffic (GB).
Benchmark | NFS | DSS | WOSS
Pipeline | 18.60 | 17.60 | 1.86
Broadcast | 2.87 | 2.03 | 1.91
Reduce | 1.13 | 1.07 | 0.36
Scatter | 0.60 | 0.57 | 0.27

Table 14 shows the total network traffic generated by each of the benchmarks. WOSS brings significant savings in network overhead. While WOSS almost eliminates the network overhead for the pipeline benchmark, it saves 87% and 60% of the network traffic generated by the reduce and scatter patterns, respectively. Surprisingly, WOSS brings network savings (a 7% reduction) even in the broadcast pattern, which creates 8 replicas of the intermediary file. The extra network overhead of replication is paid for by more local accesses at the nodes storing the file.
In addition to using the workload presented in Figure 58, we also executed the benchmarks with data sizes scaled up (10x) and scaled down (1000x). The larger workload produced results similar to the ones presented in this section. The smaller one did not show a significant difference among the storage systems (less than 10%, on the order of milliseconds), with DSS performing faster than WOSS in some cases since the overhead of adding tags and handling optimizations did not pay off for such small files.

3.7.3 Simple Real Applications: BLAST, modFTDock
This section evaluates the reduction in execution time that WOSS enables for two relatively simple real-world workflow applications:
 modFTDock [8], a protein docking application. The modFTDock workflow has three stages with different data flow patterns (Figure 63): the dock stage (broadcast pattern) verifies the similarity of molecules against a database, merge (reduce) summarizes the results for each molecule, and score (pipeline) produces a ranking for the molecules.
 BLAST [34], a DNA search tool for finding similarities between DNA sequences. Each node receives a set of DNA sequences as input (a file for each node) and all nodes search the same database file; i.e., BLAST exhibits the broadcast pattern, as the database has to be available on each application node for the search (Figure 66).
Integration with workflow runtimes. The modFTDock evaluation uses Swift to drive the workflow: Swift schedules each application stage and tags the files according to the workflow pattern. Since modFTDock combines the broadcast, reduce, and pipeline patterns, Swift tags the database to be replicated (broadcast pattern), and the output of every dock stage is collocated on a single storage node that will execute the merge stage (reduce). The merge output is tagged to be placed on the local storage node in order to execute the score stage on the same machine (pipeline pattern). The labels on the arrows in Figure 63 indicate the hints used.
The BLAST evaluation uses shell scripting: the script that launches the BLAST experiment tags the database file with a specific replication level and tags the input file containing the DNA sequences as 'local'. The labels on the arrows in Figure 66 indicate the hints used.

Figure 63. modFTDock workflow. Labels on arrows represent the tags used.
Figure 64. Average runtime for modFTDock on the cluster. The experiment runs 9 pipelines in parallel using 18 nodes.

modFTDock experiments on the cluster. 9 dock streams progress in parallel and process the input files (100-200KB) and a database (100-200KB). The storage nodes are mounted on RAM-disks. Figure 64 presents the total execution time for the entire workflow, including stage-in and stage-out times, for DSS and WOSS. WOSS optimizations enable a faster execution: modFTDock/Swift is 20% faster when running on WOSS than on DSS, and more than 2x faster than when running on NFS.
modFTDock experiments on BG/P. This experiment ran modFTDock at larger scale on BG/P to verify scalability and to explore whether the performance gains are preserved when compared to a much more powerful backend storage system (GPFS) available on this platform. On the one hand, one can notice a consistent 20-40% performance gain of DSS over GPFS. On the other hand, the experiments were not able to show positive results for WOSS: application runtime is significantly longer than when using DSS. However, after careful examination, the performance loss is attributed to Swift runtime overheads rather than to WOSS overheads.

Figure 65. modFTDock runtime on BG/P while varying the number of nodes allocated to the application. The workload size increases proportionally with the resource pool.
Figure 66. BLAST workflow. The BLAST database (1.8GB) is used by all nodes, which search different queries in parallel. Labels on arrows represent the tags used to hint the data usage patterns.

Table 15. Average BLAST execution time (in seconds) for NFS, DSS, and various replication levels controlled in WOSS.
Phase | NFS | DSS | WOSS (r=2) | WOSS (r=4) | WOSS (r=8) | WOSS (r=16)
Stage-in | 49 | 17 | 19 | 29 | 36 | 55
90% of workflow tasks | 264 | 185 | 164 | 155 | 151 | 145
All tasks finished | 269 | 207 | 173 | 165 | 162 | 164
Stage-out | 1 | 1.2 | 1.3 | 1.2 | 1.1 | 1.1
Total | 320 | 226 | 193 | 191 | 200 | 221

BLAST experiments on the cluster. 19 processes launch 38 DNA queries against the database independently and write results to the backend storage. The evaluation reports results from experiments that use a 1.7GB database. Output file sizes are 29 to 604KB. Table 15 presents the runtime breakdown for the BLAST workflow.
The proposed approach does offer performance gains compared to NFS (up to 40% better performance) as well as compared to DSS (up to 15% better performance).

3.7.4 A Complex Workflow: Montage
The previous section demonstrated that the performance improvements highlighted by the synthetic benchmarks still hold for simple real workflow applications. This section evaluates the WOSS performance using a significantly more complex workflow application (Figure 67), Montage [97], with two goals in mind:
 Evaluate the performance gains the cross-layer optimization approach brings to a real, complex application.
 Understand in detail the overhead/performance-gain balance of the WOSS techniques (i.e., tagging, getting location information, location-aware scheduling); this analysis appears in the next subsection.
Montage [97] is an astronomy application that builds image mosaics from a number of independent images (e.g., smaller images, or images captured at different wavelengths) taken by telescopes. The Montage workflow is composed of 10 different processing stages with varying characteristics (Table 16). The workflow uses the reduce pattern in several stages and the pipeline pattern in 4 stages (labeled in Figure 67).

Figure 67. Montage workflow. The characteristics of each stage are described in Table 16. Labels on arrows represent the tags used to hint the data usage patterns.

The I/O communication intensity between workflow stages is highly variable (presented in Table 16 for the workload used). The workflow uses the pyFlow framework. Overall, the workflow generates over 650 files with sizes from 1KB to over 100MB, and about 2GB of data are read/written from storage.

Table 16. The characteristics of each stage of the Montage workflow.
Stage | Data | #files | File size | Optimization
stageIn | 109 MB | 57 | 1.7 MB - 2.1 MB | -
mProject | 438 MB | 113 | 3.3 MB - 4.2 MB | Yes
mImgTbl | 17 KB | 1 | - | -
mOverlaps | 17 KB | 1 | - | -
mDiff | 148 MB | 285 | 100 KB - 3 MB | Yes
mFitPlane | 576 KB | 142 | 4.0 KB | Yes
mConcatFit | 16 KB | 1 | - | -
mBgModel | 2 KB | 1 | - | -
mBackground | 438 MB | 113 | 3.3 MB - 4.2 MB | Yes
mAdd | 330 MB | 2 | 165 MB | Yes
mJPEG | 4.7 MB | 1 | 4.7 MB | Yes
stageOut | 170 MB | 2 | 170 MB | Yes

Figure 68 shows the total execution time of the Montage workflow in five configurations: over NFS, and with DSS and WOSS deployed over the spinning disks or the RAM-disks of the local nodes. The WOSS system achieves the highest performance whether deployed on disk or on RAM-disk. When deployed on disk, WOSS achieves a 20% performance gain compared to NFS. Further, WOSS achieves up to a 10% performance gain compared to DSS when deployed on disk or RAM-disk.

Figure 68. Montage workflow execution time (averages over four experiments). Note that, to better highlight the differences, the y-axis does not start at zero.

3.7.5 Exploring WOSS Overheads/Gains
Enabling cross-layer optimization in WOSS requires a set of operations: forking a process to add a file's extended attributes (labeled 'fork' in Table 17), adding the extended attributes to files ('tagging'), getting the location of each file ('get location'), and performing location-aware scheduling ('location-aware scheduling'). All these steps add overhead. To guide future optimizations, the evaluation ran the same Montage workload as in the previous section, configured to expose the overhead of each of these steps. For instance, to expose only the overhead of tagging, the experiment tags the files, in all the setups in Table 17 except WOSS, with a random tag that adds the overhead without triggering any optimization.
Table 17 shows the average execution time of the Montage workflow under these conditions. The results suggest that the steps described above add significant overhead (up to 7%). A closer look reveals that the tagging operation is the main contributor to this overhead, for two reasons: first, every tagging operation incurs a round trip to the manager; second, the current manager implementation serializes all 'set-attribute' calls, which adds significant delay considering that the Montage workflow produces and tags over 660 files in every run. The evaluation highlights that optimizing the 'set-attribute' operation (by caching and by increasing the parallelism of the manager implementation) can bring significant additional (up to 7%) performance gains.

Table 17. WOSS microbenchmark.
Experiment setup | Total time (s)
DSS | 66.2
DSS + fork | 67.1
DSS + fork + tagging | 69.5
DSS + fork + tagging + get location | 70.0
DSS + fork + tagging + get location + location-aware scheduling (on useless tags) | 70.7
WOSS | 61.9
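As an illustration of the optimization direction suggested above, the following sketch shows one possible client-side realization: a small write-back cache that coalesces tag updates and ships them to the manager in batches, trading a short delay in hint visibility for far fewer round trips. This is not part of the WOSS prototype, and the manager stub and its batch_set_attributes() call are hypothetical; the actual fix may instead (or additionally) increase the manager's internal parallelism.

```python
# Purely illustrative: a client-side write-back cache that batches 'set-attribute'
# calls. The manager object and batch_set_attributes() are hypothetical stubs.
import threading
from collections import defaultdict

class TagWriteBackCache:
    def __init__(self, manager, flush_interval=0.1):
        self.manager = manager                  # hypothetical manager RPC stub
        self.pending = defaultdict(dict)        # path -> {hint key: value}
        self.lock = threading.Lock()
        self.flush_interval = flush_interval
        self._schedule_flush()

    def set_attribute(self, path, key, value):
        """Record the tag locally; the manager sees it on the next flush."""
        with self.lock:
            self.pending[path][key] = value

    def _schedule_flush(self):
        timer = threading.Timer(self.flush_interval, self._flush)
        timer.daemon = True
        timer.start()

    def _flush(self):
        with self.lock:
            batch, self.pending = dict(self.pending), defaultdict(dict)
        if batch:
            # One round trip for many files instead of one per set-attribute call.
            self.manager.batch_set_attributes(batch)
        self._schedule_flush()
```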
3.8 Related Work
A number of alternative approaches have been proposed to alleviate the storage bottleneck for workflow applications. Taken in isolation, these efforts do not fully address the problem this work faces, as they are either too specific to a class of applications, or they enable optimizations system-wide and throughout the application runtime, thus inefficiently supporting applications that have different usage patterns for different files. The proposed solution integrates lessons from this past work and demonstrates that it is feasible to provide runtime storage optimizations per data item.
Application-optimized storage systems. Building storage systems geared for a particular class of I/O operations or for a specific access pattern is not uncommon. For example, the Google file system [79] optimizes for large datasets and append access; HDFS [141] and GPFS-SNC [82] optimize for immutable data sets, location-aware scheduling, and rack-aware fault tolerance; the log-structured file system [126] optimizes for write-intensive workloads, arguing that most reads are served by ever-increasing memory caches; finally, BAD-FS [41] optimizes for batch job submission patterns. These storage systems, and the many others that take a similar approach, are optimized for one specific access pattern and consequently are inefficient when different data objects have different patterns, as is the case for workflows.
Highly Configurable Storage Systems. A few storage systems are designed to be highly configurable and thus, through deployment-time or run-time (re)configuration, to efficiently serve a wider set of applications. In this vein, UrsaMinor [17] is a versatile storage system that allows the administrator to change its configuration at runtime to better match the application's access patterns and reliability requirements. While these storage systems can be configured to better support a range of applications, they are not designed to support workflows with different access patterns for different files, they do not present an extensible storage system architecture, nor do they propose an approach for optimizing the system based on application hints.
Custom metadata in storage systems. A number of systems propose mechanisms to efficiently support custom metadata operations, including MetaFS [16], Haystack [18], the Linking File System (LiFS) [36], and faceted search [95].
These systems extend the traditional file system interface with a metadata interface that allows applications to create arbitrary metadata. These efforts provide applications the functionality of annotating files with arbitrary <key, value> pairs and/or to express relationships among files. Similarly, Graffiti [102] is a middleware that allows tagging and sharing of tags between desktop file systems. As other systems which aim to provide a metadata interface, it supports tags and links between files, but focuses on sharing-related issues. All these solutions essentially use metadata to communicate between applications. Their main focus is to providing better search, navigability and data organization capabilities at the application layer with data produced by the applications themselves. Dealing with a constraining storage system interface. Two solutions are generally adopted to pass hints from applications to the storage system: either giving up the POSIX interface for a custom interface, or, alternatively, extending the POSIX interface with an orthogonal additional ad-hoc interface for hint passing. Most storage systems that operate in the HPC space (pNFS, PVFS, GPFS, Lustre) fall in the latter category and add ad-hoc hint passing interfaces; while most Internet services/cloud storage systems fall (e.g., HDFS) fall in the former category. In terms of exposing data location the situation is similar: HDFS and other non-POSIX systems do expose data location to applications while most parallel largescale file systems (e.g., pNFS, PVFS, GPFS) do not expose it (even though this information may be available at the storage system client module) [146]. Further, note that these systems cannot support cross layer optimizations as their design does not support per-file optimizations, does not have mechanisms to enable/disable optimizations based on application triggers, nor allows extending the system with new optimizations. 140  Storage system optimizations using application provided hints. A number of projects propose exploiting application information to optimize the storage system operations. Mesnier et. al. [105] propose changes to the storage system block API to classify storage blocks into different classes (metadata, journal, small/large file), allowing the storage system to apply per class QoS polices. Collaborative caching [51] proposes changing the storage system API to facilitate passing hints from clients to server to inform the server cache mechanism. Finally, Patterson et. al. [116] propose an encoding approach to list the blocks an application access, and to use the IO control interface (ioctl) to pass this list to the storage system. This access list is used to optimize the storage system caching and prefetching. For example eHiTS [73] and GreenStor [103] storage systems propose building energy optimized distributed storage system that use application hints to facilitate turning off or idling storage devices holding data blocks that will not be accessed in the near future. BitDew [70], a programming framework for desktop grids, enables users to provide hints to the underlying data management system. The five supported hints are: replication level, resilience to faults, transfer protocol, data lifetime, and data affinity (used to group files together). In this same vein, UrsaMinor [17], an object-based storage system, allows the system admin or the application, through a special API, to configure the storage system operations for its data objects. 
For each object, the system facilitates configuring the reliability mechanism (replication or erasure coding) and the fault and timing model. Through this specialization the system better meets application requirements in terms of throughput and reliability. Finally, the XAM [11] standard defines an extended API and access method for storage systems serving mostly immutable data (e.g., backup systems, email servers). It allows the programmer to better describe the data through an extended metadata interface. Further, it allows the programmer to inform the storage system, through special metadata fields, of how long to retain the data.
Table 18 compares this work to the related production and research projects. These efforts differ from the proposed approach in three main ways. First, most of them target a specific optimization and do not build an extensible storage system that can be extended with new optimizations. Second, they propose uni-directional hint passing, from the application to the storage. Third, and most importantly, they either propose changes to the standard APIs to pass hints, or use the current API (ioctl) in a non-portable, non-standard way; this hinders the adoptability and portability of these approaches. Finally, no other project (Table 18) provides a bidirectional communication mechanism and an extensible storage system design while using a standard API. Note that all the optimizations proposed in the surveyed projects can use the proposed cross-layer communication mechanism to pass hints, and can be implemented using the proposed system architecture.

Table 18. Survey of related projects. The table compares WOSS with current approaches along a number of axes.
Project | Domain-specific / general | Production / research | API | Bidirectional | Extensible
HDFS [141] | Domain | Production | New API | N | N
PVFS [49], GPFS [137] | General | Production | POSIX | N | N
GreenStor [103], TIP [125], eHiTS [73] | General | Research | New API | N | N
Mesnier et al. [105] | General | Research | Modified disk API | N | N
Collaborative caching [51] | General | Research | Non-portable POSIX | N | N
XAM [11] | General | Specification | New API | N | Y
BitDew [70] | Domain | Research | New API | N | N
WOSS | General | Research | POSIX compliant | Y | Y

Application optimizations using storage system hints. A number of projects advocated exposing lower-layer storage system characteristics to enable optimizations in the OS or in upper-layer applications. Denehy et al. [66] propose exposing the parallelism and failure-isolation boundaries, as well as the runtime performance information, of the underlying storage devices. Using this information, the file system can optimize its operations to increase its performance and reliability by adopting storage-device performance-balanced and failure-aware data placement mechanisms. Similarly, Forney et al. [72] propose building storage-aware caching for file systems accessing a set of storage systems with heterogeneous performance characteristics (e.g., from local disks to remote storage servers). The proposed caching mechanism partitions the file system cache into one partition per storage system and dynamically sizes the partitions, depending on the workload access pattern and storage system performance, to balance the work across the storage systems. Optimization opportunities exist also in large-scale high-performance systems.
For instance exposing where replicas of input files are located (or intermediary files in workflow based applications) can enable savvy scheduling mechanisms that schedule the computation on the nodes holding the file [122]. In the current practice, this approach often relies on a standalone metadata service to track and provide file location in the system.  3.9 Discussion and Summary Cross-layer optimizations bypass a restricted, ‘hourglass’, interface between system layers. The key enabler is allowing information available at one layer to be visible at the other layer and drive optimizations. A classic example is the TCP/IP stack: in the original design, the transport layer makes the assumption that a lost packet is an indicator of congestion at the lower layers and backs-off. This assumption is violated in wireless environments and leads to degraded performance. To deal with this situation, a number of mechanisms have been designed to expose lower layers information (i.e., hints about channel capabilities/state [83, 84]) to the transport layer such that this can infer the cause of packet loss and react appropriately. Storage systems can be viewed through the same lens: the traditional (and after decades of use, convenient) POSIX file system API performs the role of the ‘hourglass’ neck: it enables transparent cooperation between applications and storage through a minimal interface. The POSIX interface, however, does not offer a mechanism to pass information between these layers. This work proposes using POSIX extended attributes in this role. I argue that this is a flexible, backward compatible, mechanism for communication between the storage and applications. Design guidelines. Two design lessons relevant to storage system design can be borrowed from the design of the network stack: First, both applications and the storage should consider metadata as hints rather than hard directives. That is, depending on specific implementation and available system resources directives expressed through custom metadata might or might not be followed at all layers of the system. Second, to foster adoption, adding support for cross layer optimizations should not (or minimally) impact the efficiency of applications or storage system not using them. For example, if the top layer (an 143  application) does not use the metadata offered by the lower layer, or decides not to pass hints, its performance should not be affected (otherwise these mechanisms are less likely to be adopted in practice as with some of the solutions that did not gain traction in the networking space [71]) This project puts forward two additional design guidelines: First, the cross-layer communication and the optimizations enabled should not break the separation of concerns between layers. A key reason for layered designs is reducing system complexity by separating concerns at different layers. Therefore, it is necessary to devise mechanisms that limit the interference one layer may cause on others even though, as I argue, there are benefits in allowing information cross between layers. Second, the distinction between mechanism and policy should be preserved. The custom metadata offer a mechanism to pass information across layers. The various policies associated with the metadata should be kept independent from the tagging mechanism itself. Cross-layer optimizations in storage systems. Apart from the use-case discussed in this chapter a number of other uses of cross-layer optimizations based on custom metadata are possible. 
They are briefly listed here (without any claim to be exhaustive): Cross-layer optimizations enabled by top-down (i.e., application to storage system) information passing. Applications may convey hints to the storage system about their future usage patterns, reliability requirement, desired QoS, or versioning. Future usage patterns: there is a wealth of cross-layer optimizations that fit in this category apart from the ones already explored in this work. These include application-informed data prefetching, and data layout. QoS requirements: Different data items can have different, application-driven QoS requirements (e.g., access performance, availability or durability, and, possibly, security and privacy). A storage system that is aware of these requirements can optimize resource provisioning to meet individual items’ QoS requirements rather than pessimistically provision for the most demanding QoS level. Versioning: Applications can use metadata to indicate a requirement to preserve past versions of a file.  Consistency  requirements:  applications can use metadata to inform the storage system that the application can tolerate lower consistency requirements (e.g., by stating continuous consistency [145] requirements at the file level). Allowing the application to loosen the default consistency requirements allows the application to manage the tradeoffs between performance and consistency. Energy 144  optimization: An energy optimized system may inform the storage system of the planned shutdown of subset of nodes in the system, the storage system can use this hint to inform the replication and consistency mechanism to avoid unnecessary replication and maintain consistency with replicas on shutdown nodes. Cross-layer optimizations enabled by bottom-up information passing. In addition to exposing location information to enable effective scheduling decisions. Additional storagelevel information (e.g., replication count, information about inconsistencies between replicas, properties and status of the storage device, caching status) could be useful when making application level-decisions as well (e.g., scheduling, data loss risk evaluation). For instance, exposing device specific performance characteristics can enable optimizing database operations [134], or optimizing the application I/O operations by matching the access pattern to the disk drive characteristics [135], or enable energy optimizations by exposing which nodes store less popular or well replicated blocks and can be shut down. Similar to Tantisiroj et al. [146] and unlike many specialized storage system that advocate abandoning POSIX to enable extra functionality or enable higher performance (e.g. HDFS), I argue that storage systems can be specialized for certain applications while supporting POSIX. For instance, HDFS provides special API for getting the file location for location-aware scheduling, setting the replication level, or extracting file system statistics. All these operations can be expressed using the proposed cross layer optimization approach to enable the same optimizations. Limitations. The proposed approach and design have two main limitations. First, the proposed per-file cross layer optimization approach assumes that data of each file is stored separately from the other files, this limits the use of this approach in systems in which a single data block can be part of multiple files (e.g. 
content addressable storage, or copy-onwrite storage system) as it is possible for separate files that share a block to have conflicting application hints.  Second, the presented design allows extending the system and add  optimization modules. This design decision is not accepted in secure storage system as it adds significant vulnerabilities. Summary. This project proposes using custom metadata as a bidirectional communication channel between applications and the storage system. I argue that this solution unlocks an incremental adoption path for cross layer optimizations in storage systems. This project 145  demonstrates this approach in context of workflow execution systems. The workflow optimized storage system (WOSS), exploits application hints to provide per-file optimized operations, further, the system exposes data location to enable location-aware scheduling. The evaluation shows that WOSS brings tangible performance gains.  146  Chapter 4 4 Summary and Impact Distributed storage system middleware acts as a bridge between the upper layer applications, and the lower layer storage resources available in the deployment platform. Storage systems are expected to efficiently support the applications’ workloads while reducing the cost of the storage platform. In this context, two factors increase the complexity of the design of storage systems: First, the applications’ workloads are diverse among number of axes: read/write access patterns, data compressibility, and security requirements to mention only a few. Second, storage system should provide high performance within a certain (dollar or power) budget. This dissertation addresses two interrelated issues in this design space. First, can the computational power of the commodity massively multicore devices be exploited to accelerate storage system operations without increasing the platform cost? Second, is it possible to build a storage system that can support a diverse set of applications yet can be optimized for each one of them? StoreGPU: Harnessing the Computational Power of Multicore Platforms. This work highlights that the significant drop in the cost of computation, brought by the technology advancement in computing devices, triggers the opportunity to redesign systems and to explore new ways to engineer them to recalibrate the cost-to-performance relation. This project highlights the dramatic change, often overlooked by system designers, in the systems’ design space and demonstrates the significant benefits brought by exploiting GPU compute power. Further, this project demonstrates the significant performance improvement brought by offloading in a domain thought to be a bad fit for GPU computing, namely data intensive operations. This project demonstrates that, after applying a set of optimizations, GPU offloading is viable even for data intensive systems. This study demonstrates the feasibility of exploiting GPU computational power to support distributed storage systems, by accelerating popular compute-intensive primitives, and demonstrates the benefits of offloading storage system compute intensive operations to massively multicore processors. This project presents a storage system prototype capable of 147  harvesting GPU power, and evaluated it under two system configurations: as a content addressable storage system and as a traditional system that uses hashing to preserve data integrity. 
Further, the evaluation evaluated the impact of offloading to the GPU on competing applications’ performance, and the impact of competing applications on the storage system performance. The evaluation results show that this technique can bring tangible performance gains enabling up to 15x speed up in hashing computation compared to multithreaded implementation on a dual CPU configuration, enabling close to optimal system performance (up to 2.3x higher system throughput) under the set of workloads and configurations considered without negatively impacting the performance of concurrently running applications. Impact: In addition to its research contributions, this project had four main impacts in the research community:   First, this project highlights the significant change in the computing systems design spectrum caused by the recent advancement in computing devices. This project demonstrates that significant benefits can be reaped and compute intensive mechanisms become viable when exploiting the computational power of multicore devices. Based on this project findings, system designers have a wider design spectrum to consider in their designs.    Second, contrary to the belief that GPU acceleration is only viable for compute intensive applications, this project demonstrates the viability of employing GPUs to support data-intensive storage system services.    Third, this work pioneered the idea of adopting GPUs as storage system accelerators. This approach, has recently been adopted by others in the storage systems community, including a GPU accelerated encrypted storage system (PTask based system [127], GPUStore [143]), a deduplicated storage system (Shredder [44] P-Dedupe [156], and GHOST [94]), low cost RAID storage [93], and file matching service [106], erasure coding based reliable storage [149].    Finally, this work has resulted in two open source software artifacts, HashGPU programing library, and StoreGPU a GPU accelerated deduplication storage system. 148  Cross-Layer Optimizations in Storage Systems This project proposes using custom metadata as a bidirectional communication channel between applications and the storage system. I argue that this solution unlocks an incremental adoption path for cross layer optimizations in storage systems. In the last two decades a number of projects proposed application informed optimization in storage system. In effect, the wide range (and incompatibility!) of past such solutions proposed in the storage space (and incorporated to some degree by production systems pNFS, PVFS, GPFS, Lustre, and other research projects [33, 37, 38, 70, 135, 142]), only highlights that adopting an unifying abstraction is an area of high potential impact. The novelty of the proposed approach comes from the "elegant simplicity" of the solution proposed. Unlike past work, the proposed approach maintains the existing API (predominantly POSIX compatible), and, within this API, uses the existing extended file attributes as a flexible, application-agnostic mechanism to pass information across the application/storage divide. This project proposes an extensible storage system design that can be extended with new optimizations that are triggered by application hints. The project demonstrates this approach in context of workflow execution systems. The workflow optimized storage system (WOSS), exploits application hints to provide per-file optimized operations, further, the system exposes data location to enable location-aware scheduling. 
The evaluation shows that WOSS brings tangible performance gains. The evaluation demonstrates, using synthetic benchmarks as well as three real-world workflows that this design brings sizeable performance gains. On a commodity cluster, the synthetic benchmarks reveal that, compared to a traditionally designed distributed storage system that uses the same hardware resources, WOSS achieves 30% to up to 2x higher performance depending on the access pattern. Further, compared to a NFS server deployed on a well provisioned server-class machine (with multiple disks, and large memory), WOSS achieves up to 10x performance gains. (NFS only provided competitive performance under cache friendly workloads due to its well provisioned hardware.) Further, under real application, WOSS enabled 20-30% application-level performance gain, and 30-50% gain compared to NFS. Finally, the evaluation on the BG/P machine at ANL shows that WOSS can scale to support larger workloads and enables sizable gains compared to the deployed 149  backend storage (GPFS). Impact: In addition to its research contributions, this project had two main impacts in the research community:   First, this project debunks the generally accepted principle of the necessity to abandon POSIX for higher performance [15, 81, 115, 139]. The project proposes an innovative approach that enables significant improvements, without abandoning POSIX. Further, unlike systems specialized for specific workloads, this work demonstrates that it is feasible to have a POSIX compatible storage system yet optimized for each application (or application mix) even if the application exhibits a heterogeneous data access pattern across files.    Second, this project prototyped a workflow aware storage system (WOSS). WOSS has proved stable in test deployments at Argonne National Laboratory where it has been used as a backend storage system for three bioinformatics workflow applications.  150  References [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] [25]  [26] [27]  [28] [29]  Elcomsoft password recovery software [cited 2008; http://www.elcomsoft.com. FUSE, Filesystem in Userspace. [cited 2011; http://fuse.sourceforge.net/. GZip compression tool. [cited 2011; http://www.gzip.org/. Iperf website. [cited 2008 April]; 2.0.2:[Available from: http://dast.nlanr.net/Projects/Iperf/. Kaspersky Antivirus [cited 2008; http://www.kaspersky.com/. Luster Storage System. [cited 2012; www.lustre.org. Lustre website [cited 2009; http://www.lustre.org/. modFTDock. [cited 2012; http://www.mybiosoftware.com/3d-molecular-model/922. Overview of DOCK. [cited 2009; http://dock.compbio.ucsf.edu/Overview_of_DOCK/index.htm. rsync files synchronization tool. [cited 2009; http://www.samba.org/rsync/. SNIA XAM Initiative. [cited 2010; http://www.snia.org/forums/xam/. Geforce 8 Series, http://www.nvidia.com/page/geforce8.html. 2008. MosaStore website. 2008 [cited 2008; www.mosastore.net. NVIDIA CUDA Compute Unified Device Architecture: Programming Guide v2.0. 2008. High End Computing Extensions Working Group 2012 [cited; https://collaboration.opengroup.org/platform/hecewg. MetaFS, http://metafs.sourceforge.net/. 2012. M. Abd-El-Malek, W. V. C. II, C. Cranor, G. R. Ganger, et al. Ursa minor: versatile clusterbased storage. in FAST 2005. E. Adar, D. Karger, and L. Stein. Haystack: Per-User Information Environments. in International conference on Information and knowledge management. 1999. P. K. 