Storage System Tracing and Analysis

by

Dutch T. Meyer

B.Sc., The University of Washington, 2001
B.A., The University of Washington, 2001
M.Sc., The University of British Columbia, 2008

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in The Faculty of Graduate and Postdoctoral Studies (Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

August 2015

© Dutch T. Meyer 2015

Abstract

File system studies are critical to the accurate configuration, design, and continued evolution of storage systems. However, surprisingly little has been published about the use and behaviour of file systems under live deployment. Most published file system studies are years old, or are extremely limited in size and scope. One reason for this deficiency is that studying clusters of storage systems operating at scale requires large data repositories that are difficult to collect, manage, and query. In effect, the demands introduced by performing detailed file system traces create new and interesting storage challenges in and of themselves. This thesis describes a methodology to facilitate the collection, analysis, and publication of large traces of file system activity and structure so that organizations can perform regular analysis of their storage systems and retain that analysis to answer questions about their system's behaviour. To validate this methodology I investigate the use and performance of several large storage deployments. I consider the impact of the observed system usage and behaviour on file system design, and I describe the processes by which the collected traces can be efficiently processed and manipulated. I report on several examples of long standing incorrect assumptions, efficient engineering alternatives, and new opportunities in storage system design.

Preface

Except where specifically stated otherwise, the contents of this thesis are my own. The research draws primarily from two published research papers which I have extended here with further unpublished analysis into the datasets gathered in the original work.

Versions of Figures 5.1, 5.2, 5.3, 5.4, 5.5, 5.7, 5.8, 5.9, 5.11, 7.1, 7.2, 7.3, 7.4, 7.5, 7.6, and 7.8, as well as Tables 5.1 and 5.2, appeared in "A Study of Practical Deduplication" published by USENIX in 2011, in which I was the lead investigator [MB11]. The research was conducted at Microsoft Corporation in Redmond, Washington. After the initial draft of this paper, which I wrote, Bill Bolosky and I collaborated in the authorship of subsequent drafts and the final research paper. Modified portions of this collaborative text are used with permission in Chapters 5, 6, and 7. Bill Bolosky wrote the SQL functions in Appendix A along with other unlisted routines used to generate the processed data from which the graphs which reference this dataset are generated.

Figure 6.1 was drawn by my advisor Andrew Warfield and myself as part of "Capo: Recapitulating Storage for Virtual Desktops" published by USENIX in 2010 [SMW+10]. Other co-authors of the Capo paper include Mohammad Shamma, Jake Wires, Maria Ivanova, and Norm Hutchinson. The excerpts of the paper that appear here, including Figures 6.2, 6.3, 6.4, 6.5, 6.6, 6.7, 7.9, 7.10, 7.11, Table 7.1, and the original text that I modified for this thesis in Chapters 6 and 7, are all my own contributions to that paper. I conducted the research and gathered the data presented in these chapters. Maria Ivanova helped write some of the tools used to process the data in Figures 6.6 and 6.4.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication
1 Introduction
  1.1 Thesis Statement
  1.2 Thesis Organization
2 A Brief Case Study: Storage System Complexity
3 Background
  3.1 Storage Tracing
    3.1.1 Applications of Workload Traces
    3.1.2 Trace Altitude
    3.1.3 Cache Considerations
  3.2 Storage Analysis
    3.2.1 The Value of Storage Analysis
    3.2.2 Storage Analysis Considerations
  3.3 The Windows Storage Stack
    3.3.1 Device Management and the Block-Layer
    3.3.2 The Virtual File System Layer
    3.3.3 Filter Manager
    3.3.4 I/O and Cache Managers
    3.3.5 Discussion
  3.4 The Traces and Analysis in This Thesis
    3.4.1 File System Content Study
    3.4.2 Storage Stack Tracing
    3.4.3 Improving the State of Storage Analysis
  3.5 Summary
4 Data Collection and Analysis
  4.1 File System Content Scanner
    4.1.1 Data Model
    4.1.2 Scanner Architecture
    4.1.3 Data Retrieval
    4.1.4 Data Collection Methodology
    4.1.5 Study Biases and Sources of Error
    4.1.6 Discussion
  4.2 Workload Tracer
    4.2.1 Data Model
    4.2.2 Architecture
    4.2.3 Data Retrieval
    4.2.4 Methodology
    4.2.5 Discussion
  4.3 Data Analysis
    4.3.1 Interactive Processing
    4.3.2 Optimizing for Content Hashes
    4.3.3 Anonymization and Release
    4.3.4 Discussion
  4.4 Summary
5 Capacity Management: Deduplication
  5.1 Deduplication Overview
  5.2 The Capacity and Utilization of File Systems
    5.2.1 Raw Capacity
    5.2.2 File Sizes and Space Consumed
    5.2.3 Disk Utilization
  5.3 Deduplication in Primary Storage
    5.3.1 The Deduplication Parameter Space
    5.3.2 A Comparison of Chunking Algorithms
    5.3.3 Characterizing Whole File Chunks
    5.3.4 The Impact of Sparse Files
  5.4 Deduplication in Backup Storage
  5.5 Summary and Discussion
6 Workload Introspection
  6.1 Performance Analysis of VDI Workloads
    6.1.1 VDI Overview
    6.1.2 Temporal Access Patterns - Peaks and Valleys
    6.1.3 VM Contributors to Load
    6.1.4 Disk Divergence
    6.1.5 Capo: Caching to Alleviate Peak Workloads
    6.1.6 Discussion
  6.2 Eliminating Wasted Effort
  6.3 On-Disk Fragmentation
  6.4 Summary
7 Namespace Modelling
  7.1 File System Namespace Characterization
    7.1.1 Namespace Complexity
    7.1.2 File Types
    7.1.3 Discussion
  7.2 Leveraging Namespaces with Differential Durability
    7.2.1 Differential Durability
  7.3 Summary and Discussion
8 Conclusion
Bibliography
Appendices
A Selected Database SQL Queries
  A.1 Mean File System Fullness
  A.2 File Systems by Count of Files
  A.3 Whole File Duplicates by File Extension
B Scanner Output Format

List of Tables

3.1 Block vs. file characteristics.
3.2 Metadata examples.
3.3 File system studies.
3.4 Storage traces.
5.1 Fraction of duplicates by file extension.
5.2 Whole file duplicates by extension.
5.3 The impact of sparseness on capacity consumption.
6.1 Workload peaks selected for more detailed analysis.
6.2 Workload duplication of effort.
6.3 Applications running in each of the peak load periods.
7.1 Differential Durability policies.

List of Figures

2.1 Microsoft Azure Storage: A simplified example
3.1 The Microsoft Windows storage stack.
4.1 File System Content Scanner architecture.
4.2 Workload Tracer architecture.
5.1 CDF of file systems by file system capacity.
5.2 Histogram of files by size.
5.3 CDF of total bytes consumed by containing file size.
5.4 Histogram of total bytes consumed by containing file size.
5.5 Total bytes consumed versus file extension.
5.6 Histogram of bytes by containing file extension.
5.7 CDF of file systems by file system fullness.
5.8 Duplicates vs. chunk size.
5.9 Deduplication vs. deduplication domain size.
5.10 Percentage contribution to deduplication by file type.
5.11 Bytes by containing file size.
5.12 Backup deduplication options compared.
6.1 Typical image management in VDI systems.
6.2 Activity measured in I/O operations over each of the 7 days.
6.3 Contributors to each of the major workload peaks.
6.4 Bytes of disk diverging from the gold master.
6.5 The architecture of Capo, a cache for VDI servers.
6.6 The benefits of delaying writeback.
6.7 File Fragmentation.
7.1 CDF of file systems versus directories.
7.2 CDF of file systems versus files.
7.3 Histogram of number of files per directory.
7.4 Histogram of number of subdirectories per directory.
7.5 CDF of number of files versus depth in file system hierarchy.
7.6 CDF of bytes versus depth in file system hierarchy.
7.7 CDF of files versus extensions.
7.8 Percentage of files versus extension.
7.9 Size and amount of block-level writes by file system path.
7.10 Total divergence versus time for each namespace category.
7.11 Percentage of writes by cache-coherence policy.

Acknowledgements

I would like to thank my advisor Andrew Warfield for his continuous guidance and support throughout my Ph.D. and for so much helpful collaboration in the research and publishing process. I have taken every opportunity to ruthlessly steal as many lessons from him as possible, and I have no intention of stopping any time soon. I would also like to thank my collaborators at UBC, Mohammad Shamma, Jake Wires, and Maria Ivanova, with whom I coauthored one paper which contributed to this thesis. Each contributed to the design and evaluation of the Capo system, which is referenced here. I would also like to thank Geoffrey Lefebvre and Brendan Cully for their guidance as senior graduate students throughout my research.

I would like to thank Microsoft for enabling and supporting analytical research, and ultimately for allowing the results to be released, complete with a public data repository. Bill Bolosky was critical in guiding me through the mechanics of conducting the disk content study and Richard Draves assisted with the bureaucracy. Bill Bolosky wrote most of the SQL queries that were used to collect the data from the disk content study database after my internship had ended.
The University of British Columbia IT department was also very generous in facilitating data collection and installing the tracing software, particularly Brent Dunnington.

Dedication

To Nanners and Finnegan and Luna. I had no idea when I started this work how much effort each of you would put into my completing it. It is no exaggeration to say that your support and sacrifices made this possible.

Chapter 1

Introduction

The performance and efficiency of enterprise storage has grown in importance as increased CPU and memory performance, combined with high bandwidth networks and dense virtualization, have exacerbated the storage bottleneck. At the same time, storage systems are growing more feature rich and thus more complicated. Unfortunately, file systems and the tools that support them have changed relatively little in their ability to elucidate the workload and data organization that define the storage system's behaviour. We frequently operate with surprisingly little specific knowledge describing what our storage systems are doing.

The lack of insight into storage system behaviour leads users and designers to accept inefficiencies that we would not otherwise tolerate. System architects miss opportunities to improve their systems because those opportunities are hidden in the system's workload. For example, one recent investigation into a specific data center workload found that a typical piece of data in a widely used storage architecture is relocated on disk 17 times when being written [HBD+14] - such a problem is relatively easy to address, once identified. Related problems hamper storage system administrators, who have very little analytic data upon which to base the decision to deploy optional features. For example, prior to the work in this thesis, there was no publicly available measurement of the compression provided by deduplication on general purpose datasets of any significant size. This leaves administrators without any basis to select from various deduplication options aside from their own direct experience. Finally, workloads initiated by users are sometimes wholly unnecessary, such as defragmentation applied by default over a shared file system, where locality may have little impact. Such a condition is particularly likely to remain unnoticed in dense computing environments where user workloads are hidden behind several layers of virtualization.

Traditionally, analysis of storage system behaviour has been difficult, tedious, and costly, which too often leads it to be extremely limited in scope and kept private. To address these challenges, I present several advances in, and a collection of findings from, the measurement of data storage systems. I draw my contributions from two case studies of live system tracing and analysis, in two different large commercially deployed enterprise storage environments. Although there are relatively few similar studies available in the published literature, this work is additionally noteworthy for considering detailed traces of large numbers of systems over long time scales, and in environments that are historically difficult to collect data from. I present an analysis of the data obtained in these case studies, which is useful for both designers and researchers. These insights include observations drawn from features that are not typically considered in depth, such as the effects of virtualization and deduplication. I also present the data analysis techniques that have made obtaining these results possible, which are themselves novel and useful to researchers wishing to recreate my work and apply it to other studies. In addition, I present the data collection techniques and tools used to gather this data, which facilitate highly detailed trace collection with low impact for those who would apply or adapt my methodology to other systems. Finally, to enable other researchers to conduct further analysis against these datasets, I have shared and published most of the data collected in this work in the rawest forms possible, which has required novel approaches to data anonymization. These contributions constitute a framework for organizations to frequently gather, persist, analyze, and share detailed storage traces, and it is meant to be sufficiently simple that it can be deployed by enterprise system administrators.

1.1 Thesis Statement

The thesis of my work is that the growing complexity of modern storage systems can be better met than it is presently, by organizations periodically collecting and retaining large, extremely detailed traces of system state and behaviour.

I define growing complexity as an increase in I/O intensive workloads, larger datasets, and more feature-rich storage systems. The trace-oriented approach that I describe here can be applied quite broadly to address storage system complexity by providing insight into many aspects of storage system operation. To support this thesis I focus on three applications of tracing and analysis: Capacity Management, Workload Introspection, and Namespace Modelling.

Specifically, I claim:

• Capacity Management: That detailed tracing and analysis can be applied to manage disk capacity more effectively than is commonly done today.

• Workload Introspection: First, that there are significant areas of misplaced effort in the operation of storage systems for enterprise workstations. And second, that tracing and analysis provides a mechanism for eliminating that wasted effort by highlighting the otherwise hidden behaviour of storage workloads.

• Namespace Modelling: That detailed analysis of the structure of file system namespaces, although rarely done, can inform and justify new storage features. In addition, that existing storage systems can directly exploit knowledge of file system namespace organization to improve performance.

In addition, throughout this thesis, based on the case studies I have performed, I present a significant and broad body of new insights about how current systems operate in the field, and how they might be improved.

1.2 Thesis Organization

In Chapter 2 I briefly expand on the complexity of modern storage systems by describing the structure of Microsoft Azure Storage, a large cloud-based enterprise storage system.

I then detail the background and motivation that drives the study of storage systems in Chapter 3. I discuss the differences between live system tracing and studying static on-disk data, and describe some of the design choices available in performing studies of each type. I then describe the case studies used to validate this thesis in the context of these choices and discuss related work.

In Chapter 4 I describe the tools I have developed for data collection and analysis, including the process used to install, collect, analyze and anonymize traces. I also discuss the challenges and costs of gathering and retaining traces and the lessons drawn from the process of analyzing the data, limitations and biases in the dataset, and possible attacks on the anonymization process.

In Chapter 5 I present a case study addressing Capacity Management in enterprise storage. In this chapter I study a large number of file systems to characterize recent changes to disk size and utilization, and measure the effectiveness of a widely deployed but poorly understood feature - deduplication.

In Chapter 6, I present a set of case studies addressing Workload Introspection in virtualized storage environments. I first provide the background information necessary to understand the virtualized environment and discuss some of the administrative challenges it poses. I then demonstrate that collecting traces of individual storage requests at much higher detail than is typical can support novel actionable investigations into system behaviour. I specifically show how tracing can identify wasted efforts in user workloads and how to eliminate them, and how analysis can measure the effectiveness of scheduled background defragmentation, which is a workload-oriented concern that administrators have long shared.

In Chapter 7, I present a case study investigating the topology of file system namespaces and the distribution of files within them, which I refer to as Namespace Modelling. I apply this to enterprise desktop workloads in two ways. I first compare the namespace models I have gathered to otherwise unchecked assertions from prior research efforts. I then go on to show that enterprise workloads can benefit from leveraging detailed traces of file system namespaces by illustrating new opportunities for optimization.

I conclude in Chapter 8 that there are many examples of simple improvements to existing deployed storage systems that are made obvious by tracing and analysis. This finding lends support to the hypothesis that there is benefit to organizations performing detailed studies of file system content and workload.

Chapter 2

A Brief Case Study: Storage System Complexity

Today, enterprise storage systems operate in a broad range of environments and exist at many different scales. However, all but the smallest systems face similar fundamental challenges: They must serve a large number of I/O operations per second, they must store a large amount of data, and they are pressured to expose complex interfaces and feature sets. The scale of each of these requirements is significantly different from storage workloads a decade ago, and it drives these systems to be very complex.

In this chapter, to help contextualize this increase in storage system complexity, I present a brief description of the structure of one enterprise storage system. Except where otherwise stated, I draw the entirety of this simplified example from the description provided in 2011 in "Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency" [CWO+11]. (In this example I omit many aspects of Azure's operation and have made changes to some component names, for the purpose of explaining the relevant structural aspects of the system succinctly.)

Microsoft Azure Storage is a cloud service in widespread use since 2010, which provides customers with pay-for-use online storage. Customers are free to grow their usage without explicit limits and the underlying architecture scalably accommodates the load.
Its interface natively supports files (called blobs), tables of structured data, and queues for message delivery. Each of these objects is native to the file system and is accessed with an independent API.

A diagram of the Azure Storage architecture is provided in Figure 2.1. Like most storage systems, it is structured in a stack of layers, each with a well defined narrow interface. The top of the stack is the layer responsible for load balancing. The load balancing layer passes requests to the partition layer, which is responsible for presenting the different data abstractions (blobs, tables, and queues), organizing those objects into a namespace so they can be located by their keys, and replicating objects to remote datacenters for disaster recovery. Below the partition layer, the stream layer organizes data updates into blocks to be written to underlying storage, and ensures that these blocks are replicated to different devices across different fault domains. In each fault domain, data is written to stream nodes which are independently responsible for storing their own data. I will now describe the mechanisms by which each layer performs these functions.

[Figure 2.1: Microsoft Azure Storage: A simplified example. The request shown at the bottom of the figure is: PUT https://myaccount.blob.core.windows.net/mycontainer/myblob HTTP/1.1 "hello world"]

At the bottom of Figure 2.1 I have formed a valid URL to write a blob of data to Azure (setting aside the authentication and authorization requirements). This request is sent over the internet to the load balancing layer, which is made simple to avoid acting as a bottleneck. Based on physical location, a load balancer routes the request to the partition layer below. Conceptually, the payload for this new blob of data will be written to disk and a record of the object will be recorded in a lookup table. However, because a lookup table for trillions of objects would be impractical, it must be split across the nodes in the system. To identify the location in the partition table that will be reserved for this object, an index in the map table is created. That table directs the request to the appropriate partition server, which writes the payload, and logs an entry of the blob's creation in a commit log. Periodically, the commit log is collapsed into a checkpoint to save space, and any stale entries are collected. Each of these four objects (the map table, commit log, checkpoint, and the blob payload itself) is written to the next layer in the stack, the stream layer. In addition, the partition layer also replicates the request to another datacenter for the purpose of recovery in the event of a datacenter-wide failure; thus at a global scale, every operation in this example is repeated twice.

At the stream layer, each write to each of these objects is split into a number of updates, and those updates are organized into blocks for storage to disk. Small writes may fit in a single update, while large writes may be split across many. Blocks are replicated across three fault domains, which are logical organizations of storage devices that partition the cluster into regions within which there are no correlated sources of failure. Among other restrictions, this means that each replica must be written to a different node in the system.

These stream nodes each maintain their own blocks, and block locations are written to an index so they can be found later. In addition, a copy of the block is written to a log as an optimization. Usually this log write is faster than writing the index and the block, which allows the stream node to acknowledge the write as complete once it is logged. The log is not indexed, but it can be replayed to rebuild the index and original block in the event of failure. These three objects at the stream node (the block, index, and log) are written to files within the Windows NT file system (NTFS). NTFS is itself a complex storage artifact, which is to say that these writes may require journaling and metadata and inode updates, in addition to writing the actual data. The tracing and analysis in my thesis is all based on NTFS, which I discuss in more depth in Chapter 3.

In sum, the single original write request, which may have been issued to Azure's URL from a user's phone or browser, and likely without their explicit knowledge, results in a tremendous cascade of subsequent requests. It touches potentially four objects at the partition layer. Each of those may be further subdivided at the stream layer, which then replicates all of the resulting requests 3 ways. Each of those replicas can yield three more writes at the stream nodes, which in turn can create subsequently more requests within NTFS. Furthermore, each point of the system's design represents a decision made by a file system developer based on workload assumptions, if not direct measurement. For example, consider the checkpoint rate of the commit log at the partition layer. An overly aggressive checkpoint rate yields write amplification through the system. On the other hand, infrequent checkpointing lengthens the log, which can hamper recovery.

This complexity in Microsoft Azure Storage is largely unavoidable because it comes as a direct response to the burdens of its workloads. In 2014, it serviced 3 million requests per second, across 30 trillion user objects [Gur14]. However, it is not unique in this regard. Many other cloud storage systems are similarly complex, yet are completely different in their implementation. Facebook's Haystack is designed for servicing petabytes of immutable photos [BKL+10]. GoogleFS is a general purpose scalable cluster file system used internally for many of Google's services [GGtL03]. Besides these and many other cloud storage examples, modern enterprise storage arrays are also generally implemented as a cluster of servers which coordinate to spread load and distribute data, which means they share most of these same drivers in system complexity. Increasingly, these systems are also servicing millions of requests per second [CWM+14], just like Microsoft Azure Storage.

By comparison, consider the canonical Berkeley Fast File System (FFS), which was originally evaluated on a single system with 920 megabytes of formatted space and a maximum throughput of 983 kilobytes per second [MJLF84]. This is no straw man: FFS remains quite influential, so much so that many of the insights drawn from that work continue to be used today. The 4 kilobyte block size employed by most file systems is a direct result of FFS, as is the practice of allocating inodes statically in each cylinder group, which makes inode count a function of disk size.
These insights were not obvious at the time; they were drawn from measurement and analysis of a then complex system. However, the parameters in FFS and their expected impacts were determined based on the measurement of a live user system with characteristics that bear little resemblance to storage systems today.

Microsoft Azure Storage is radically more complex than FFS. Even when one understands the structure and design, the overall behaviour under a load of millions of requests per second spread over exabytes of data in trillions of objects is impossible to intuit accurately. However, as was true for FFS, measurement is required to make good decisions in all storage systems, across all storage domains. The complexity of modern storage systems not only calls the validity of legacy file system choices into question, it also challenges our ability to even consider their impacts. This is because performing analysis on large-scale high-throughput systems with complex request inter-dependency and large feature sets is a difficult undertaking. In the following chapter, I provide background on tracing and analysis. I describe how the tracing in this thesis was performed, how it has been done in the past, and how that relates to file system design and architecture.

Chapter 3

Background

There is a long history of tracing and analysis of storage systems, both as a research endeavour to provide general insights into storage system behaviour [OCH+85, BHK+91, Vog99, WDQ+12, DB99, ABDL07] and as a practice to evaluate a specific idea or argument [AADAD09, AAADAD11, TMB+12, BGU+09]. In addition to the specific findings I will detail in the following chapters, this thesis builds upon that analytic tradition, while simultaneously demonstrating new techniques to facilitate more frequent, larger scale, and more detailed tracing and analysis practices. This chapter serves to contextualize my contribution and provide the background to understand the tools that I have developed and how they have been applied to working systems.

I have argued that storage systems are complex, and that useful information about their operation is often obscured. To understand why this occurs, how tracing and analysis can help, and the limits of this approach, one must understand how storage systems process requests. In Section 3.1 I begin by describing, in the general sense, problems that tracing tools face as they relate to the storage stack. I consider where tracing might and should occur, how traces are gathered, and potential pitfalls. In Section 3.2 I similarly describe the choices one is presented with when analyzing the on-disk data that results from the use of the storage stack over time. In Section 3.3 I describe the organization and architecture of the Microsoft Windows storage stack. I also highlight where Windows employs designs that differ from Linux and other operating systems. Finally, in Section 3.4 I describe the tracing and analysis I have done as part of the case studies in this thesis and compare my work to related research.

3.1 Storage Tracing

Storage traces, such as "Measurement and analysis of large-scale network file system workloads" from 2008 [LPGM08], are built from recording the activity that passes through some point in the storage stack. Since traces focus on the active use of the system, they include details about data that is accessed over the duration of the trace, but typically do not include information about files that are not accessed, and typically do not include the contents of files or of file system metadata. Typically, tracing involves loading a software module or otherwise modifying the storage stack in order to passively monitor requests. However, there are several decisions that must be made when performing such a trace. In this section I describe what can be learned from tracing storage systems and detail some of the concerns common to storage system traces.

3.1.1 Applications of Workload Traces

Traces can be applied to improve storage systems in a number of ways. For the comparative study of systems, they can be replayed either by informing the creation of a synthetic workload generator, or by direct replay. For example: Tarasov et al. used block-level traces of build servers and online transaction processing servers to generate replayable workload generators in 2012 [TKM+12]. Traces can also be studied to produce statistics and characteristics of the workload in order to guide system design. For example: A trace gathered by Narayanan et al. in 2008 was used to detect hot-spots in file accesses within a storage cluster in order to divert workload to less loaded machines [NDT+08]. When applied to existing systems, traces can often identify performance bottlenecks, misconfigurations, or flaws by highlighting areas of poor performance or wasted effort [HBD+14]. As storage systems grow more complex and workloads become more intensive, there are more opportunities for such inefficiencies to go unnoticed.

Each of these applications of tracing is featured in the rightmost column of Table 3.1, which lists the data associated with individual entries in most storage traces. The leftmost column lists the corresponding characteristic of the workload that is typically accessible in a trace of the system. Of course, these results can be made more valuable by aggregating them. For example, one can determine and compare the I/O operations per second of different systems, or the read/write percentages of common workloads. The middle column describes the approximate layer of the storage stack (block or file) where that characteristic is typically available. Some characteristics are available at both file- and block-levels, though their interpretation would be different depending on the level in question. Further, tracing typically only provides insight into one point of the stack, so not all information will be available in every trace.

Characteristic | Layer | Example Application
disk offset    | block | Characterizes linearity of access to disk.
request size   | both  | Measures workload throughput for evaluation.
issue time     | both  | Ensures correct timing in a replayable workload.
req. latency   | both  | Measures performance of underlying storage system.
read/write     | both  | Sets optimization priorities based on request.
request type   | file  | Open-to-close duration provides file lifetimes.
target file    | file  | Access distribution is used in workload modelling.
file offset    | file  | Linearity of access informs prefetching systems.
source app.    | file  | Attributing problematic workloads to applications.
access flags   | file  | Flags (sync, temp) give hints for workload prediction.
cache status   | file  | Analyzes effectiveness of page cache.

Table 3.1: Block vs. file characteristics. The different characteristics typically measurable at block and file-layers and how they might be used.
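To make this kind of aggregation concrete, the following sketch computes a few of the summary statistics named above from a block-level trace. It assumes a hypothetical record format with fields corresponding to entries in Table 3.1 (issue time, read/write flag, disk offset, and request size); the field names are illustrative and do not correspond to any particular tracing tool.

from dataclasses import dataclass

@dataclass
class BlockTraceRecord:
    # Hypothetical block-level trace record; field names are illustrative.
    timestamp: float     # seconds since the start of the trace
    is_write: bool       # read/write flag
    disk_offset: int     # logical block address, in 512-byte sectors
    size: int            # request size, in bytes

def summarize(records):
    """Aggregate a block-level trace into a few workload statistics."""
    if not records:
        return {}
    duration = max(r.timestamp for r in records) - min(r.timestamp for r in records)
    writes = sum(1 for r in records if r.is_write)
    # Count a request as "sequential" if it begins where the previous one ended.
    sequential = 0
    for prev, cur in zip(records, records[1:]):
        if cur.disk_offset == prev.disk_offset + prev.size // 512:
            sequential += 1
    return {
        "iops": len(records) / duration if duration > 0 else float("nan"),
        "write_fraction": writes / len(records),
        "sequential_fraction": sequential / max(len(records) - 1, 1),
        "bytes_per_second": sum(r.size for r in records) / duration
                            if duration > 0 else float("nan"),
    }

Comparable aggregation at the file-layer would group records by target file or source application rather than by disk offset.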
I return to this table in the next section, when discussing the relative merits of tracing at different points in the stack.

3.1.2 Trace Altitude

The modularity of the storage stacks in most file systems affords multiple locations where one can trace, and each yields different results. The higher levels of the file system are useful because they provide access to the relatively rich file system API. At this layer, for example, distinctions between file data and file metadata access are apparent, and one can generate a trace suitable for replay against a file system. As shown in Table 3.1, there is clearly richer semantic information at the file-layer; however, the information at the block-layer is generally no less useful. The logical disk offsets where data actually lies are critical to understanding the linearity of workloads, which can have a huge impact on disk performance, but are typically unavailable above the block-layer. Furthermore, because requests are frequently added and removed throughout the storage stack, the sum of requests that are ultimately issued to hardware is only visible at the lowest layers of the stack.

Another place to trace is above the file system entirely - at the application-level. Application-level tracing may provide some increased ability to understand semantic patterns in application workloads [YBG+10], but also removes any insight into file system or disk operation. Most commonly, such traces are used to build synthetic workload generators or replayable trace applications. Furthermore, application traces are usually employed when there is one critical application running on a system, such as an NFS server, that will be generating most of the storage requests.

3.1.3 Cache Considerations

The other primary concern with respect to tracing is the trace point in relation to memory caches. For example, when building a replayable trace, it is often helpful to both trace and replay below the cache so as to normalize the effect of caching between the original host and the system subject to replay. In contrast, some traces, such as "File System Usage in Windows NT 4.0" from 1999 [Vog99], have included trace information above the cache in order to study the behaviour of the cache management sub-system itself.

Block device drivers generally operate below the cache, and although the unified cache in Linux does include a buffer cache that is capable of block-level caching, in practice most population of that cache is done at the file-level. With the Windows Cache Manager, I/O is visible to all filters above the Virtual File System (VFS)-layer whether the request is cached or not, and so tracing at this layer is generally parallel to the cache - neither above nor below. I will discuss this arrangement in more depth in Section 3.3.

3.2 Storage Analysis

I now turn to discuss storage system analysis, which differs considerably from active tracing. Whereas traces passively observe live activity from a running system, a storage metadata or content study generally queries the storage system in order to discern the state that prior workloads have driven it into. There are a wide range of characteristics of storage systems that can be studied from this perspective, but most attributes of on-disk data can be broken into four categories: First, information about file system metadata, which is the data that a file system stores about each of its files and the file system generally. Second, the organization of files within the file system, such as how they are placed within directories. Third, where data is ultimately placed on disk. And finally, analysis of the data within the files themselves. This section describes these characteristics in more depth and explores how they can best be analyzed and what can be learned from them.

3.2.1 The Value of Storage Analysis

The structure and content of a file system provides many opportunities for performing interesting analysis. Table 3.2 provides a small sample of attributes that are typically available, and a categorization of the type of attributes according to the four categories above.

Example Attribute | Attribute Type
File size         | File Attribute
File flags        | File Attribute
File type         | File Attribute
Time stamps       | File Attribute
Space Utilization | FS Attribute
File name         | Namespace Organization
Path              | Namespace Organization
Hard links        | Namespace Organization
Data Fragments    | On-Disk Location
Compressibility   | Data Attribute
Deduplicability   | Data Attribute

Table 3.2: Metadata examples. Examples of metadata that can be included in a storage system analysis.

In the past, analyses that have drawn on findings like those in Table 3.2 have leveraged better understanding of real world namespace organization to build synthetic file system generators [AADAD09], which mimic the layout of real file systems but can be parameterized and used to test new applications or storage systems. Similar results in the file attributes domain can also be used to inform the structure of file systems, for example, guiding designers to correctly optimize for the ratio of small to large files. The distribution of file sizes is a useful statistic when designing a file system because it informs, for example, the portion of files that can be stored directly with the file metadata (in Windows, such a file can exist entirely in the Master File Table) and the portion of files that will need an indirect reference to a block of on-disk data. Similarly, knowing which file types tend to be large is useful when allocating contiguous space on the disk. Better characterizations of on-disk data placement influence research in data layout optimization [HSY05, BGU+09]. Understanding the structure of data itself has many applications as well, including the inspiration and evaluation of new data management techniques such as compression and deduplication [SPGR08, CGC11].

As file systems grow larger in capacity and the feature sets they expose grow more complex, understanding these attributes becomes more important. For example, very large systems are often driven to combine storage resources through virtualization, and this creates opportunities for shared caching or redundancy eliminating storage; however, the effectiveness of such techniques depends on the similarity of data across different machines. Understanding the real world applicability of such a mechanism prior to investing in its construction is only possible with tracing and analysis of existing systems.

3.2.2 Storage Analysis Considerations

Unlike storage stack tracing, on-disk data analysis is almost always performed at the file-level, where the semantic information associated with files can provide structure to the data. Presuming one finds a suitable collection of file systems to measure, storage system analysis typically struggles with two mechanical issues: scanning the file system is time consuming, and the resulting datasets can be very large.

The severity of both issues depends on what is actually read. If only file metadata is considered, the data can typically be gathered in minutes and stored on the order of megabytes per file system investigated. This can be done by reading the Master File Table in Windows or the inode tables in Linux's Ext2/3 file systems. If the study requires processing file data, as the analysis in this thesis does, the process takes considerably longer. In either case, the process takes long enough that the file system is likely to be changing while the data is gathered. For many types of analysis, this is undesirable, because it means the file system is not in a crash consistent state as it is read, and thus may include inconsistencies. If file system snapshots are available, as they increasingly are in modern file systems [RBM13, Cor10a], they may be used to create a crash consistent file system image. As I show in Section 4.1.2, where I describe my own file system analysis framework, Windows can create a fully consistent view of the file system for many applications. If snapshots are used in this manner, care must be taken that they themselves are not treated as subjects of the data analysis.

A second concern with a file system study is where the data generated in the study is stored. Storing the data directly on the file system itself necessarily changes that file system. Alternatively, storing the data on a remote network storage target, particularly if the dataset is large, risks data loss due to unavailability of the network store. One solution is to eliminate entries associated with the storing of data from the trace in post processing, which removes much of the effects of tracing, though doing so still potentially changes the physical layout of the data on disk somewhat.
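As a minimal sketch of the metadata-only case described above, the following example walks a directory tree and records the kinds of per-file attributes listed in Table 3.2. It uses only the Python standard library and is purely illustrative; it is not the scanner described in Chapter 4, and it reads the namespace rather than the Master File Table directly.

import os
import csv

def scan_metadata(root, out_path):
    """Walk a file system subtree and record basic per-file metadata."""
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["path", "size", "mtime", "extension", "hard_links"])
        for dirpath, _dirs, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    st = os.lstat(path)  # lstat: do not follow symbolic links
                except OSError:
                    continue  # file disappeared or is inaccessible; skip it
                writer.writerow([
                    path,
                    st.st_size,
                    int(st.st_mtime),
                    os.path.splitext(name)[1].lower(),
                    st.st_nlink,
                ])

# Example (hypothetical paths): scan_metadata("C:\\Users", "metadata.csv")

Even a scan this small illustrates the second concern raised above: the output file itself must live somewhere, and if it is written to the file system being scanned it should be excluded from the results.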
3.3 The Windows Storage Stack

In this section I describe the Windows storage stack because it is the subject of the tracing and analysis in this thesis. Tracing in Windows is challenging because of the breadth and complexity of its APIs. However, it is also rewarding because Windows represents a large installed base of computer systems.

Like most storage systems, the Windows storage stack is structured as a collection of layered software modules that each transform storage requests from the high-level interface enjoyed by applications progressively down to the raw access provided by the disk drive itself. This layering serves two purposes. First, it simplifies the storage system. At each layer a developer can restrict their concerns primarily to the operations and semantics required by requests at that layer. Second, it provides extensibility, in that most operating systems provide some mechanism for adding multiple file systems and other custom software modules to the storage stack. Both Linux and Microsoft Windows provide this support by allowing administrators to load modular kernel-mode drivers that manipulate storage requests.

An illustration of the structure of the Windows storage stack is shown in Figure 3.1. For the sake of simplicity I have limited the discussion to components relevant to this thesis. Through the remainder of this section I will describe the Windows storage stack from the bottom to the top. A more comprehensive discussion can be found in [RSI12].

[Figure 3.1: The Microsoft Windows storage stack.]

3.3.1 Device Management and the Block-Layer

At the bottom of the storage stack in Figure 3.1 is the block-layer, which operates very much like a simplified interface to an underlying disk device. Each request is either a read or a write to a logical block address on the device and is aligned to a 512-byte boundary. However, requests are almost always aligned to a 4KB boundary and a multiple of 4KB in size. There are no files or directories at this layer. The stack is also bidirectional, in that each driver can register to see requests issued (down the stack), and separately to see them completed (up the stack). Though the programming interface is quite different, this layer is analogous in its position and power to the Device Mapper in Linux. Every software module at this layer serves to prepare requests for the underlying hardware, and modules are ordered in the stack according to the order in which they are loaded into the operating system.

The lowest layer of the stack I consider in depth is where upper filter drivers are installed, but as the name suggests, there are lower layers still. Upper filter drivers operate on a data structure called an I/O Request Packet (IRP), which encapsulates an I/O request through most of the storage stack. Below this point, a storage class driver, such as the generic iSCSI driver, transforms the IRPs into other structures. Less conveniently for the sake of tracing, the storage class driver may also split and merge requests. This means that while upper filter drivers see patterns of requests that closely match those of the underlying device, they do not have visibility into some operations performed at the lowest layers of the stack.

3.3.2 The Virtual File System Layer

As shown in Figure 3.1, above the block-layer Windows supports a virtual file system interface in which multiple file systems can be installed, as does Linux. However, most Windows workstations use NTFS as their primary file system (unlike Linux, where there are several commonly used file systems, a single file system - NTFS - is used exclusively on almost every non-removable drive in every Windows workstation), and as NTFS is a commercial off-the-shelf closed-source product, insight into the specific transformations at this layer is challenging. Generally, like UNIX file systems, NTFS transforms requests upon files, directories, and metadata into the relatively simple block-level requests of the layer below.

3.3.3 Filter Manager

Every layer above the VFS operates on the file API. Like the layers below, the IRP structure is used here, but this interface is considerably larger and richer than the block-layer's. It includes read, write, file open and close, and create and delete. Among other filter drivers that may be installed on a Windows workstation, Figure 3.1 highlights the filter manager, which exposes a framework for writing minifilter drivers (minifilters). Like filter drivers generally, minifilters register routines that process requests made by Windows programs and users. Unlike filter drivers, minifilters feature a simplified interface, load in a deterministic order, and can be dynamically unloaded. Each routine may forward requests it receives down the stack, or complete them without passing them on. They may also issue new requests, which I will discuss further in the next section.

Each minifilter driver is assigned a unique altitude, which is a number that determines when it is loaded and where it resides in the storage stack. For the purposes of tracing, one would conventionally use an altitude in the FSFilter Activity Monitor group, which includes altitude values 360000 to 389999 for "drivers that observe and report on file I/O" [Cor14b]. There are 22 other altitude groups available, but the Activity Monitor group is quite high in the order. This relatively high altitude means that traces obtained at this level closely match the original stream of requests made by applications, because they have not yet been transformed by drivers of lower altitude.

3.3.4 I/O and Cache Managers

In Figure 3.1, two more components lie above and beside the filter driver layer, respectively. The filter drivers receive requests from the I/O manager, which is the arbiter of all requests onto the storage stack. The I/O manager acts as a central point of dispatch to deliver IRPs to the correct layer of the stack in the correct order. This is true even if the requests originate from drivers themselves, in which case the I/O is performed recursively. Recursive I/O passes back through the same driver again on its way from the I/O manager down the stack. Like non-recursive I/O, it also triggers completion events on the way back up the stack. This design has many implications, among them that file requests issued by the file system itself will ultimately be seen above the file system by the routines of file system filters.

Another important consequence of this design is that, unlike a canonical storage stack which has a singular layer that services cached requests, the cache manager in the Windows storage stack sits aside the storage stack. Requests that are serviced by the cache manager still traverse the entire stack through to the VFS-layer, but do so with a flag denoting the operation as one that operates on the cache. Similarly, paging operations that move data out of the cache traverse the stack the same way (through the file-layer), but with different flags. As a result, file system filter drivers can all see the paging activity caused by the cache manager.

This organization also means that any I/O created by a tracing process must be reissued through the I/O manager and thus will be visible to the layer that issued it. This can be an issue if a tracing system must issue I/O requests to write the tracing logs to disk, or if the tracing driver must issue new requests to, for example, determine metadata such as a file name. Generally, such artifacts of tracing are best removed from a trace.
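As a rough illustration of that post-processing step, the sketch below filters a file-level trace to drop requests generated by the tracing infrastructure itself, such as writes to the trace log or requests issued by the collector process. The record fields and the identifying values are hypothetical; a real trace would need its own way of recognizing which process and paths belong to the tracer.

def strip_tracer_artifacts(records, tracer_process="tracer.exe",
                           log_path_prefix="C:\\traces\\"):
    """Remove trace records generated by the tracing system itself.

    Each record is assumed to be a dict with (at least) the keys
    'source_app' and 'target_file'; both the keys and the default
    values above are illustrative, not from any particular tool.
    """
    kept = []
    for rec in records:
        if rec.get("source_app", "").lower() == tracer_process:
            continue  # I/O issued by the trace collector process
        target = rec.get("target_file", "")
        if target.lower().startswith(log_path_prefix.lower()):
            continue  # writes that persist the trace log to disk
        kept.append(rec)
    return kept

Filtering after the fact keeps the in-kernel component simple, at the cost of recording and later discarding the tracer's own events.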
Further, for any given workload, the behaviour at different layers ofthe stack may be quite different. This means that intuiting storage systembehaviour without a trace is extremely challenging, and yet tracing is usedin a vast minority of research storage system evaluations [TBZS11]. Theproblem is amplified with heavier load and with the size of the storage stackAPI. This also means that no small number of traces can characterize abroad pool of workloads at multiple layers of the stack. Part of my thesis,as described in Chapter 1, is that the benefits of understanding storage sys-tem response to workloads are significant, and that more tracing in moredepth is required to facilitate that understanding.In the following two sections, I discuss how to approach instrumentationof the storage stack and how to analyze the effects of requests on a filesystem. I then discuss existing research as it relates to the tracing and273.4. The Traces and Analysis in This Thesisanalysis I will present in this thesis. In Chapter 4, which follows, I describethe tools I have written to collect data and how those tools interact withthe storage stack.3.4 The Traces and Analysis in This ThesisNext, I will describe the traces and analysis conducted to support this thesis;I present two significant case studies. The first is an on-disk data study offile system content gathered in 2009 at Microsoft Corporation. The secondis a trace of storage system activity at the University of British Columbiain 2010. Collectively, these studies characterize behaviour at both file- andblock-layers of the stack, include most file and file system metadata, thecontent of files themselves, and describe detailed workloads at scales upto one week, as well as the changes in on-disk content over the course of amonth. In this section I describe these traces as they relate to other research.3.4.1 File System Content StudyMy study of file system contents and metadata includes nearly 1000 ma-chines which were scanned at Microsoft Corporation in Redmond, Washing-ton over 6 weeks. The bulk of my analysis in this thesis focuses on 857 filesystems spanning 162 terabytes of disk over 4 weeks, selected because I havereliable data from each of these systems for every week in that period. Theresults includes data and metadata from a broad group of employees, includ-ing software developers, management, sales & marketing, support, writersand legal staff.283.4. The Traces and Analysis in This ThesisThese results are notable in three ways. First, as I show in Table 3.3below, analysis of file system metadata is rare. Mine is the only publishedstudy in the past decade to provide a characterization of deployed work-station file system metadata. Second, my study considers snapshots of filesystem content on a weekly basis for four weeks. This means that in addi-tion to characterizing large scale trends in file system evolution by comparingmy results to those of similar studies, my dataset can identify shorter termtrends such as the portion of files that are unmodified week-to-week. Fi-nally, this study is the first to measure and record the content of the fileswithin a large set of general purpose machines. This allowed me to drawconclusions about the rate of file modification as well as to compare sev-eral approaches to a storage feature called deduplication. Deduplication isa process by which two or more large regions of identical data can be storedonce, in order to reduce data sizes. 
I discuss deduplication in more depth in Chapter 5.

The impact of working at this scale was significant, as the highly compressed logs gathered in this study comprised more than 4 TB of data. The analysis of this data, particularly that which is described in Chapter 5, required identifying and counting small common sub-strings in that corpus, which is a time consuming and disk and memory intensive workload. Microsoft Corporation has released an anonymized version of this data to be housed at the SNIA IOTTA storage trace repository for public use [Ass13]. I consider the results of this study throughout this thesis, in Chapters 5, 6, and 7.

Study Source | Year of Study or Data | Metadata or Data | Count | Data Content Analyzed | Total Compressed Study Size | Duration
Douglis [WDQ+12] | 2012 | both (2) | 8 FSes (3) | 682 TB | Unknown | 1 week
Lu [LCGC12] | 2012 | data | 4 Datasets | 10 TB | Unknown | Point-in-time
Park [PL10] | 2010 | data | 6 Datasets | 150 GB | Unknown | Point-in-time
Meyer [MB11] | 2009 | both | 3,500 FSes | 162 TB | 4 TB+ | Weekly, 4 weeks
Zao [ZLP08] | 2008 | data | 2 DCs | 61 TB | Unknown | Daily, 31 days
Policroniades [PP04] | 2004 | data | 4 Datasets | 200 GB | Unknown | Point-in-time
Kulkarni [KDLT04] | 2004 | data | 5 Datasets | 8 GB | Unknown | Point-in-time
Aggrawal [ABDL07] | 2000-04 | metadata | 70,000 FSes | N/A | 91.1 GB | Yearly, 5 years
Douceur [DB99] | 1998 | metadata | 10,000 FSes | N/A | 3.7 GB | Point-in-time

Table 3.3: File system studies. Significant published studies of file-system data and metadata published since 1999. Count units: FS = file system, DC = data center, Dataset = a collection of subdirectories.
(2) Study included data pertaining to file lifetimes within the deduplication system itself, not the user metadata subject to backup.
(3) Over 10,000 machines were used for metadata analysis based on autosupport logs, whereas 8 datasets were used for deduplication analysis.

There are relatively few large scale studies of on-disk metadata or data in the published literature. In Table 3.3 I list the significant file system metadata and deduplication content studies published since 1999.

In 1999 [DB99] and then in subsequent years until 2004 [ABDL07] very large pools of workstations were analyzed for file system metadata. Relative to the work in this thesis, those studies include a larger number of workstations because they did not have the burden of analyzing on-disk content, only metadata. These studies are extremely valuable in the context of the metadata results in this thesis, because they provide an older snapshot of file system usage against which to compare.

There are also relatively few analyses of deduplication ratios, and the ones that exist suffer from several common limitations. Policroniades [PP04] studied a small corpus of data from a variety of sources in 2004 (200GB, about 0.1% of the size included in this thesis) which, contrary to my results, found little whole file deduplication, and a modest difference between fixed block and content-based chunking. Kulkarni [KDLT04] used an even smaller corpus in 2004 (8GB) to analyze fixed block deduplication, compression, and delta-encoding. A similar study was performed by Park [PL10] on a dataset of 25GB in 2010. Others have attempted to measure deduplication rates against small corpuses of synthetically generated data [JM09], which may be more easily scaled to large sizes, but is of questionable accuracy.

The largest studies of real-world deduplication rates are performed by deduplication vendors themselves [ZLP08, WDQ+12].
These rates provide useful data points, but also reflect self-selection in that they measure a corpus of users who have decided that they would benefit from deduplication. In addition, it is difficult to distinguish the innate deduplicability of the data from the deduplicability of the workload. For an in-line stream-based deduplication system there is little distinction between the two. If a user has a single file system that at any point in time has just 1% data duplication, but one then creates a full backup of that system 100 times, the deduplication rate would be reported by commercial vendors as approximately 100x. Results in this manner present a useful insight into the workloads applied to deduplication systems (how many times files are replicated for the purposes of backup), but do not necessarily reflect the potential for reducing the overall work that a deduplication system must do, for example, by employing alternate approaches to deduplication which I will discuss in Chapter 5. Relative to the results in this thesis, Wallace, Douglis, et al. [WDQ+12] published after (and in some sense perhaps, in response to) the work I describe here. They found much higher deduplication rates across a larger corpus of data, but in many cases this is a question of parameterization rather than data content – the data was simply backed up more times in practice than the 4x assumed in my study. Regardless, their insights into the workloads that drive stream-based deduplication systems are unique and extremely valuable.

With respect to Table 3.3, compressed study sizes are not published for many of these datasets, and the datasets themselves are not publicly available, so their study size, shown in the second column from the right, is unknown. In each such case they are likely to be much smaller in terms of the size of the data analyzed than this thesis. Even in the case of Douglis [WDQ+12], where 10,000 disk images were considered, only file metadata was collected over that set, and from the 8 very large file systems that made up the 682 TB of data content studied, only the aggregate results of deduplication were retrieved. In some ways this is a good thing, as smaller study results are easier to manipulate. However, it also precludes any further in-depth analysis if the data from such a study were to be made public, and means that those efforts did not face the same challenges associated with processing large datasets.

A few systems have approached the problem of deduplication in primary storage [DGH+09, UAA+10], and have used largely synthetic workloads in their evaluation. While the compressibility results appear roughly in line with my own, it has yet to be seen if aggressive deduplication in the data path is practical from a performance perspective. In "Insights for Data Reduction in Primary Storage: A Practical Analysis" [LCGC12], published after the work in this thesis, researchers roughly confirmed many of my findings with respect to deduplication rate.

Both commercial and academic deduplication systems should benefit from the results of my file system study [LEB13, KR10, CAVL09, BCGD00, LEB+09, KU10] in being able to make more informed choices about the trade-offs associated with deduplication. Similarly, these efforts may spur continued research into developing accurate models and benchmarks for deduplication analysis, which has only just started [TMB+12].
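To make the backup-multiplier effect described above concrete, the following small calculation is an illustrative sketch with assumed numbers (not figures drawn from any study cited here). It shows how repeated full backups dominate a reported deduplication ratio regardless of how duplicated the data itself is.

```python
# Illustrative only: how repeated full backups inflate a reported
# deduplication ratio, independent of the data's intrinsic duplication.

fs_bytes = 100 * 2**30        # a hypothetical 100 GiB file system
intrinsic_dup = 0.01          # 1% of its data is internally duplicated
full_backups = 100            # the same system is backed up 100 times

logical = fs_bytes * full_backups          # bytes presented to the dedup system
stored = fs_bytes * (1 - intrinsic_dup)    # roughly what a dedup store retains

print(f"reported ratio ~ {logical / stored:.0f}x")                  # ~101x
print(f"ratio from the data alone ~ {1 / (1 - intrinsic_dup):.2f}x")  # ~1.01x
```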
3.4.2 Storage Stack Tracing

There are more publicly available traces of real Linux or Windows system workloads than there are disk-content studies. However, there are still surprisingly few, and little about tracing methodology has changed since Vogel's analysis of Windows behaviour in 1999 [Vog99]. I have attempted to improve the state of knowledge and practice of file system workloads by publishing the traces in this thesis, and extending live workload collection practices by including richer file system and block-level metadata. I have also proposed simplifying the practice of creating and collecting these traces by leveraging increasingly common virtualized environments, which was cited in 1999 as one of the major challenges in performing this type of research [Vog99].

My workload study consists of a week-long trace of all storage traffic from a production deployment of 55 Windows Vista desktops in an executive and administrative office of a large public organization. The traces were gathered by University of British Columbia's IT department, UBC IT, under my supervision and using tools I developed. My trace is shown in bold in Table 3.4 along with all relevant traces of storage system activity published since Vogel's 1998 trace of Windows workstations, which was published in 1999 [Vog99]. Traces are sorted by the year they were conducted.

My study is notable in that it includes simultaneous tracing of the file- and block-levels of the operating system.(4) This has allowed me for the first time to relate disk access to the region of the file system for which the access is directed. I have also studied NTFS file systems in a relatively unexplored context - that of a virtual desktop deployment. Thus, the block- and file-layer shown in Table 3.4 refers to the virtualized guest as opposed to the virtualized host. Finally, the tracing in this study is facilitated by the virtualized environment, wherein a tracing driver could be installed and tested once, then deployed to all machines in the cluster instantly. I consider this study throughout this thesis, with the bulk of workload analysis in Chapters 6 and 7.

(4) Here I am tracing a virtualized environment, and have chosen to trace the guest operating system, as opposed to the host. This means I am tracing the logical block-level as seen by the guest, which will be quite different from the physical block-layer in the host operating system. In the context of this trace I use the term "disk" throughout this thesis to refer to the virtual disk as seen by the guest operating system.

Trace Source | Year of Trace | Layer | Request Count | Trace Size (5) | Machine Count | Duration
Facebook [HBD+14] | 2013 | Application (HBase) | 5.2B | 116 GB | Unknown | 8 days
Yahoo [ALR+12] | 2012 | FS Metadata (Hdfs) | Unknown | Unknown | 4000+ | 1 month
M45, CMU [RKBH13] | 2010-12 | Application (Hadoop) | 100K | Unknown | 473 | 20 months
UBC [SMW+10] | 2010 | Block+FS | 300M+ | 75GB | 55 | 8 days
FIU srcmap [VKUR10] | 2008-09 | Block | 26.7M | 14.6 GB | 1 | 1 month
MS Prod Other [Ass14a] | 2008 | Block | 163M | 2.6 GB | 6 | 1 day
FIU home [KR10] | 2008 | Block | 18M | 2.1 GB | 1 | 20 days
FIU mail [KR10] | 2008 | Block | 443M | 11.6 GB | 1 | 29 days
FIU web [KR10] | 2008 | Block | 14M | 351 MB | 1 | 29 days
MS Prod. Build [Ass14a] | 2007 | Block | 1.2B | 13.0 GB | 1 | 18 hours
MS Exchange [Ass14a] | 2007 | Block | 64M | 1.4 GB | 1 | 1 day
Animation07 [And09] | 2007 | Application (NFS) | 144B | 2 TB | 1-50 (6) | 9 months (7)
NetApp2 [LPGM08] | 2007 | Application (CIFS) | 352M | 1.5 TB (8) | 1 | 97 days
NetApp1 [LPGM08] | 2007 | Application (CIFS) | 228M | 750 GB (8) | 1 | 65 days
MSR [NDT+08] | 2007 | Block | 434M | 29.0 GB | 13 | 7 days
Animation03 [And09] | 2003 | Application (NFS) | 19B | 798 GB | 1-50 (6) | 6 months (7)
Harvard email2 [Ass14a] | 2003 | Application (NFS) | 2.1B | 32.5 GB | 3 | 7 weeks
Harvard email [Ass14a] | 2002 | Application (NFS) | 1.7B | 24.7 GB | 3 | 6 weeks
Harvard home04 [Ass14a] | 2001 | Application (NFS) | 295M | 2.5 GB | 1 | 4 weeks
Harvard home03 [Ass14a] | 2001 | Application (NFS) | 2.1B | 30.8 GB | 1 | 14 weeks
Harvard home02 [ELMS03] | 2001 | Application (NFS) | 3.3B | 49.0 GB | 1 | 16 weeks
HP instructional [RLA00] | 2000 | FS | 318M | 68.9 GB (9) | 19 | 31 days
HP research [RLA00] | 2000 | FS | 112M | 24.3 GB (9) | 13 | 31 days
HP desktops [RLA00] | 2000 | FS | 145M | 31.4 GB (9) | 8 | 31 days
Berkeley PCs [HS03, ZS99] | 1992-00 | Block | 13M | 5.8 GB | 18 | 45 days (7)
Cornell PCs [Vog99] | 1998 | FS | 190M+ | 19 GB | 45 | 4 weeks

Table 3.4: Storage traces. Significant published traces of live storage system activity published since 1999.
(5) Compressed, unless otherwise noted.
(6) In some time periods, 50 caches of the central NFS server may be seen as additional servers, in other periods only a single server downstream of the cache is seen.
(7) Tracing was enabled and disabled periodically over this time.
(8) Uncompressed TCP dumps.
(9) Scaled as percentage of requests in 150GB aggregate workload.

In comparing my trace to other work, one of the highest quality enterprise workload traces is from the evaluation of write off-loading [NDT+08, NDR08] which used data gathered in 2007. This trace contains a collection of several isolated independent servers, each traced over a relatively short period. My overall workload is similar in size and duration, but focuses on a set of workstations operating concurrently. Still, much of the basic characterization of requests is similar, for example both show an approximately 2:1 write to read ratio and bursty access. In both cases, the data was used to motivate systems that attempted to address peaks in workload, albeit through different mechanisms, and both traces gather information at the block-layer, though my trace simultaneously captures file-level information. There have been few published traces capturing file-level information, and the other block-level traces have been categorically smaller, with the exception of the trace of the mail server at Florida International University (FIU), which is a significant, but very homogeneous and specialized workload from a single server. Like my trace, the FIU traces have at times traced virtual machines, capturing file- or block-level requests as they appear to the OS, but not the underlying virtual host system.

Since 2001 there have been a number of traces of the NFS and CIFS protocols that can be largely clustered into three bodies of work: the Harvard traces from 2001 [ELMS03], the NetApp traces from 2007 [LPGM08], and the traces of an animation cluster in 2003 and 2007 [And09]. Of these, the animation traces stand out for their size. However, due to the size, the results only cover periods of time for which the trace recording server was not full, which was a minority of the time.
The result is a non-contiguous collection of large and intense workload traces, starting and stopping at seemingly random points, with different collections of traces using different methodologies and trace settings. Still, this is a useful trace for many purposes, and is one of the very few public traces that captures an intense workload. Naturally, these protocol-level traces capture the workload as it appears over the network, but not at the underlying device or file system.

Many traces are now more than a decade old and are of questionable relevance today [ZS99, RLA00, HS03, ELMS03]. Still others have made interesting analytic contributions by tracing synthetic workloads [HDV+12, GKA09, WXH+04, THKZ13]. Since publishing the results in this thesis, some have called for increased attention to system workloads in a variety of environments from data centers [RKBH13] to desktop computers [HDV+12], and each has contributed new results. However, the storage landscape continues to expand quickly. As has been pointed out by Tarasov recently in the context of virtualized NAS storage [THKZ13], our production environments have outpaced many older workload models. One potential solution, the one I call for in this thesis, is to perform workload traces frequently, in a wider variety of environments, over larger sets of data, and to dig deeper into richer information gathering. My tracing dataset makes contributions in these areas, but suffers primarily from being small, even as it is relatively similar in size to most comparable general purpose workstation traces [Vog99, NDT+08, LPGM08]. In the future, the scalable methods of deployment and analysis I describe in this thesis may lead to still larger studies.

3.4.3 Improving the State of Storage Analysis

One application of system traces and file system models that I have discussed is applying them toward improving storage system evaluation. Researchers have noted many long-standing challenges associated with system evaluation [TJWZ]. Recently Tarasov et al. warned that without changing our evaluative criteria, "our papers are destined to provide incomparable point answers to subtle and complex questions" [TBZS11]. To address these concerns, systems have been developed to facilitate analysis in the past [WJK+05, AWZ04], or that suggest modifications to current practices [SS97]. My approach is complementary to these efforts in that I have significantly expanded the body of findings, and raw data, available to researchers while offering several new methods for collecting and processing future studies of this sort.

By adding to the published body of knowledge about use and behaviour of live systems under real workloads, my dataset helps facilitate better system modelling [AADAD09, AAADAD11], replay [TKM+12], and understanding of specific features such as deduplication [TMB+12, HMN+12]. Researchers have continued to make these methods more accurate and deployable, for example, by extending tracing to include causal links between requests both in the storage stack [JWZ05, WHADAD13, SAGL06] and elsewhere [BCG+07]. These approaches, to the extent possible, are good candidates for extending my work and expanding future studies. Other examples include efforts to diagnose problems based on performance anomalies in a distributed system [KTGN10, SZDR+11].
Better and more comprehen-sive tracing can only improve these efforts by providing more opportunitiesto refine detection heuristics and richer information to consider at each tracepoint.As I will discuss in the following chapter (Chapter 4), one way in whichmy workload traces differ from most is that the trace of UBC workloadincludes information typically associated with higher-level processes (e.g.,originating file name), but collected below the page cache where the under-lying storage system extends most of its effort. This approach mirrors effortsin the distributed systems community to understand request flows throughlarge complex systems [BDIM04, LGW+08, AMW+03, CAK+04, GAM+07,CCZ07, KJ08] by tracking requests. Similar efforts have been made in thestorage community [BBU+09], but not within isolated storage stacks, andnot for the purposes of detailed trace analysis.A different approach to understanding system behaviour is analyzingrequest logs, where researchers have made efforts to draw deeper knowl-edge from logs and autonomously expand their content [YZP+11, YMX+10,JHP+09]. This effort can be seen as a parallel approach, wherein the entirespace of the executing service can be mined for more information, but onethat usually focuses on anomalous behaviour. My approach has been specificto storage concerns, where my system tracing work has focused specificallyon the most common request flows. Further, my research on on-disk struc-403.5. Summarytures and data is not easily re-creatable through better event logs. Still,there may be benefit to applying both these methods in tandem for deeperglobal insights into system optimization.Ultimately, these efforts may be applied to further autonomous con-figuration [SAF07], performance [LCSZ04, BGU+09, HSY05], and work-load [YBG+10] analysis and correction. Future research may further reducethe still time-consuming human effort required to make sense of data.3.5 SummaryIn this chapter I have explored the reasons why tracing and analysis is widelyconsidered critical to understanding storage system behaviour. I have alsodescribed reasons why performing such studies for the sake of research israre, including: the cost and difficulty of performance analysis, the chal-lenges of operating at scale, and the many considerations that must besatisfied in order to successfully gather the correct data. Finally, I have ex-plored work that relates to my contributions in this thesis. In that context,I have described some of the significant benefits to tracing and analysis interms of their ability to address storage system complexity as arises fromworkloads of increasing intensity and increasingly complex feature sets. Inthe next chapter I will discuss the implementation of the tools I have devel-oped to perform tracing and analysis, and the ways in which they facilitateand advance the state of storage system studies. Subsequent chapters dis-cuss the specific case studies that demonstrate the benefits of my tracingand analysis framework.41Chapter 4Data Collection and AnalysisThis thesis addresses better understanding of storage system behaviour as aremedy for growing system complexity. One of the challenges organizationsface in deploying storage systems is matching the sophisticated array offeatures and system parameters to the workload patterns they observe. Alarge part of this problem is that in many cases users lack an accuratecharacterization of their workloads.Today there a number of approaches to this problem that are used inpractice. 
For example, an IT department that is concerned with data ca-pacity management can, at some significant investment of time, deploy acommercial deduplication system on a trial basis and measure the potentialbenefit. This may lead them to understand how that one commercial sys-tem might benefit their environment. Similarly a file system designer whois concerned about storage system performance might benchmark againstsome (usually synthetic) workload. These are useful practices that will nodoubt continue to benefit a broad range of enterprise storage environments,particularly those that have a specific concern in mind.Unfortunately, there are few off-the-shelf tools that help an enterprise orsoftware author to understand, in a general sense, the challenges an orga-42Chapter 4. Data Collection and Analysisnization has around storing and processing data efficiently. They may leadthem to be unaware that their workload includes intrinsically inefficient op-erations, or that simple methods of data capacity reduction they alreadyhave available would suffice. As I have argued, deep storage stacks, largedata sizes, and I/O intensive workloads obscure what storage systems aredoing, and analyzing these aspects of any environment is currently difficultto do in part because of the lack of effective tools for measurement. Thisshortcoming afflicts researchers and designers as well, who must often investin building analysis tools before they can retrieve data, and must convinceadministrators that the custom tools they have developed are safe.In this chapter I discuss the tools and methodology that I created andused to perform the data collection that I have described in the prior chap-ter, and the case studies and analysis in the chapters to follow. The toolsdetailed here are all novel, and were designed to allow an organization orindividual to more easily collect and analyze a large and rich trace of storagestack behaviour or file system content. In some cases, such as my investiga-tion of deduplication rates, these tools gather specific information to answera focused (though common and important) set of questions. In others, thetools gather general purpose data that can be used to characterize file sys-tems and their workloads. This chapter is organized into three primarysections, which each describes the architecture of a component of softwarethat was developed for this thesis.434.1. File System Content Scanner• Section 4.1 - The File System Content Scanner, which I devel-oped to collect file and file system metadata and disk content for the2009 study of file systems and their contents at Microsoft.• Section 4.2 - The Workload Tracer, which I developed to collectmulti-level workload traces at UBC in 2010.• Section 4.3 - Data Analysis, which is the system I developed toanalyze and process results from the studies above.For each tool, I describe its design and implementation, the data modelit subscribes to, and as an example, the methodology I used in the casestudies where they have been employed. 
This set of tools is designed to besufficiently straightforward that it can be deployed by system administratorsto better understand their own environments, but just as readily useful toresearchers looking to perform studies of live systems.4.1 File System Content ScannerThe file system scanner is designed to completely scan the file system of atypical desktop workstation with minimal impact to the user of that system.Its measurement capabilities include typical file and file system metadata,and also information about the degree of duplicated content within files.This is challenging because: file system I/O is relatively expensive andsome file systems are quite large, the hashing required to analyze duplica-tion rates is CPU intensive, and different approaches to analyzing duplicatecontent each require different hash functions on different file system APIs.444.1. File System Content ScannerThis section describes the design of the scanner, and the methodology I putinto practice to gather real data from live systems, as it pertains to:• Data content for duplicate content analysis, as will be discussed inChapter 5.• Data fragmentation, as will be discussed in Chapter 6.• File system metadata, as will be discussed in Chapter 7.4.1.1 Data ModelThe File System Content Scanner records metadata about the file system it-self, including age, capacity, and utilization. The scanner reads and recordsmetadata and content records for each file to a log. It reads Windowsfile metadata [Cor10a], including path, file name and extension, and timestamps. It records any retrieval and allocation pointers, which describe frag-mentation and sparseness respectively. It also records information about thewhole system, including the computer’s hardware information and the timeat which the defragmentation tool was last run. It takes care to exclude thepagefile, hibernation file, the scanner itself from its records. A complete listof the data recorded, and the resulting file format, is shown in Appendix B.In addition to file metadata, my scanner processes file data for thepurpose of analyzing duplication, including sub-file granularity duplicationacross a pool of file systems. The scanner does this by breaking the data ineach file into chunks using each of two chunking algorithms (fixed block andRabin fingerprinting [Rab81]) each with 4 chunk size settings (8KB- 64KBin powers of two) and then computes the hashes of each chunk for each of the454.1. File System Content Scanner8 possible parameters. One can find whole file duplicates in post-processingby identifying files in which all chunks match. In addition to reading theordinary contents of files, the scanner also collects a separate set of recordsusing the Win32 BackupRead API [Cor13b], which includes metadata aboutthe file and is more appropriate to store file system backups. My scanneralso considers each chunking parameter for this backup read interface, so atotal of 16 parameters is considered for each file.4.1.2 Scanner ArchitectureThe architecture of the file system scanner is shown in Figure 4.1. Its designis based on the principle that each file should be read just once, and thatdata content hashes can exploit multi-processor parallelism. 
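To make the parameter space described in Section 4.1.1 concrete, the sketch below shows fixed-block chunking, a simplified content-defined chunker in the spirit of Rabin fingerprinting, and how the 8KB-64KB settings multiply out into per-file hash records. This is an illustration only, not the scanner's implementation: it uses a toy rolling hash rather than a true Rabin fingerprint, the salt value is assumed, and the scanner additionally repeats these settings over the BackupRead stream for 16 combinations in total.

```python
import hashlib

def fixed_chunks(data, size=64 * 1024):
    """Fixed-block chunking: cut every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_defined_chunks(data, target=64 * 1024,
                           min_size=4 * 1024, max_size=128 * 1024):
    """Content-defined chunking: declare a boundary when the low-order bits
    of a rolling hash hit zero (a toy stand-in for a Rabin fingerprint)."""
    mask = target - 1                     # target must be a power of two
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling = ((rolling * 31) + byte) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= min_size and (rolling & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def per_file_records(data, salt=b"assumed-salt"):
    """Salted MD5 of every chunk, for each (algorithm, target size) setting."""
    records = {}
    for target_kb in (8, 16, 32, 64):
        for name, chunker in (("fixed", fixed_chunks),
                              ("rabin", content_defined_chunks)):
            chunks = chunker(data, target_kb * 1024)
            records[(name, target_kb)] = [hashlib.md5(salt + c).hexdigest()
                                          for c in chunks]
    return records
```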
The scanner is divided into a File Analyzer, which reads file data and metadata, a number of Processing Threads that create and write file hashes and file metadata according to their unique parameters, and a Windowing Manager that ensures that the file data is buffered appropriately for each Processing Thread.

Ultimately, this design is disk bound for most file systems, because it reads files in metadata-order based on the NTFS Master File Table (MFT). It could be further optimized by consuming the entire MFT and sorting retrieval pointers, and then reading data in linear disk-order, associating blocks with files as it proceeds. However, such an approach is more challenging, particularly in that prior to the creation of this tool there was no reliable data as to the extent to which files would be linear on disk.

[Figure 4.1: File System Content Scanner architecture. The diagram is not reproduced here; it shows the File Analyzer performing a metadata-ordered traversal into the read and BackupRead buffers, the Windowing Manager coordinating the per-parameter processing threads (e.g., Fixed 64KB, Rabin 16KB) that write the compressed scanner logs, and the Data Retrieval process, which runs on a randomly selected day and delivers logs to the collection server.]

File Analyzer

The scanner includes one File Analyzer that traverses the file system reading files and directories into two buffers, one for the standard read API and one for the BackupRead API [Cor13b]. I assumed that since files are always read together, the two system calls can be serviced from the same Operating System cache. The File Analyzer is free to write to these buffers until each is full, and many small files can fit in each buffer.

Processing Threads

By default the scanner creates 16 Processing Threads which are each assigned a different chunking configuration to explore (8, 16, 32, or 64KB chunk size) x (fixed or Rabin) x (read API or BackupRead API). These threads each consume the appropriate buffer (read or BackupRead) and write results independently into a distinct file. Every Processing Thread uses a salted MD5 [Riv92] as the common hash algorithm. When a thread reaches the end of a buffer, it notes the end of the last chunk it completely read, passes that value to the Windowing Manager, and waits to be signalled that its buffer has been refilled.

Windowing Manager

The Windowing Manager waits for all Processing Threads to reach the end of their full buffers. At this point it is the only active thread in the scanner, and a new processing round begins. It takes the minimum progress any thread has made on the last file in the trace and copies the remaining data to the front of the buffer. It then notifies the File Analyzer to continue from where it left off and signals all Processing Threads to wake and continue. This ensures that every thread makes the most progress possible within the buffers, but that the buffers always contain all the data necessary to serve each Processing Thread.

Initialization and Completion

At the start of a scan, the scanner first takes a consistent snapshot of every fixed device (non-removable) file system with the Volume Shadow Copy Service (VSS) [Cor10b]. VSS snapshots are both file system and application consistent. Application consistent snapshots are obtained by hooks that allow any VSS-aware application to save its state cleanly before the snapshot is taken. This allows the scanner to operate on an unchanging
It then creates all necessary threads, and files, andcollects and writes all file system metadata (e.g., total volume size, machinespecification, etc.) to every log.Scanning of the VSS snapshot proceeds on a predetermined time of theweek each week (Thursday at 11pm by default). At the end of each scan, thescanner writes a terminal to each log so I can know that the scan completedfully, and then executes a zip-based compression routine on all log files. Logsare named according to the date so that logs from multiple weeks can beretained without naming conflicts if the Data Retrieval process fails to copya file fully.4.1.3 Data RetrievalAt midnight on a night of the week chosen randomly by the installer, thedata retrieval process, shown in Figure 4.1 copies the compressed log filesto a predefined location. In the case of the 2009 study, this was a fileserver on the corporate network, and the Data Retrieval process was givenappropriate ACLs to create files on the network share, subject to a quota,but not delete or read other files. Scattering collection throughout the weekhelps smooth the considerable network traffic that is required for a largestudy. Nevertheless, in a large dataset such as mine, the copying processcan result in the loss of some of the scans. Further, because the scannerplaces the results for each of the 16 parameter settings into separate filesand the copying process works at the file level, it is possible to collect resultsfor some, but not all of the Processing Threads.As part of retrieval, the process also checks for the existence of a specific494.1. File System Content Scannerkill-bits file on the server named “endofstudy”, which indicates to the scannerthat the study is complete. When the data retrieval process locates this file,it uninstalls itself entirely. This allows a study author to remotely end thestudy without user cooperation. Naturally this assumes that users are notadversarial and cannot create a kill-bits file themselves.4.1.4 Data Collection MethodologyIn my 2009 on-disk data study at Microsoft, the file systems under studywere selected randomly from Microsoft employees by seeding a pseudo ran-dom number generator and producing a list of numbers corresponding toentries in the employee address book. I limited myself to employees locatedin Redmond, Washington, USA. Each was contacted with an offer to installa file system scanner on their desktop work computer(s) in exchange for achance to win a weekend retreat for 2 at a nearby resort. In this email I ex-plicitly asked users not to install the scanner on any devices that at any pointmight not be directly connected to the campus network (e.g., laptops thatare taken home) because I could not ensure that such a device could commu-nicate results back to the data retrieval servers reliably. I contacted 10,500people in this manner to reach the target study size of about 1000 users.This represents a participation rate of roughly 10%, which is smaller than therates of 22% in similar prior studies [ABDL07, DB99] at Microsoft. Anecdo-tally, many potential contributors declined explicitly because the scanningprocess was quite invasive. The scanner ran autonomously in the backgroundstarting at 11PM every Thursday between September 18 and October 16,2009.504.1. File System Content Scanner4.1.5 Study Biases and Sources of ErrorThe use of Windows workstations in this study is beneficial in that the resultscan be compared to those of similar studies [ABDL07, DB99]. 
However, as in all datasets, this choice may introduce biases towards certain types of activities or data. For example, corporate policies surrounding the use of external software and libraries could have impacted my results.

As discussed above, the data retrieved from machines under observation was large and expensive to generate and so resulted in network timeouts at the data retrieval server or aborted scans on the client side. While I took measures, such as transfers during off-work hours and on random days, to limit these effects, nevertheless it was inevitable that some amount of data never made it to the server, and more had to be discarded as incomplete records. It is likely that, in particular, larger scan files tended to be partially copied more frequently than smaller ones, which may result in a bias in my data where larger file systems are more likely to be excluded. Similarly, scans with a smaller chunk size parameter resulted in larger size scan files and so were lost at a higher rate. In addition, my use of VSS makes it possible for a user to selectively remove some portions of their file system from my study. Among all users, two asked directly how to remove some portions of their file system, and I provided them with that information. I was able to identify one file system where this was apparently the case, as the file system-level space utilization did not match the per-file/directory utilization. This file system was included in the results.

In order to keep file sizes as small as possible, the scanner truncated the result of every hash of data contents and file names to 48 bits. This reduced the size of the dataset significantly, while introducing a manageably small error factor. For reference, the Rabin-chunked data with an 8KB target chunk size had the largest number of unique hashes, somewhat more than 768MB. I expect that about two thousand of those (0.0003%) are false matches due to the truncated hash.

In addition, during my analysis I discovered a rare concurrency bug in the scanning tool affecting 0.003% of files, in which identical files would be included in the scans multiple times. Although this likely did not affect results, I removed the few files with this artifact.

My scanner is unable to read the contents of Windows system restore points, though it could see the file metadata. I excluded these files from the deduplication analyses, but included them in the metadata analyses.

Finally, due to Microsoft corporate policy, I was unable to extend the prize offer to temporary employees, contract employees or business guests, as only full-time employees are eligible to win prizes as part of participation in internal Microsoft studies of this sort. I opted to make this limitation clear in the invitation to contribute, but to still allow anyone in the address book to opt-in and include the results of scans of their machines. As a result, these employees may have been less likely to contribute to my study. In one case, I am aware that a manager disallowed their 50 reports from contributing, and as a result I revoked their ability to opt-in and eliminated any of their contributions from the study.

4.1.6 Discussion

While file system studies always present challenges, the body of work in my thesis shows that it is tractable for an organization, and that the resulting data is valuable to administrators, researchers, and designers alike.
Acrossnearly 1000 desktop workstations, the results of the entire study is just over4TB in size (compressed), which can easily fit on two commodity hard drives.Further, as I will show the workload is tractable to analyze even withoutscale-out resources. The metadata portion of the collected data is extremelyvaluable in understanding system evolution and is smaller still.Further, while the disk content analysis of this data (discussed in Sec-tion 4.3 of this chapter, as well as Chapter 5) was time consuming, much ofthe cost was associated with considering many sizes of machine clusters andparameters. Further, much of the processing required machine, as opposedto human time. Finally, the analysis in this paper was performed on singleworkstations, whereas scale-out approaches could yield much faster answers,potentially even allowing interactive queries.There are periods of downtime in many enterprise environments [NDT+08]that make this type of scanning convenient as a scheduled process run dur-ing off-hours, like defragmentation is now. Deployment could be furthersimplified by installing a scanner along with other programs pre-installedby IT on all workstations. Annually, or according to some other schedule,results could be gathered from a sampling of available machines.534.2. Workload Tracer4.2 Workload TracerIn addition to my File System Content Scanner, The Workload Tracer isdesigned to capture high fidelity traces of the file system and block levelactivity through the stack of a single Windows workstation. It installs as apackage of drivers that instrument different layers of the storage stack. Thissection describes the design of the tracer, and the methodology I put intopractice to gather real data from live systems, as it pertains to:• Workload characteristics, as will be discussed in Chapter 6.• File system access patterns, as will be discussed in Chapter 7An accompanying Linux-based trace replayer was written by a colleagueand was used in my evaluation of Capo [SMW+10], which can take file-leveltraces from this tool and replay them onto a file system for the purpose ofsystem evaluation.4.2.1 Data ModelThe workload tracer gathers metadata about each file system and block-level request. With respect to file system requests, it gathers the size and fileoffset of the request, and the flags that are issued, which includes informationabout whether the request can be serviced by the cache or not. It is alsocapable of richer tracing facilities, including the file name modified as partof the request, and the name of the application that issued the request. Atthe block level it records absolute disk offset and request size, flags, and inmost cases the originating file name. The latest version of the tracer also544.2. Workload Tracerincludes a hash of the data content itself, for use in analyzing the potentialof features like I/O Deduplication [KR10], though these results were notavailable at the time the system was used to measure live systems in thecase study presented in this thesis.4.2.2 ArchitectureThe workload tracer is implemented as a paired upper-level filter driver(for block-level requests) with a file system minifilter driver (for file system-level requests). However the two are not strictly isolated. The resultingarchitecture is shown in Figure 4.2. Requests at the file level are recorded,regardless of their availability in the cache. 
Cache misses will be reissuedfrom the Windows cache manager and caught in the minifilter driver, witha flag set to read-through the cache, and then eventually seen again by theupper filter driver. Writes to the cache are similarly issued through the filtertwice, once to the cache, and again to write-back or -through, depending onthe operation.The upper-level driver also records the name of the process currentlyissuing the request. Most, but not all, application requests can be correctlytagged in this manner with the calling application. Lower-level requests arenot, but the association between the two levels can usually be re-established.Taking NamesFile system requests do not generally include context to directly record filenames, so I had to add them. I appended a context pointer to the FileObjectstructure in Windows and populated this pointer by interposing on all file554.2. Workload TracerWindowsInternalsCacheManagerWPP-based TracingNTFSFS-level DriverBlock-level Driver Block LayerTime, !le o"set/size,#ags, path,process namepath, process nameTime, disk o"set/sizeStored contextCached op.OriginalFileObjectUncached op.Request PathFigure 4.2: Workload Tracer architecture.creation/open operations, when a complete file name must be present. Thismethod consumes extra memory for the tracer, but avoids potential diskI/O on the read and write paths to look up names. In most cases, the filepointer and associated name can still be retrieved by the lower-level driverthrough the use of an undocumented OriginalFileObject pointer in theWindows I/O Request Packet (IRP) structure. In the 2010 study at UBC,the tracer system was able to make this association successfully in 93% ofall trace events. When the file name can’t be retrieved safely, the tracer setsthe file name to <UNKNOWN>.564.2. Workload Tracer4.2.3 Data RetrievalLogs are written as they are collected using the Microsoft Windows SoftwareTrace Preprocessor (WPP). This framework requires a separate process tocollect records in real-time, and my tracer installation package includes sucha tool, which can be configured to write requests to a file mounted on aremote network share.Uncompressed and on average, each entry in my dataset consumed lessthan 60 bytes per request, though the actual space consumed varies depend-ing on the average file path length and the ratio of requests that are servicedby the file and block level of the stack. 4 This is significantly larger thanwould be possible with a non-rich trace that excluded, say the file that eachblock request targeted. However, this path information is typically similarbetween requests which makes it easy to compress, and gzip obtains closeto 10x compression over the traces I have gathered.4.2.4 MethodologyMy 2010 study of VDI workstation activity is drawn from a VMware Viewsinstallation at the University of British Columbia. In this environment, endusers work from Dell FX100 Zero thin clients, while VMs are served from HPBL490c G6 Blades running ESX Server. These servers connect to a NetworkAppliance 3170s over fiber channel, for booting from the SAN, and 10GigE,for VM disk images. System images are hosted via NFS on a 14 driveRAID group with 2 parity disks. The operating systems and applications4Requests that pass through to the block-level unmerged naturally result in more logentries.574.2. 
Workload Tracerare optimized for the virtual environment [Sch10] and are pre-loaded withthe Firefox web browser, Microsoft Office Enterprise, and Sophos Anti-Virusamong other software.Installation of the tracer was significantly aided by the VDI infrastruc-ture itself. In this case the tool was installed only once – directly into thegold master file system image of all systems during the Thursday refresh,which I will describe in more depth in Section 6.1.1. Data was collected tothe same centralized storage that served the VDI disk images, which alreadynecessitates a highly reliable high throughput data channel. In this envi-ronment, the value proposition to the administrative staff supporting thesemachines was bolstered by the simple installation, and the insight these toolswere able to provide into their otherwise opaque workloads.Logs from the tracer framework were written to a CIFS network shareand collected on the Thursday following a full week of logging. In total Icollected 75GB of logs in a compressed binary format. I then checked forcorruption, missing logs, or missing events. Out of over 300 million entriesI found a single anomalous write to a clearly invalid block address, which Iremoved. I could find no explanation for the event.As with the file system scanner, any choice of installation environmentnecessarily implies some lack of generality, and it is well known that work-loads differ in different environments. A fuller comparison of my study toothers was provided in Chapter 3.584.2. Workload Tracer4.2.5 DiscussionTracing in this manner does introduce a small performance overhead. Acrossmy synthetic workloads, which I present in Section 7.2, the cost was less than3% in every case.In a live deployment of dozens of virtual machines or more, the storagecosts necessary (half a GB per machine per day) are not so large that ahistory of weeks or months cannot be retained. Old logs could be simplifiedto a basic statistical model for further savings [TKM+12]. This would enablecontinuous monitoring of per machine performance, as is already availablethrough VMware View’s management interface, but one that is extendedto support detailed tracing, such that unnecessary I/O consuming processescan be identified and eliminated, and workload inefficiencies can mitigated,as I will describe in Chapter 6.Storage logs of this type may well have other uses, including virus scan-ning. Virus scanning can be challenging to perform within virtual machinesthemselves, because the density of VM environments may make it morelikely that a given VM is turned off during a routine scanning schedule, andthe risk that “storms” of virtual machines scanning their file systems at thesame time will overwhelm shared resources. In contrast, stored access logscould provide an opportunity for out-of-band analysis of file system accessbehaviour.594.3. Data Analysis4.3 Data AnalysisWhether the subject of a study is tracing, on-disk analysis, or both, in-vestigators often general many compressed flat log files, containing manygigabytes or terabytes of data. I have found that transforming this datainto useful findings is best suited to an iterative process that favours quickresults and exploration. This section details a work-flow and the optimiza-tions that have made the processing of the datasets tractable.4.3.1 Interactive ProcessingMany data traces in the academic literature are performed as part of an-swering a specific predefined question, such as: “How would a new systemrespond to an existing workload?”[NDT+08]. 
For these purposes, ad-hocanalysis of flat text files may be an expedient way to arrive at an answer.However, when the subject of inquiry is more general, such as: “What arethe characteristics of this workload?” or “What opportunities exist to makethis workload more efficient?” one may spend considerably more time craft-ing queries than it would take to execute them.For this reason I have found it helpful to invest in translating the datainto a database supporting a structured query language. Having a higherlevel language written for the purpose of data processing helps streamlinequery creation, minimizes bugs, and allows the investigator to focus on ques-tions about the data instead of optimizing log traversal operations. All pro-cessing for my 2009 study was done with SQL queries against SQL Server,and my 2010 trace has been similarly imported into mySQL.604.3. Data AnalysisI primarily use SELECT WHERE clauses to isolate relationship in thedata. Some example queries written by a colleague are included in Ap-pendix A.4.3.2 Optimizing for Content HashesFor very large datasets such as these, even bulk database import poses asignificant challenge. At the completion of the 2009 file system contentstudy the resulting dataset was more than 16 TB of uncompressed data.This would would have required considerable machine time to import intoa database and considerable space to store.As a novel optimization to make this data tractable for an enterprise,I was able to significantly optimize the performance and capacity by hy-pothesizing that the bulk of the data would be in unique content hashes offile content. These hashes were critical to my analysis, in that I wanted todetermine how much data in files was unique versus the portion that wasnon-unique, but the actual value of any unique hash (i.e., hashes of contentthat was not duplicated) was not useful to my analyses. Further, it is un-likely that the absolute value of a unique hash would be interesting to anyuseful analysis of the data, since the hashes are essentially random bits.As an optimization, I was able to post process the data to eliminate thecosts of storing unique hashes. The novel algorithm that I present here isefficient, because it requires only two linear passes through the entire datasetto eliminate nearly every unique hash. I added this algorithm as a step inmy database import tool. During the first of the two passes over the data,my import tool created a pair of 2 GB in-memory Bloom filters [Blo70].614.3. Data AnalysisDuring this pass, the tool inserted each hash into the first bloom filter. If itdiscovered a value that was already in the Bloom filter, the value was addedinstead to the second Bloom filter. I then discarded the first Bloom filter.In the second pass through the logs, the import tool compared eachhash to the second Bloom filter only. If the hash value was not found in thesecond filter, I could be certain that the hash had been seen exactly once andcould be omitted from the database. If it was in the second filter, I couldconclude that either the hash value had been seen more than once, or thatits entry in the filter was a collision. I recorded all of these duplicated hashvalues to the database, and skipped over any hash seen just once. Thus myalgorithm is sound, in that it does not impact the results by rejecting anyduplicate hashes. However it is not complete, despite being very effective,in that some non-duplicate hashes may have been added to the databaseeven though they were not useful in the analysis. 
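A minimal sketch of this two-pass filtering idea is shown below. It is illustrative only: it uses a small, pure-Python Bloom filter rather than the 2 GB filters described above, and the hashing scheme and sizes are assumptions of mine.

```python
import hashlib

class BloomFilter:
    """A tiny, illustrative Bloom filter (not tuned for 2 GB scale)."""
    def __init__(self, bits=1 << 20, hashes=4):
        self.bits, self.hashes, self.array = bits, hashes, bytearray(bits // 8)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.hashes):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.bits

    def add(self, item):
        for p in self._positions(item):
            self.array[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.array[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def duplicated_hashes(pass_over_logs):
    """Two linear passes: keep only hashes seen more than once (plus the rare
    false positive from a Bloom filter collision, which is harmless here)."""
    seen, seen_twice = BloomFilter(), BloomFilter()
    for h in pass_over_logs():                 # first pass
        if h in seen:
            seen_twice.add(h)
        else:
            seen.add(h)
    del seen                                   # the first filter is discarded
    return [h for h in pass_over_logs() if h in seen_twice]   # second pass

# Example: the callable stands in for streaming the compressed scanner logs.
logs = ["aa", "bb", "aa", "cc", "bb", "dd"]
print(duplicated_hashes(lambda: iter(logs)))   # ['aa', 'bb', 'aa', 'bb']
```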
The inclusion of these hashes did not affect my results, as there was no later processing step that considered these hashes or depended on the invariant that hashes in the database were not unique.

4.3.3 Anonymization and Release

Ideally, traces and scans of file system workload and structure would be widely disseminated. This is the goal of the SNIA IOTTA Repository [Ass14a] and SNIA SSSI Workload I/O Capture Program [Ass14b] which offer public access to a variety of traces of varying quality and size. Unfortunately, the vast majority of these traces are very small. In most organizations there are several logistical barriers to releasing internal data, and many of these barriers are ill-suited to technological solutions. However, the most serious concern in the enterprise is generally that some proprietary information critical to business operations will leak with the data. I have successfully addressed this problem with an anonymization process, to the satisfaction of Microsoft Corporation, which retains a legal team that is known to be quite large. (5) In fact, some of the workstations in this publicly released dataset were located within the legal group itself.

(5) Microsoft Legal and Corporate Affairs has operations in 57 offices across 40 countries and regions [Cor14a].

To anonymize trace datasets, regardless of the trace, I am primarily concerned with leaking:

• Any computer, user, or volume name.
• Any file name or extension not in widespread use.
• Any directory name not in widespread use.
• The raw content of any file that is not trivially guessable.

Within these limitations it is important to allow as much analysis as possible. By "widespread use" and "trivially guessable" I only mean to exempt files such as ntfs.dll, which are installed as part of any Windows operating system installation.

While there are many cryptographically secure hashes that would be easily applied to these fields, such hashes have occasionally been reversed, and even that remote potential threat is sufficient to worry a decision maker whose only concern is security. To overcome this concern I added the additional security of replacing each unique salted MD5 hash in the source data with a random serial number permuted by a cryptographically secure hash. I gave different field types (e.g., file name, user name, data content) random values in a different range of serial numbers, such that comparing, for example, file name to data content could never result in meaningful matches. To make some select file extensions useful, I was able to reserve the option to release a mapping for select extensions from their encrypted format to plain text.

This transformation is made more challenging by large traces or content studies that include billions of hashes, because the look-up to determine the unique serial number applied to a particular hash cannot afford a disk seek or network RTT, and is too large to fit in the main memory of most single workstations. This makes the analysis poorly suited even for the database I have described above.

My algorithm for this transformation takes design lessons from map-reduce-style analyses, and avoids disk seeks by taking multiple linear passes over the logs using only in-memory structures.
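The sketch below illustrates the per-field-type serial numbering described above. It is an illustrative model only: the field ranges, the use of Python's secrets module, and the function names are my own assumptions, not the tooling used in the study, which replaced salted MD5 hashes using a cryptographically secure permutation.

```python
import secrets

# Illustrative pseudonymization: each distinct value of each field type is
# replaced by a random serial number drawn from a range reserved for that
# field type, so values of different types can never match meaningfully.
FIELD_RANGES = {          # assumed ranges, for illustration only
    "user":      (0x1000000, 0x1FFFFFF),
    "file_name": (0x2000000, 0x2FFFFFF),
    "content":   (0x3000000, 0x3FFFFFF),
}

assignments = {}          # (field_type, original_hash) -> serial number
used = set()

def pseudonym(field_type, original_hash):
    key = (field_type, original_hash)
    if key not in assignments:
        low, high = FIELD_RANGES[field_type]
        serial = secrets.randbelow(high - low + 1) + low
        while serial in used:                 # avoid accidental reuse
            serial = secrets.randbelow(high - low + 1) + low
        used.add(serial)
        assignments[key] = serial
    return assignments[key]

# The same input always maps to the same serial, so joins within a field
# type still work after anonymization, but cross-field comparisons do not:
assert pseudonym("file_name", "a1b2c3") == pseudonym("file_name", "a1b2c3")
assert pseudonym("file_name", "a1b2c3") != pseudonym("content", "a1b2c3")
```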
On the largest collection ofhashes in the disk content study, this task completes in under a month usingonly the background processing time of a single desktop workstation with16GB of RAM, an Intel i7 3.4GHz processor, and 2 commodity hard drives.First, non-unique hash entries for each domain were extracted from thescanner database and were merged, sorted, and reduced to their respectiveunique values, and each associated with a unique 59bit pseudo-randomlygenerated serial number. This required splitting the hashes into 21 pools,each of which could fit into main memory, and using 5 high order bits in theserial number to identify the pool. Then processing tools made 21 passes644.3. Data Analysisover the files, each time transforming only the hashes the appear in one of therespective pools. Although I elected to use a single machine for convenience,it is important to note that the fundamental algorithm not only makes gooduse of memory, but would be easy to horizontally scale across a cluster in aMap-Reduce or comparable framework.4.3.4 DiscussionData analysis is a critical component of a file system trace or study. It de-mands nearly as much consideration as the data collection itself. A centralobservation in this thesis is that the data in a study of file system behaviourtypically contains more numerous potential results that may have been ini-tially considered. As an investigator gains familiarity with their results,they will discover new things to investigate. A work-flow such as this bene-fits from investing in support for a rich query interface, and in my experiencean SQL database is one well suited approach.However, large studies that include data content measures, such as mine,have potential performance challenges in SQL databases. I have demon-strated two optimizations for such a dataset. First, I have detailed an al-gorithm for efficient removal of unique hashes to prune the dataset to atractable size. Second, I have shown how some useful transformations suchas anonymization benefit from map-reduce processing to operate in roundsthat avoid disk-seeks.654.4. Summary4.4 SummaryWith the tools I have presented in this chapter, it is possible to conduct alarge scale study of file system behaviour or content without an extensivebackground in programming or file systems. This enables an organization tocollect, retain, and even share large and extremely detailed traces of systemstate and behaviour. This framework is usable with a basic understandingof SQL and Windows system administration.The information these tools make available is not easily found elsewhere.I/O intensive workloads, large data capacities, and rich feature sets haverendered it difficult to define workstation workloads in simple terms. Thisresulting complexity obscures many potential optimizations and improve-ments, and hides waste. For organizations trying to better understand theirown environments and how to best service their workloads, my tracing andanalysis framework provides immediate benefit. Users can determine, forexample, the degree of data duplication in their environment, and associateindividual processes with I/O operations. 
4.3.4 Discussion

Data analysis is a critical component of a file system trace or study. It demands nearly as much consideration as the data collection itself. A central observation in this thesis is that the data in a study of file system behaviour typically contains many more potential results than may have been initially considered. As an investigator gains familiarity with their results, they will discover new things to investigate. A work-flow such as this benefits from investing in support for a rich query interface, and in my experience an SQL database is one well-suited approach.

However, large studies that include data content measures, such as mine, have potential performance challenges in SQL databases. I have demonstrated two optimizations for such a dataset. First, I have detailed an algorithm for efficient removal of unique hashes to prune the dataset to a tractable size. Second, I have shown how some useful transformations such as anonymization benefit from map-reduce-style processing that operates in rounds to avoid disk seeks.

4.4 Summary

With the tools I have presented in this chapter, it is possible to conduct a large scale study of file system behaviour or content without an extensive background in programming or file systems. This enables an organization to collect, retain, and even share large and extremely detailed traces of system state and behaviour. This framework is usable with a basic understanding of SQL and Windows system administration.

The information these tools make available is not easily found elsewhere. I/O intensive workloads, large data capacities, and rich feature sets have made it difficult to define workstation workloads in simple terms. This resulting complexity obscures many potential optimizations and improvements, and hides waste. For organizations trying to better understand their own environments and how to best service their workloads, my tracing and analysis framework provides immediate benefit. Users can determine, for example, the degree of data duplication in their environment, and associate individual processes with I/O operations. In the following three chapters, I will further detail the benefits of rich tracing and analysis in three case studies addressing capacity management, eliminating waste in workloads, and developing new features to enhance performance in existing systems.

Chapter 5

Capacity Management: Deduplication

Even with the significant declines in commodity storage device cost per GB, many organizations have seen dramatic increases in total storage system costs. There is considerable interest in reducing these costs, which has given rise to deduplication techniques, both in the academic community [CAVL09] and as commercial offerings [DDL+11, DGH+09, LEB+09, ZLP08]. However, there is no one widely accepted approach to deduplication that is considered better than others. In fact, as I will show, the parameter space is quite large and the ramifications to end users are significant in terms of performance and capacity savings. Unfortunately, users have little insight into the opportunities provided by different deduplication algorithms, or how those differences will influence the results on their own data.

Initially, the interest in deduplication centered on its use in "embarrassingly compressible" scenarios, such as regular full backups [BELL09, QD02] or virtual desktops [CAVL09, JM09]. In these cases, exact or near-exact copies of data are repeatedly created and retained. For such cases, there are many methods by which it is relatively easy to reach high levels of space savings, such as using snapshots to represent virtual machine images, or virtual-full backups which don't naively copy unmodified data. However, even though such alternatives are widely available, they are not widely explored by deduplication vendors or academic research, which frequently report deduplication rates as naive multipliers (e.g., a 20-times capacity savings) on storage efficiency without detailing how workload and dataset selection lead to such results. Despite the interest in deduplication and the range of applications and solutions, there exist few comparisons of the relative benefits of different deduplication approaches and workloads.

Though largely unexplored, the impact of different approaches to deduplication is measurable with tracing and analysis. This chapter is a case study demonstrating the value of file system analysis as a tool towards more intelligent management of disk capacity than is commonly done today. Specifically, I have sought to provide a well-founded measure of duplication rates and to compare the efficacy of different parameters and methods of deduplication. This contribution serves to better inform IT professionals as to where different choices in deduplication may yield acceptable results with less system complexity and performance overhead.

I also report on real-world disk capacity and utilization. I provide a dataset that can be used to guide the implementation of efficient storage systems, by focusing on the effectiveness of a solution's balance of real requirements, as opposed to focusing on the highest compression rate possible.

This chapter is divided into five sections:

• In Section 5.1 I provide a very brief deduplication overview, describing the techniques involved and discussing the relative overhead of various approaches to deduplication.

• In Section 5.2 I present a general characterization of the contemporary capacity and utilization of files and file systems, as seen through an analysis of my datasets.
The results in this section pertain both to file systems generally, and to deduplication specifically.

• In Section 5.3 I consider the potential for deduplication in primary storage against a point-in-time view of a file system or systems. I consider the advantage of different deduplication algorithms and their parameters. I also consider how file system data characteristics such as the size and number of systems under consideration, the file types contained, and the portion of sparse files provide less-complex alternatives to more aggressive deduplication.

• In Section 5.4 I consider the overall deduplicability of storage systems in a backup storage scenario, by analyzing deduplication rates across one month of my dataset. As before, I consider how different alternatives to the most costly deduplication algorithm provide comparable results.

• In Section 5.5 I summarize this chapter and discuss its implications.

5.1 Deduplication Overview

File systems often contain redundant copies of information: identical files or sub-file regions, possibly stored on a single host, on a shared storage cluster, or backed up to secondary storage. The more data that is included in a system, the more potential for such redundancy exists, and the more potential benefit in reducing data sizes.

Deduplicating storage systems eliminate this redundancy in order to reduce the underlying space needed to contain the file systems (or backup images thereof). Deduplication can work at either the sub-file [DGH+09, UAA+10] or whole-file [BCGD00] level. More fine-grained deduplication creates more opportunities for space savings, but necessarily reduces the sequential layout of some files, which may have significant performance impacts and in some cases necessitates complicated techniques to improve performance [ZLP08]. Alternatively, whole-file deduplication is simpler and eliminates file fragmentation concerns, though at the cost of some otherwise reclaimable storage.

Deduplication systems function by identifying distinct chunks of data with identical content. They then store a single copy of the chunk along with metadata about how to reconstruct the original files from the chunks. Chunks may be of a predefined size and alignment, but are more commonly of variable size determined by the content itself.

The canonical algorithm for variable-sized, content-defined blocks is Rabin fingerprinting [Rab81]. Briefly, it uses an efficient hash of a sliding window over the data, and declares chunk boundaries when the low order bits of the hash value are equal to some sentinel. By deciding chunk boundaries based on content, files that contain identical content that is shifted (say, because of insertions or deletions) will still result in (some) identical chunks. Rabin-based algorithms are typically configured with a minimum (4KB in my datasets) and maximum (128KB) chunk size, as well as an expected chunk size, which is determined by the number of low order bits upon which a boundary is declared.

There are essentially two possible choices of sentinel value. Many proprietary deduplication implementations use a pre-selected random string upon which to declare a boundary. Alternatively, some intentionally select the zero string, because small sequences of zeros that appear in a file will hash to this zero sentinel, which will result in a boundary as soon as the minimum chunk size is reached. In the analysis in this chapter I have opted for the latter approach for two reasons. First, it provides the most information about the prevalence of zeros in the dataset. Second, because zeros are relatively common, this will provide the most possible advantage to the Rabin-based deduplication, relative to whole-file deduplication. Since one of my goals was to investigate whether whole-file deduplication could reach the performance of Rabin-based methods, this is the conservative choice. This may somewhat impair the ability to compare my deduplication results directly to some other implementations; however, such a comparison would be difficult anyway, because many commercial implementations already use complicated heuristics to favour performance over compression [ZLP08, KDLT04].
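As a rough illustration of the content-defined chunking just described, the following sketch uses a simple polynomial rolling hash in place of a true Rabin fingerprint. The 4KB minimum, 128KB maximum, and roughly 8KB expected chunk size mirror the parameters in the text; the window size, constants, and function names are assumptions of mine rather than the study's implementation.

```python
WINDOW = 48                  # bytes in the sliding window
MIN_CHUNK = 4 * 1024
MAX_CHUNK = 128 * 1024
MASK_BITS = 13               # expected chunk size of roughly 8KB (2**13)
MASK = (1 << MASK_BITS) - 1
PRIME = 1000003
SENTINEL = 0                 # boundary when the low-order bits equal the sentinel

def chunk_boundaries(data: bytes):
    """Yield (start, end) offsets of content-defined chunks of `data`."""
    start = 0
    rolling = 0
    window = []                      # bytes currently inside the sliding window
    pow_out = PRIME ** (WINDOW - 1)  # weight of the byte leaving the window
    for i, b in enumerate(data):
        if len(window) == WINDOW:
            rolling -= window.pop(0) * pow_out   # drop the oldest byte
        window.append(b)
        rolling = rolling * PRIME + b            # add the newest byte
        length = i - start + 1
        at_boundary = (rolling & MASK) == SENTINEL
        if (length >= MIN_CHUNK and at_boundary) or length >= MAX_CHUNK:
            yield start, i + 1
            start, rolling, window = i + 1, 0, []
    if start < len(data):
        yield start, len(data)       # final partial chunk
```

Hashing each emitted chunk (for example, with the salted MD5 used by the scanner) and counting unique versus total chunks is then sufficient to reproduce the kinds of measurements reported in the rest of this chapter.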
Managing the overheads introduced by a deduplication system is challenging. Naively, each chunk's fingerprint needs to be compared to that of all other chunks. While techniques such as caches and Bloom filters can mitigate overheads, the performance of deduplication systems remains a topic of research interest [KU10]. The I/O system also poses a performance challenge. In addition to the layer of indirection required by deduplication, deduplication has the effect of de-linearizing data placement, which is at odds with many data placement optimizations, particularly on hard-disk based storage, where the cost of non-sequential access can be orders of magnitude greater than that of sequential access. Other, more established techniques to reduce storage consumption are simpler and have smaller performance impact. Sparse file support exists in many file systems, including NTFS [Cor10a] and XFS [WA93], and is relatively simple to implement. In a sparse file, a chunk of zeros is stored notationally by marking its existence in the file metadata, removing the need to physically store it. Whole file deduplication systems, such as the Windows SIS facility [BCGD00], operate by finding entire files that are duplicates and replacing them with copy-on-write links. Although SIS does not reduce storage consumption as much as a modern deduplication system, it avoids file allocation concerns and is far less computationally expensive than more exhaustive deduplication.

Because the magnetic disk technology trend is toward reduced per-byte cost with little or no improvement in random access speed, it is not clear that trading away sequentiality for space savings makes sense, at least in primary storage. In the next section I will discuss the current state of storage system capacity and utilization as seen through my datasets. This will better contextualize my analysis of deduplication effectiveness in the section that follows.

5.2 The Capacity and Utilization of File Systems

This section provides a general analysis of file system utilization and capacity, including the amount of space available and consumed. Features such as compression and deduplication depend on an understanding of common data sizes and the typical pressure they apply to disk capacity, and they benefit from specific knowledge of where and how storage is consumed, as I will show. In addition, storage system designs benefit from an understanding of realistic data sizes and consumption. Therefore, this section is intended to stand alone as an independent contribution, in addition to providing specific context for my deduplication analysis.

5.2.1 Raw Capacity

Figure 5.1 shows a cumulative distribution function of the capacities of all file systems in my 2009 study.
On the same graph I have plotted results from the similar metadata-only studies performed in 2000 and 2004, which are collectively the only large scale published studies of the size of deployed file systems.

Figure 5.1: CDF of file systems by file system capacity.

One can see a significant increase in the range of commonly observed file system sizes and the emergence of a noticeable step function in the capacities. Both of these trends follow from the approximately annual doubling of physical drive capacity. I expect that this file system capacity range will continue to increase, anchored by smaller solid state disks and user-created partitions dedicated to special tasks on the left. This range will continue stepwise towards larger shingled magnetic devices [LSAH11] on the right, which will either force file systems to perform acceptably on an increasingly wide range of media, or push users towards more highly tuned special purpose file systems. The mean file system capacity in 2009 was 194GB, and the median was 149GB, as compared to 32GB in 2004 and 4GB in 2000. This broadening also suggests that research into large scale file systems [KLZ+07, McK03] will have increasing relevance in modern enterprise workstations. This specifically impacts deduplication as well, because deduplication rates increase with the size of the dataset.

5.2.2 File Sizes and Space Consumed

File counts and sizes impact how data for files is best allocated and structured on disk, and can also provide hints at future access patterns. My results show that the overall distribution of file sizes is largely unchanged. Most notably, the median file size in my data set is 4KB, which is the same as in 2004 and 2000, and also true in every other study of file systems I am aware of, going back to at least 1981 [Sat81].

Figure 5.2 shows the histogram of the occurrences of files of different sizes in 2009, 2004, and 2000. In my 2009 study there is a decrease in the relative occurrence of files between 32B and 256B, and also between 8KB and 64KB. Correspondingly, files between 1KB and 4KB are more common, as are files larger than 512KB.

Figure 5.2: Histogram of files by size.

Increasingly, the distribution of file sizes shows a positive skew. Prior work on modelling realistic file sizes utilized a log-normal distribution [AADAD09]. These distribution models were used for accurate benchmarking, and in 2004 showed very low error rates. The results from 2009 suggest that they should remain relevant, provided that a skew term is added and they are updated with current data. I anticipate that future studies will see this trend towards skewed normal distributions in file sizes exacerbated.

Although the distribution of file sizes shows relative similarity across years, users do have many more files in their file systems. Next, I consider how the growth in the number of files leads to more dramatic changes in the consumption of space in today's file systems. Figure 5.3 shows a CDF of the total bytes across all files in my study versus the size of the file that contains that data. A smooth trend can be observed, as each year the larger portion of data in very large files draws more of the relative consumption compared to smaller files. In 2009, half the data in all file systems is in files smaller than 32MB, versus 8MB in 2004 and 2MB in the year 2000.

Figure 5.3: CDF of total bytes consumed by containing file size.
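Statements such as "half the data is in files smaller than 32MB" are read directly off this CDF. The sketch below shows one way to compute it from a list of file sizes; the function names and input format are mine and are not part of the study's tooling.

```python
def bytes_by_containing_size_cdf(file_sizes):
    """CDF of total bytes versus the size of the containing file
    (the metric plotted in Figure 5.3). Input: file sizes in bytes."""
    sizes = sorted(file_sizes)
    if not sizes:
        return []
    total = float(sum(sizes))
    cdf, running = [], 0
    for size in sizes:
        running += size
        cdf.append((size, running / total))
    return cdf

def half_of_data_point(file_sizes):
    """Smallest containing-file size at which the CDF reaches one half."""
    for size, fraction in bytes_by_containing_size_cdf(file_sizes):
        if fraction >= 0.5:
            return size
```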
A second aspect of the trend towards file system consumption being attributable to very large files can be seen in better detail in Figure 5.4, which plots the histogram of bytes by containing file size. Here, a trend predicted by Agrawal et al. in 2007 [ABDL07], that a change towards a bi-modal file size distribution was coming, is well validated by my results. Indeed, it seems that bi-modality has continued and is likely to become more pronounced. Further, a third mode above 16GB is now appearing.

Figure 5.4: Histogram of total bytes consumed by containing file size.

The figure shows that more capacity usage has shifted to the larger files. This is apparent even though there are still few such files in the system, because these large files are so massive – note that Figure 5.4 is logarithmic in the X-axis. This suggests that optimizing for the capacity consumption of large files (if not their workloads) will be increasingly important.

To better place these findings in context, I next consider the types of files in the 2009 study, as shown by the file extension. Figure 5.5 shows the total bytes consumed by files in my study versus the file extension of the files containing those bytes. The ten largest (measured in bytes) file extensions are shown for the years 2009, 2004, and 2000.

Figure 5.5: Total bytes consumed versus file extension.

Several changes from prior years are apparent in Figure 5.5. Overall, the ten most capacity-hungry file extensions consume more than 50% of the overall bytes consumed by all files, reversing the trend in prior years towards heterogeneity in file system types, as expressed by bytes. As I will show in Section 7.1.2, the opposite trend can be seen when analyzing file extensions in terms of the number of files. The portion of storage space consumed by the top extensions has increased by nearly 15% from previous years. Even more dramatically, files with the null extension have moved from the 10th largest consumer of storage to the first. The null extension denotes a file for which no extension is present, as is typical for application-created files that a user is not meant to interact with directly.[6] These files occupy over 10% of the total space consumed in this study. Replacing the null extension at number 10 are .iso files, which are (usually) large image files that are copies of optical media (CD-ROM and DVD). A similar format, VHD, is a Microsoft image format used for disk drives, usually for virtual machine images. On Windows PCs, this format is common among users of Virtual PC software, a hardware virtualization product. Although VHD has fallen behind .lib (library files) and .dll (dynamically linked library files), owing to the large number of developers in my study, its absolute and relative contribution is higher than in previous years.

[6] For privacy reasons, I agreed not to decode full file names and paths in this dataset. My inferences about the null file extension are drawn from an ad hoc inspection of several file systems under my own direct control.

These image files, and potentially the null extension files, tend to be very large.
To understand the size of these consumers of storage, Figure 5.6 shows a histogram of bytes by containing file size, similar to Figure 5.4, but with separate plots for .vhd, .lib, .dll, and the null extension. As expected, .vhd files tend to be the largest, with a distribution centering around 16GB. Library files are relatively smaller, with distributions primarily between 1MB and 16MB. Less intuitively, the less predictable null extension files tend to be quite large, centering around 3GB. Since null extension files are created by a range of applications (and sometimes users), it is impossible to characterize them simply. However, this result makes clear that the majority of the bytes consumed by files with the null extension are large data repositories, possibly database files.

Figure 5.6: Histogram of bytes by containing file extension. Several extensions shown.

These opaque image files can be particularly challenging because even though they appear to the file system as single objects, they are not. Image file formats like VHDs and database files have complex internal structures with difficult to predict access patterns. Semantic knowledge to exploit complex opaque files, or file system interfaces that explicitly support them, may be required to optimize for this class of data. Some systems propose semantically aware optimizations such as file-type specific compression [SPGR08] and workload aware data-layout optimization [YBG+10, emIC+05].

5.2.3 Disk Utilization

Although capacity has increased by nearly two orders of magnitude since 2000, the growth in large files has ensured that utilization has dropped only slightly, as shown in Figure 5.7. Mean utilization is 43%, only somewhat less than the 53% found in 2000. One must assume that this is the result of both users adapting to their available space and hard drive manufacturers tracking the growth in data sizes. The CDF shows a nearly linear relationship, with 50% of users having drives no more than 40% full, 70% at less than 60% utilization, and 90% at less than 80%.

Figure 5.7: CDF of file systems by file system fullness.

This information is relevant to systems that attempt to take advantage of the unused capacity of file systems. Such mechanisms will be more resilient to the scaling of file system capacities when they assume a constant amount of free space, or a scaling of free space proportional to the capacity of the system. This is the case in Borg [BGU+09], where Bhadkamkar et al. proposed to reorganize on-disk data using a fixed portion of free space. In contrast, in the FS2 project [HHS05], Huang et al. attempted to use free disk space to create linear copies of all common access patterns and persist them to disk in order to improve disk read performance. With this latter approach, the amount of free space required scales with the number of workloads applied to the system. It is not clear if the number of workloads is proportional to the total size of the storage system, or how this value will scale over time. For these reasons, based on the results of this study, it is possible that capacity may become a limiting factor for some systems that might otherwise benefit from the FS2 approach.

System designers and programmers also must take care not to ignore the significant contingent (15%) of all users with disks more than 75% full. Full disk conditions are challenging to recover from, frequently untested, and often poorly managed by many applications and operating systems [Fou13, Cor13a].

That capacity has slightly outpaced utilization suggests that capacity
management in large drives is unlikely to be the most valuable optimization, particularly as disks show no signs of significant throughput or latency improvement. However, there still exists significant interest in cost savings through capacity management, and some storage systems favour the use of smaller capacity drives to increase the ratio of throughput and latency to total system capacity. For these reasons, capacity management remains relevant, and so the obvious question is: how might we best manage capacity in a storage system? I address this question in the following sections.

5.3 Deduplication in Primary Storage

In this section, I will measure the efficacy of deduplication in primary storage as shown by my study of file system contents at Microsoft. I chose the week of September 18, 2009 from this dataset to analyze, which means that results were collected on each machine Thursday night of that week. Although there exists a slight variation in the time that scans were started, I treat them all as a single point in time. This dataset includes hashes of on-disk content in both variable and fixed size chunks of varying sizes.

5.3.1 The Deduplication Parameter Space

There are two primary parameters that I can vary in processing this data: the deduplication algorithm and its parameters, and the set of file systems (called the deduplication domain) within which duplicates can be found. Duplicates in separate domains are considered to be unique contents.

First, I consider the effects of varying the chunking size of the deduplication algorithm. The size of chunks selected by the deduplication algorithm has been the source of considerable research towards improving the performance of deduplication systems by selecting larger chunks [KU10], and thereby limiting the number of seeks incurred when processing workloads on the system. Figure 5.8 shows the space deduplicated, measured as:

    1 − (number of unique chunks / number of total chunks)        (5.1)

Figure 5.8: Deduplication vs. chunk size. Whole file deduplication, fixed block deduplication (8KB), and Rabin-based deduplication (averaging 8KB) included.

Figure 5.8 plots these results against the average chunk size for fixed-chunk and Rabin-based [Rab81] deduplication algorithms, and against whole file deduplication, which doesn't depend on chunk size and so varies only slightly due to differences in the number of zeroes found and due to variations in which file system scans copied properly; see Section 4.1. This graph assumes that all file systems are in a single deduplication domain; the shape
Moving from the most costlybut highest compression 8KB Rabin algorithm to the least costly but lowestcompression whole file deduplication yields just under 20% additional spacededuplicated.Next, I consider the effect of varying the deduplication domain size oneach of the three deduplication algorithms. Rather than presenting a threedimensional graph varying all parameters, I show a set of slices through thesurface, considering the extremes of 8KB and 64KB chunking to show therange of compression achievable.The set of file systems included corresponds to the size of the file server(s)holding the machines’ file systems. A value of 1 indicates deduplicationrunning independently on each desktop machine. Whole Set means thatall 857 file systems are stored together in a single deduplication domain. Iconsidered all power-of-two domain sizes between 1 and 857. For domainsizes other than 1 or 857, I had to choose which file systems to include865.3. Deduplication in Primary Storagetogether into particular domains and which to exclude when the numberof file systems didn’t divide evenly by the size of the domain. I did thisby using a cryptographically secure random number generator. I generatedsets for each domain size ten times and report the mean of the ten runs.The standard deviation of the results was less than 2% for each of the datapoints, so I don’t believe that I would have gained much more precision byrunning more trials. As it was, it took about 8 machine-months to performthese analyses.Fraction of Space Deduplicated00.10.20.30.40.50.60.70.80.91Deduplication Domain Size (file systems)1 2 4 8 16 32 64 128 256 512 Whole SetWhole File 64 KB Fixed 8 KB Fixed 64 KB Rabin 8 KB RabinFigure 5.9: Deduplication vs. deduplication domain size.Figure 5.9 shows the effects of varying deduplication domain size. Here,a much larger range of results can be observed. Single machine performancecan vary dramatically from 20% space recovery to nearly 45% dependingon the method chosen. However, as the size of the deduplication domainincreases, the following trends appear:First, the space that can be deduplicated by every algorithm increasessignificantly. The change end-to-end in my dataset is roughly 30%, notably875.3. Deduplication in Primary Storagelarger than the difference between the most extreme choices of algorithm inFigure 5.8. Second, the relative advantage of fine-grained deduplication re-mains roughly constant, at about 20%, regardless of the group size. Finally,as described in Section 5.1, the performance cost of deduplication increaseswith larger group size, both because there are more total chunks in the sys-tem, but also because the size of the index of all chunks must be grownproportionately, which introduces more seeks during the deduplication pro-cess itself.Together, this suggests that in large deduplication domains, the relativevalue of fine grained deduplication algorithms will be less important than thesize of the domain itself, with higher performance approaches such as wholefile deduplication representing a more efficient space in the balance betweenperformance and compression. As with algorithm choice, a key decidingfactor is the relative hot or coldness of the data and the cost of storage perbyte. For some workload and cost, any position in this space may be welljustified. However, for primary storage it seems unlikely that low latencyand relatively small working sets are a good match for the most aggressivededuplication techniques, if indeed any deduplication is appropriate at all.885.3. 
5.3.2 A Comparison of Chunking Algorithms

Intuitively, one expects Rabin fingerprinting to outperform, say, whole file deduplication, because it will catch smaller sub-sequences of bytes that are shared between files. However, these improvements are not necessarily spread evenly across all files.

    Extension  8KB Fixed %      Extension  8KB Rabin %
    vhd        3.60%            vhd        5.24%
    pch        0.56%            lib        1.55%
    dll        0.55%            obj        0.84%
    pdb        0.44%            pdb        0.56%
    lib        0.40%            pch        0.55%
    wma        0.33%            iso        0.52%
    pst        0.32%            dll        0.51%
    <none>     0.29%            avhd       0.46%
    avhd       0.27%            wma        0.37%
    mp3        0.25%            wim        0.35%
    pds        0.23%            zip        0.34%
    iso        0.22%            pst        0.34%
    Total      7.27%                       11.64%

Table 5.1: Non-whole file, non-zero duplicate data as a fraction of file system size by file extension, 8KB fixed and Rabin chunking.

To better explain how specific types of data deduplicate differently, Table 5.1 shows the space deduplicated by the 8KB fixed and 8KB Rabin algorithms versus the extension of the file containing those deduplicated chunks. To understand the context in which chunking algorithms outperform whole file deduplication, the deduplicated space in this table excludes chunks of all zeros, as well as chunks that appear in whole file duplicates. Thus one can see where much of the 20% and 10% improvements of Rabin and fixed-chunk deduplication originate, respectively. In both cases, VHD files are a significant contributor. Recall that VHD files are themselves entire file systems containing many small files, stored for the purposes of operating system virtualization.

Since these VHD files likely contain Windows operating systems, they share many of the same library and system files as other file systems in the deduplication domain, and they are also a significant contributor to total storage. Because whole file deduplication does not currently penetrate that opaque type, it is unlikely to find any space reclamation in a pair of VHD files, unless a file is copied and not modified. Cumulatively, the top 12 types for both fixed and Rabin-based chunking in Table 5.1 comprise more than half the advantage of each of those methods.

Figure 5.10: Percentage contribution to deduplication by file type. Relative contribution of the top 10 file types to 8KB fixed and 8KB Rabin-based deduplication.

The results above suggest a simple approach to hybridizing deduplication, in which whole file deduplication is used for all but the most well known file types, where a more aggressive deduplication technique can be employed. Figure 5.10 shows the relative contribution of the 10 types that contribute most to chunk-based deduplication, as a percentage of the total advantage provided by the most aggressive Rabin-based approach. Thus, if one employed whole file deduplication combined with 8KB fixed chunk deduplication on the 10 types shown in the "8KB Fixed" column of Figure 5.10, they might reclaim as much as 40% of the gap between whole file and 8KB Rabin fingerprinting over the workstations seen in my study.

5.3.3 Characterizing Whole File Chunks

Next, I consider factors that influence the compression rates of whole file deduplication. This is useful to consider because even whole file deduplication carries overheads with it, and in many cases it may be advantageous to be selective in where it is applied.
The benefits of whole file deduplication are not evenly distributed. In my data set, half of all file systems achieve approximately 15% savings, but some benefit significantly more, reaching and exceeding the average level of compressibility seen in individual file systems with fine-grained deduplication (roughly 40%).

There is a considerable difference between the average size of a file and the size of the files that contribute to the space savings in a whole file deduplicated system. Figure 5.11 shows a skew towards deduplicable bytes being stored in mid-sized files, particularly between 2MB and 128MB, as those are areas where the increase in cumulative bytes in duplicate files outpaces that of all files. Not present in this graph are files larger than 16GB, which would appear to the right of the graph area shown, because they are so rarely duplicated. I discuss the presence of very large files in file systems in Chapter 7.

Figure 5.11: Bytes by containing file size. CDF of bytes by containing file size of whole file duplicates and for all files.

To understand the composition of these mid-sized files, Table 5.2 lists the top 15 file types ranked by the percentage of duplicate space they occupy across all workstations in the study. Mean file size is included in this table to give a feel for the size, but note that the large files of each type will add much more to the duplicate space than the small files. The DLL and the related lib extension are the most common, together responsible for nearly a third of primary storage duplicate space. In total, the top 15 types alone account for 75% of the duplicate space, while only contributing just over half the total space in the systems measured. More promising is that together they form less than a third of the total files in the system. This discontinuity points to a potential opportunity for a whole file deduplication heuristic that favours the file types most likely to be deduplicated, either to the exclusion of other types, or as a priority optimization. It takes 47 types cumulatively to reach 90% of the whole file duplicate space in the data in my study.

    Extension   % of Duplicate Space   % of Total Space   % of Files   Mean File Size (bytes)
    dll         20%                    10%                6%           521KB
    lib         11%                    7%                 2%           1080KB
    pdb         11%                    7%                 1%           2MB
    <none>      7%                     13%                5%           277KB
    exe         6%                     4%                 2%           572KB
    cab         4%                     2%                 0%           4MB
    msp         3%                     2%                 0%           15MB
    msi         3%                     1%                 0%           5MB
    iso         2%                     2%                 0%           436MB
    <a guid>    1%                     1%                 0%           604KB
    hxs         1%                     1%                 0%           2MB
    xml         1%                     1%                 5%           49KB
    jpg         1%                     1%                 2%           147KB
    wim         1%                     1%                 0%           16MB
    h           1%                     1%                 8%           23KB
    Total       75%                    53%                32%          N/A

Table 5.2: Whole file duplicates by extension.

5.3.4 The Impact of Sparse Files

My study includes fine-grained 4KB regions of sparse data – that is, data containing only zeros. Many file systems, including Windows NTFS, the subject of this study, include the ability to represent sparse regions in metadata in order to save space. Somewhat surprisingly, the results of my file system metadata study (see Chapter 7) suggest that this feature is seldom used, occurring in less than 1% of files.

In Table 5.3 I have plotted the total storage system utilization after whole file deduplication in three scenarios: first, without any sparse file support; then with aligned 4KB sparse regions; and finally with 4KB sparse regions aligned to Rabin-based boundaries.
This simulates what a file system may be able to reclaim with sparse support.

    Group Size   Whole File Dedup.   Whole File w/ aligned      Whole File w/ Rabin
                                     4KB Sparse Regions         4KB Sparse Regions
    1            79%                 77%                        77%
    2            77%                 75%                        75%
    4            74%                 72%                        72%
    8            71%                 69%                        68%
    16           67%                 66%                        65%
    32           64%                 62%                        62%
    64           62%                 60%                        60%
    128          60%                 58%                        58%
    256          57%                 56%                        55%
    512          60%                 58%                        57%
    Whole Set    52%                 51%                        50%

Table 5.3: The impact of sparseness on capacity consumption. Utilization is shown as a fraction of raw data.

The results show that sparseness has a surprisingly small impact beyond whole-file deduplication, usually 1%-2%. This small advantage may be worthwhile, for example, when data is not frequently re-written. In that case the cost of sparseness is mostly negligible. Further, there are other performance advantages to sparseness. Sparse regions can be more efficient to write to disk because writing them requires only a metadata update. However, in terms of capacity savings, the benefits of eliminating 4KB sparse regions are very small. Furthermore, allowing unaligned sparse file regions does not appear to be particularly useful in most cases, providing a less than 1% benefit.
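For reference, the measurement behind the aligned-sparse-region columns amounts to counting aligned 4KB blocks that contain only zeros. A minimal sketch follows; the helper name and the choice to scan whole files sequentially are mine, not the original scanner's implementation.

```python
BLOCK = 4 * 1024
ZERO_BLOCK = bytes(BLOCK)

def zero_block_bytes(path: str) -> int:
    """Bytes held in aligned, all-zero 4KB regions of the given file, i.e.,
    space that sparse-file support could avoid storing."""
    sparse = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK)
            if not block:
                break
            if block == ZERO_BLOCK:      # only full, aligned zero blocks count
                sparse += BLOCK
    return sparse
```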
To copy the file metadata inaddition to the data, as would be common in a backup workload, I usedthe Win32 BackupRead [Cor13b] API. For my purposes, imagine that thebackup format finds whole file duplicates and stores pointers to them in thebackup file. This would result in a garbage collection problem when files aredeleted, but those details are beyond the scope of my study and are likelyto be simpler than the analogous mechanism in a block-level deduplicatingstore.I considered the 483 file systems for which four continuous weeks of975.4. Deduplication in Backup Storagecomplete scans were available, starting with September 18, 2009. My backupanalysis considers each file system as a separate deduplication domain. Iexpect that combining multiple backups into larger domains would have asimilar effect as seen in primary storage, but I did not run the analysis dueto resource constraints.0% 10% 20% 30% 40% 50% 60% 70% 80% 90% Whole File 64K Rabin 8K Rabin Deduplicated Space Figure 5.12: Backup deduplication options compared. The comparativespace savings of 8KB Rabin, whole file, and 64KB Rabin on a set of backupdata.Using the Rabin chunking algorithm with an 8KB expected chunk size,block-level deduplication reclaimed 83% of the total space. Whole file dedu-plication, on the other hand, yielded 72%. These results are shown in Fig-ure 5.12, along with results for 64KB Rabin-based deduplication. Necessar-ily, the 64KB results are not directly comparable, because they are drawnfrom a different collection of machines – the machines for which I have con-sistent 64KB data across all weeks. In all cases, these numbers are highlysensitive to the number of weeks of scans used in the study; it’s no accident985.5. Summary and Discussionthat the results were around 3/4 of the space being claimed when there werefour weeks of backups. In fact, the most significant predictor of deduplica-tion efficiency, by the commonly applied measurements, is the policy underwhich backup deduplication data is stored. This result was later echoed byWallace, Douglis, et al. who showed a detailed analysis of deduplicationrates by policy and data type [WDQ+12].In considering Figure 5.12, one should not assume that because 72% ofthe space was reclaimed by whole file deduplication that only 3% of thebytes were in files that changed. The amount of change was larger thanthat, but the deduplicator found redundancy within a week as well and thetwo effects offset.5.5 Summary and DiscussionThe relative merits of deduplication depend on the innate potential to dedu-plicate a user’s data, the frequency of data access, the performance impactof seek requests to access that data, and the cost of storage. My analysisprimarily addresses this first concern. For some combinations of those char-acteristics, for example, when many workstations contain similar data thatis very cold, 7 aggressive deduplication will be extremely effective. For otherdatasets, where data is accessed frequently and performance is important,very simplified or targeted deduplication may be more appropriate. Tracing7Some of my results (not included elsewhere in this thesis for the sake of brevity) providelimited evidence that this may be the case for some minority of enterprise workstationdata. More than 30% of files in enterprise workstations are not modified over the lifetimeof the file system [MB11]. However, since the Windows operating system does not tracklast access time, it is not clear how frequently these files are read.995.5. 
Tracing and analysis can provide guidance as to how to employ deduplication in the most effective and lightweight manner to reach an organization's capacity management goals.

Since publishing some of these results in 2010 [MB11], attention has turned to the longer term performance implications of aggressive deduplication. Lillibridge et al. found in 2013 that the data fragmentation incurred by deduplication can degrade the performance of restoring a deduplicated backup to a problematic degree [LEB13]. Their approach to this problem, like prior efforts to lower the run-time overheads of deduplication [BELL09], is to trade a small measure of compression for higher performance by deduplicating less often or at coarser granularity. My analysis, presented here, shows empirical evidence that these hybrid approaches can be effective, but also shows that simple heuristics based on file type and size may present a simpler approach to deciding when to deduplicate, and what type of deduplication is appropriate. These results demonstrate how analysis of live systems can reveal simple alternatives for efficient and practical solutions to storage management problems.

My dataset also suggests that while sparse files are somewhat ineffective, other simpler deduplication techniques, such as whole file and block deduplication, are relatively good at compressing data. Although each tier of deduplication complexity offers progressively different trade-offs, other aspects of the dataset, such as its size, have a larger impact. I have also found that selectively targeting the files most likely to impact overall compression rates has the potential to further limit the cost of deduplication with a marginal effect on compression rate. These results are directly applicable to the design of space efficient storage systems today, and testify to the ability of file system analysis to better inform administrators looking to understand their own options for effective management of their data.

Chapter 6

Workload Introspection

Storage system administrators face a wide range of concerns about their systems, which may relate to performance, provisioning, utilization, or configuration. In this chapter I propose that while some of these issues are unquestionably important and demand attention, others can be largely understood and addressed with measurement. What administrators lack in many cases is just this – the simple ability to measure and understand what their systems and workloads are doing. With that ability to introspect, solving common problems, or eliminating the concern entirely, is relatively simple. Put another way, in this chapter I show that there are significant areas of misplaced effort and concern in enterprise workstation workloads, which tracing and analysis can identify and eliminate.

I provide three discrete examples drawn from the measurement of enterprise systems under live deployment. The common thread between them is an administrative concern that is difficult or impossible to satisfy without introspection, the lack of measurement available prior to the publication of my work, and the simplicity with which measurement satisfies the concern or leads obviously to an effective solution.

My first example is performance oriented, and addresses wasted effort in
I present one approach to understand, measure, and eliminatethis effort, as was motivated by live system tracing. My second example sim-ilarly addresses wasted effort in a workload, but in the form of requests that,although frequent, have no compelling reason to be issued to storage at all.My third example is different in that I use file system analysis to study frag-mentation, which is a common concern among system researchers, designers,and administrators. However, in this case, I will show that fragmentationis largely a solved problem and needn’t be considered further in most cases.The sections in this chapter each consider one of these examples, as follows:• In Section 6.1 I present findings from my investigation into the per-formance of a VDI installation at UBC and describe how, as a casestudy, tracing was used to inform and evaluate the design of a simplebut effective shared cache for VDI storage systems. This cache servesto eliminate waste in the form of identical requests being issued bydifferent clients on the same hardware.• In Section 6.2 I describe how tracing has helped to identify a numberof instances of wasted effort in client workloads, both in a researchsetting and in a live deployment. These elements of waste in workloadsare strictly unnecessary and can in most cases simply be turned off forthe benefit of the storage system.• In Section 6.3 I present analysis of on-disk fragmentation which ar-gues that disk fragmentation, as a concern for storage systems, islargely solved.1036.1. Performance Analysis of VDI Workloads• In Section 6.4 I summarize and discuss the contents of this chapter.The first two of these examples focus on on enterprise workstations de-ployed though a Virtual Desktop Infrastructure (VDI), as was used in myUBC trace. Although there are many different storage workloads one couldconsider, I chose this environment specifically because I had a strong part-ner in UBC, the VDI space is relatively under-examined in the academicliterature, and the sample size is comparable to most published studies offile system workload. VDI systems are also interesting because they includea storage stack that is particularly deep and complex, which presents manyopportunities for unnoticed inefficiencies to hide. The final example drawsfrom relevant results in my 2009 study at Microsoft.In this chapter, my application of tracing and analysis demonstratesthat in the complexity of a real workload and cluster storage architecture,important aspects of the system’s behaviour are obscured. Leveraging mea-surement tools to elucidate those characteristics in each case yields oppor-tunities to either deploy a simple solution, or to entirely eliminate an areaof concern. I begin by discussing and measuring the VDI environment.6.1 Performance Analysis of VDI WorkloadsVirtual desktops represent the latest round in a decades-long oscillation be-tween thin- and thick-client computing models. VDI systems have emergedas a means of serving desktop computers from central, virtualized hardwareand are being touted as a new compromise in a history of largely unsuccess-ful attempts to migrate desktop users onto thin clients. The approach does1046.1. Performance Analysis of VDI Workloadsprovide a number of new benefits. Giving users private virtual machinespreserves their ability to customize their environment and interact with thesystem as they would a normal desktop computer. 
provide a number of new benefits. Giving users private virtual machines preserves their ability to customize their environment and interact with the system as they would a normal desktop computer. From the administration perspective, consolidating VMs onto central compute resources has the potential to reduce power consumption, allow location-transparent access, better protect private data, and ease software upgrades and maintenance.

However, the consolidation of users onto fewer hardware resources in VDI deployments puts enormous pressure on storage because it consolidates sources of I/O load. This section provides both technical background and a performance analysis of VDI environments. From this basis, I present an empirical argument for a host-side cache called Capo, which was published in 2010 [SMW+10]. Capo is a simple but effective solution to mitigate these pressures. I will begin by describing the environment that Capo operates in. This overview is important because virtualized storage in enterprise environments both enables and obscures many of the insights that tracing and analysis can provide.

6.1.1 VDI Overview

Today, the two major vendors of VDI systems, Citrix and VMware, individually describe numerous case studies of active virtual desktop deployments of over 10,000 users. From a storage perspective, VDI systems have faced immediate challenges around space overheads and the ability to deploy and upgrade desktops over time. I will now briefly describe how these problems are typically solved in existing architectures, as illustrated in Figure 6.1.

Figure 6.1: Typical image management in VDI systems.

Copy-on-Write and Linked Clones

VDI deployments are organized around the storage of operating system images, which are each entire virtual disks, often tens of gigabytes in size. A naive approach to supporting hundreds or thousands of virtual machines results in two immediate storage scalability problems. First, VMs must have isolated disk images, but maintaining individual copies of every single disk is impractical and consumes an enormous amount of space. Second, to preserve the major administrative advantage of VDI (very flexible deployment), adding desktops requires that images can be quickly duplicated without taking the time to perform a complete copy.

Storage systems today have usually addressed this problem by using Copy-on-Write (CoW) files to represent disk images [MAC+08]. As such, this observation is not new, and it has been a recurring challenge in virtualization. Existing VDI systems make use of VM-specific file formats such as Microsoft's VHD [Mic09] and VMware's VMDK [VMW10]. Both allow a sparse overlay image to be "chained" to a read-only base image (or gold master). As shown in Figure 6.1, modifications are written to private, per-VM overlays, and any data not in the overlay is read from the base image. In this manner, large numbers of virtual disks may share a single gold master. This approach consolidates common initial image data, and new images may be quickly cloned from a single gold master.
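The read and write path of such a chained image can be summarized in a few lines. The toy class below is only an illustration of the overlay semantics described above; real formats such as VHD and VMDK track allocation in on-disk metadata rather than an in-memory dictionary, and the block size here is arbitrary.

```python
class ChainedDisk:
    """Toy model of a linked clone: a private overlay chained to a
    shared, read-only gold master image."""

    def __init__(self, base_blocks):
        self.base = base_blocks      # read-only gold master: block number -> bytes
        self.overlay = {}            # sparse, private, read/write clone

    def write(self, block_no, data):
        # All writes land in the private overlay; the base is never modified.
        self.overlay[block_no] = data

    def read(self, block_no):
        # Prefer the overlay; reads of unwritten blocks pass through to the base.
        if block_no in self.overlay:
            return self.overlay[block_no]
        return self.base.get(block_no, bytes(4096))
```

Because every clone holds only its own writes, deleting the overlays and re-pointing clones at a new gold master is all that an image upgrade requires, which is exactly the rollback model described next.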
Image Updates and Periodic Rollback

Image chaining saves space and allows new images to be cloned from a gold master almost instantaneously. It is not a panacea, though. Chained images immediately begin to diverge from the master version as VMs issue writes to them. One immediate problem with this divergence is the consumption of independent extra storage on a per-image basis. This divergence problem for storage consumption may be addressed through the use of data deduplication, as discussed in Chapter 5.

For VDI, wasted storage is not the most pressing concern: block-level chaining means that patches and upgrades cannot be applied to the base image in a manner that merges and reconciles with the diverged clones. This means the ability to deploy new software or upgrades to a large number of VMs, which was initially provided by the single gold master, is immediately lost.

The leading VDI offerings all solve this problem in a very similar way: they disallow users from persisting long term changes to the system image. When gold master images are first created and clones are deployed, the VDI system arranges images to isolate private user data (documents, settings, etc.) on separate storage from the system disk itself. As suggested initially in the Collective project [CZSL05], this approach allows a new gold master with updated software to be prepared and deployed to VMs simply by replacing the gold master, creating new (empty) clones, and throwing away the old version of the system disk along with all changes. This approach effectively "freshens" the underlying system image of all users periodically and ensures that all users are using a similar, well-configured desktop. For the most part, it also means that users are unable to install additional long-lived software within VDI images without support from administrators.

Discussion

Storage for VDI systems presents a number of interesting new solutions while also creating new problems. For system efficiency, the depth of the virtualized storage stack creates two immediate obstacles. First, administrators cannot easily see what client systems are doing because of the opaque virtualization layers. Further, the depth of the stack means that assumptions made at one layer (e.g., the linearity of the underlying address space) are less likely to be correct at lower levels. These observations together suggest that there will be many opportunities for inefficiencies to remain unnoticed. In the remainder of this section (and the next as well) I will use detailed tracing at the client level to elucidate some of these inefficiencies.

6.1.2 Temporal Access Patterns - Peaks and Valleys

Having explained the structure of virtualized environments as they pertain to storage, I next turn my attention to a specific analysis of VDI system performance. I will explore where significant load comes from in a VDI system, detail the nature and magnitude of the load, and discuss how it can be mitigated. First, I consider when the load occurs.

In Figure 6.2 I present the access patterns in IOPS across different times of day for a complete week of the trace. I have plotted two series on each day. The first tracks peak workloads, which, for the sake of clarity, requires some careful selection of processing parameters. First, to preserve the peaks in the workload without adding so much noise as to make the graph unreadable, I processed the workload to calculate the average IOPS within each 10 second period. IOPS were measured as submitted to central storage by the clients, and as measured by the clients' own clocks.
Next, to eliminate the remaining noise in the graph while emphasizing the peaks in the workload, I took the maximum of these 10 second periods within each 5 minute interval for plotting. I chose this transformation to strike a fair balance: it depicts the periods of high peak burst without creating so much jitter that the graph would be unreadable without smoothing via an average, and without eliminating valleys by taking the maximum across too-coarse time scales. Had I chosen a smaller averaging window than 10 seconds, one would see much higher peaks, because even within 10 second windows the workload is notably bursty. As such, the absolute IOPS values shown in this graph should be regarded as of lesser fidelity, and thus lesser importance, than the relative size and location of the peaks and valleys. To provide an additional, less complicated view of the workload, I have also included a simple average of IOPS over 5 minute time scales in the figure.

Figure 6.2: Activity measured in I/O operations over each of the 7 days (one panel per day, each plotting the average and peak IOPS series).

Viewed this way, the diurnal aspects of the workload are clear. There is a distinct peak load period between 8:30 and 9:30 every weekday morning, as employees arrive at work. Lunch, dinner, and late nights are periods of relative inactivity, as are the weekends. Late afternoon peaks are sporadic and difficult to predict, but reach loads nearly as high as in the morning.

These peaks are large. Peaks are frequently twice as large as the average IOPS over the range of time in which the peak occurs, which poses a significant problem for storage administrators. If they purchase centralized storage to serve the average or median request load, their users will frequently face periods where load overwhelms storage capacity by a factor of more than 2:1. Instead, many storage administrators will choose to do what has been done by the administrators of the system studied here: over-provision storage (most of the time) so that peak loads can be served at an acceptable level. Thus the absolute peaks in loads such as these are an interesting problem.

For the purpose of illustration, I have selected 6 interesting peak loads from the trace, and numbered those locations on Figure 6.2. The selection was chosen so as to capture periods where the load is quite high even compared to other peaks, and to sample several different days and times of day in order to capture a breadth of behaviour. Those workloads are explicitly listed in Table 6.1 and will be referenced in the following sections, which focus on the composition of these peak load periods and how to efficiently service them.
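For clarity, the peak and average series described above can be reproduced from a list of request timestamps with a short reduction such as the following. The input format and function names are assumptions of mine rather than the original processing scripts.

```python
from collections import Counter

def iops_per_bucket(timestamps, bucket=10):
    """Average IOPS within each `bucket`-second window (timestamps in seconds)."""
    counts = Counter(int(t // bucket) for t in timestamps)
    return {w: n / bucket for w, n in counts.items()}

def peak_series(timestamps, bucket=10, window=300):
    """For each 5-minute window, the highest 10-second average it contains."""
    peaks = {}
    for w, iops in iops_per_bucket(timestamps, bucket).items():
        slot = (w * bucket) // window
        peaks[slot] = max(peaks.get(slot, 0.0), iops)
    return peaks

def average_series(timestamps, window=300):
    """The simple 5-minute average IOPS series also plotted in Figure 6.2."""
    counts = Counter(int(t // window) for t in timestamps)
    return {s: n / window for s, n in counts.items()}
```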
Those workloads are explicitly listed in Table 6.1 and will be referenced in the following sections, which will focus on the composition of these peak load periods and how to efficiently service them.

Table 6.1: Workload peaks selected for more detailed analysis.
  Period  Time
  1       Mon. 8:30am-8:45am
  2       Fri. 8:30am-9:00am
  3       Tue. 9:15am-9:30am
  4       Mon. 2:00pm-2:30pm
  5       Tue. 2:40pm-3:00pm
  6       Wed. 4:00pm-4:15pm

6.1.3 VM Contributors to Load

In considering how to best service peak loads, I will first analyze the composition of those peaks in terms of the individual VMs from which the workload is generated. Recall that my VDI-based workload is comprised of a number of virtual desktops issuing requests to a central storage system.

Figure 6.3: Contributors to each of the major workload peaks. (A CDF of cumulative workload percentage versus number of VMs, plotted for each of the six peak periods in Table 6.1.)

Figure 6.3 shows a CDF of these VM desktops by their contribution to the total workload for each peak, which tells us if peaks are caused by multiple VMs generating workload equitably, or if peaks are caused by a few outliers during periods of heavy use. In most cases, it is the former; however, the peak in slice 4 was caused primarily by just 4 VMs.

To view this result in more relative terms, I took the 28 virtual desktops which are in regular day-to-day use and calculated the percentage of VMs which contributed at least 5% of the peak workload. As listed in Table 6.1, periods 1, 2, 3, 5, and 6 each showed 26%-29% of VMs contributing more than 5% of the total load in their respective periods. From this I conclude that in most cases peak loads tend to be caused by many VMs each experiencing storage-oriented workload at the same time, as opposed to a small number of VMs inequitably creating load.8 This observation is generally interesting in the characterization of peak loads in VDI systems, and I will leverage it specifically in Section 6.1.5 where I discuss the use of a shared cache in mitigating peak loads.

6.1.4 Disk Divergence

Since it is clear that multiple clients are contributing to peak load periods at the same time, it is natural to wonder if the VMs are engaged in similar behaviour. Since VMs typically use disk images chained from a gold master, one way to consider this question is to measure the rate at which the overlay image diverges from the original image. If divergence occurs quickly and completely then there may ultimately be little similarity between the many VMs contributing to a workload.

8 This finding would be further magnified in a larger VDI deployment, where the load on storage is much higher relative to the power of each individual workstation, and so individual desktops would simply be unable to saturate a network resource.

Figure 6.4: Bytes of disk diverging from the gold master. (Divergence in GB versus time in hours, showing the 95% confidence interval, maximum, and minimum across VMs.)

Based on this question, I calculated the frequency at which the trace observed the first write to a sector. Figure 6.4 plots this data for the average VM, as well as the most and least divergent VM, over the entire study. Within 24 hours, most VMs hit a near plateau in their divergence, around 1GB. Over time this does increase, but slowly. A smaller set of VMs do diverge more quickly and significantly, but they are far from the 95% confidence interval. I conclude that there is usually significant shared data between VMs, even after several days of potential divergence.
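The divergence curves in Figure 6.4 come down to tracking the first write each VM issues to each sector. A rough sketch of that bookkeeping follows; it assumes write records of the form (VM, time, starting sector, sector count), which is a simplification of the real trace schema, and it uses Python sets where a production tool would use per-VM bitmaps to bound memory.

    # Illustrative sketch: cumulative bytes of first-written sectors per VM.
    SECTOR_BYTES = 512

    def divergence_curves(writes, bucket_secs=3600):
        """Return {vm_id: cumulative diverged bytes per bucket_secs interval}."""
        seen = {}    # vm_id -> sectors written at least once
        fresh = {}   # vm_id -> {bucket index: bytes first written in that bucket}
        for vm, t, start, count in writes:
            touched = seen.setdefault(vm, set())
            new = [s for s in range(start, start + count) if s not in touched]
            touched.update(new)
            bucket = int(t) // bucket_secs
            per_vm = fresh.setdefault(vm, {})
            per_vm[bucket] = per_vm.get(bucket, 0) + len(new) * SECTOR_BYTES
        curves = {}
        for vm, buckets in fresh.items():
            running, curve = 0, []
            for b in range(max(buckets) + 1):
                running += buckets.get(b, 0)
                curve.append(running)
            curves[vm] = curve
        return curves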
6.1.5 Capo: Caching to Alleviate Peak Workloads

Based on the results above, my colleagues and I have developed a prototype client-side cache for virtual machine environments. The system is called Capo and has been analyzed, evaluated, and published [SMW+10] based on my 2010 UBC trace, and the resulting insights about the degree of duplication in VDI workloads.

Note that one could perform the analysis I present in this chapter with tracing at typical fidelity (i.e., tracing individual requests and their block offsets). What distinguishes the measurement in this section is the analysis that considers the impact of the virtual environment on disk divergence and the CoW structure of virtual disks. I am aware of no other analytical research that considers measurement of live systems with these considerations. In both the next section and the following chapter, I consider the advantages of considering extended details available only in my tracing framework.

Capo Architecture

Figure 6.5: The architecture of Capo, a cache for VDI servers. (Each host runs a VMM with a per-host persistent caching and request management layer, backed by a local persistent cache and a durability map; a cluster-wide request interposition layer provides transparent multi-host prefetch in front of shared storage.)

Capo is a client-side cache for a virtualized server that hosts some or many virtual machines that share gold-master images. The cache is designed to eliminate redundant reads and writes, and Capo also features a broadcast protocol used to further increase sharing by pre-loading the cache with accesses from other servers. Designing a cache on the client side of the storage system removes scalability concerns with caching at the central storage system directly, while retaining the benefits of potential sharing between VMs, as suggested by my trace analysis. The Capo architecture is shown in Figure 6.5. For brevity, I omit a more comprehensive description of Capo, including the multi-host pre-fetching, and a description of the mechanisms included in its implementation. A more comprehensive description of Capo is available in the published paper describing that system [SMW+10].

Based on the finding that peak workloads are distributed among a large number of VMs, I sought to evaluate the potential savings using Capo by determining how much similarity there is between the workload of each individual VM. Intuitively, there may be significant potential to eliminate redundancy if, for example, during initial user login each desktop is accessing the same login-related libraries and login services from storage.

Table 6.2 shows each of the highlighted periods. In the third column, for each period I list the read IOPS and read throughput as a percentage of the total IOPS and total throughput (in bytes) respectively. Like other enterprise workloads in the literature [NDT+08], the VDI workload in these periods shows approximately a 2:1 preference for writes, as measured in IOPS, and a 2:1 preference for reads, as measured in throughput. From this I conclude that both read and write requests will ultimately be important to consider in evaluating the potential benefits of a shared cache. However, for the purposes of this discussion, I first consider read requests.
Table 6.2: Read request IOPS and throughput, and the percentage of reads that are identical across VMs and across the cluster for each of the peak load periods.
  Time Period             Read % (IOPS / Thpt)   Duplicate VM Reads   Duplicate Cluster Reads
  1  Mon. 8:30am-8:45am   50% / 78%              81%                  91%
  2  Fri. 8:30am-9:00am   48% / 78%              88%                  97%
  3  Tue. 9:15am-9:30am   36% / 57%              78%                  99%
  4  Mon. 2:00pm-2:30pm   38% / 59%              59%                  99%
  5  Tue. 2:40pm-3:00pm   31% / 48%              77%                  97%
  6  Wed. 4:00pm-4:15pm   40% / 63%              99%                  >99%

Caching Read Requests

In the 4th column I present the percentage of reads to the same range of bytes that have been previously seen by that VM over the course of the trace. With a large enough cache, one could potentially absorb all these reads. However, this is not practical, as VDI deployments must scale to a large number of VMs and cannot devote a large cache to each one if they wish to benefit from economies of scale. However, it is promising that these values are so high, particularly because the trace results shown here are downstream of the page cache, which itself absorbs many of the duplicate requests.

The right-hand column presents the same measure, but imagines that caching could be shared across all VMs in the cluster. This is practical to consider because VMs share a gold master image for much of their content, and so reads of a virtual disk (as are shown here) can be known to be identical when accessing a range of bytes that has been previously viewed by another VM, provided it has not been modified previously by either VM. This finding shows very strong support for the notion of sharing a cache among different virtual machines. Among the selected peak periods, slice 4 stands out for having an unusually low duplicate read rate for VMs, but a very high rate across the cluster as a whole. I investigated and found that two very active VMs had duplicate read rates of 26% and 30%. By including the most beneficial 62%, 85%, and 96% of VMs, I could reach duplicate read rates of 40%, 60%, and 90% respectively. From this I conclude that even though caching, possibly even shared caching, can achieve a significant reduction in read requests, some benefits may require careful selection of the VMs in question.

Caching Write Requests

Now I consider the potential benefit of a cache with respect to write requests, which constitute the majority of individual operations. Figure 6.6 shows the percentage of disk writes that overwrite recently written data, for time intervals ranging from 10 seconds to a whole day. I have included results from each of the seven days to underscore how consistent the results are. Friday stands out somewhat as having a lower overwrite percentage, but it is also the case that overall write load on Friday was lower than any other day.

Figure 6.6: The benefits of delaying writeback. The percentage of bytes that need to be written to the server if writes are held back for different time periods. This is lower than the original volume of writes due to the elimination of rewrites.

In a short time span, just 10 seconds, 8% of bytes that are written are written again. This rate increases to 20%-30% in 10-minute periods and ranges between 50%-55% for twenty-four-hour periods. From this I conclude that considerable system-wide effort is spent on writing data with a high modification rate.
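The curve in Figure 6.6 can be approximated from the write stream alone. The following sketch is illustrative and assumes sector-granularity write records (time, starting sector, sector count); it counts a write as eliminated whenever the same sector is written again within the hold-back window, which is the effect a write-back cache with that horizon would have.

    # Illustrative sketch: fraction of written sectors elided by delayed writeback.
    def overwrite_fraction(writes, window):
        """writes: iterable of (time_sec, start_sector, num_sectors).
        window: hold-back period in seconds."""
        last_write = {}          # sector -> time of most recent write
        total = eliminated = 0
        for t, start, count in sorted(writes):
            for s in range(start, start + count):
                total += 1
                prev = last_write.get(s)
                if prev is not None and t - prev <= window:
                    eliminated += 1   # the earlier write never reaches the server
                last_write[s] = t
        return eliminated / total if total else 0.0

    # e.g. sweep the hold-back periods used in Figure 6.6 (in seconds):
    # for w in (10, 600, 3600, 86400):
    #     print(w, overwrite_fraction(records, w))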
This bears some similarity to the results of a file system study of NTFS from 10 years prior conducted by Vogels [Vog99], who found that most files are short lived, and at the time recommended more extensive use of the temporary file flag. It is worth noting that my analysis suggests that the flag is still not being used extensively in either the live UBC trace or (less surprisingly, because the file system should avoid writing temporary files to disk) the Microsoft on-disk data trace. This result suggests that delaying write-back may be an extremely effective means of lowering peak loads.

6.1.6 Discussion

Capo was evaluated using a replay of my 2010 trace at UBC, focusing on the peak periods identified in Table 6.1 [SMW+10], and was found to reduce peak loads to 52.5% of their total, though much of this effect was later determined to be due to increasing request latency. However, Capo was still able to reduce total workload costs to 76% of those in the observed trace. When Capo was allowed to delay write-back to central storage, the gains were significantly larger, and the system was able to reduce peak and total I/O workloads to 38.1% and 28.6% of that of the system with no caching in place. A more thorough treatment of the Capo system and its evaluation can be found in the full paper [SMW+10].

My findings have two immediate implications. First, they provide an important characterization of VDI workloads, which is generally useful to the storage system and research community. Second, they show that tracing can reveal relatively simple approaches to improving system efficiency in widely used systems. Without tracing, the degree to which co-located virtual machines share content would be entirely hidden by the storage stack. Even with conventional tracing, the degree to which content written by different VMs at the same offset is identical would be unclear. However, with detailed tracing and analysis I was able to largely determine the potential for a cache before the system itself was built. I was also able to use my trace to ultimately evaluate Capo with trace replay. I revisit the further benefits of more detailed tracing with respect to Capo in Chapter 7.

6.2 Eliminating Wasted Effort

Next I consider a different application of tracing to provide introspection into virtual workloads - eliminating unwanted requests. As I have discussed, the storage stack of a virtualized system is deep. My tracing framework provides a novel lens through which a storage administrator can introspect into the workloads from a client perspective. In addition, the tracing tools I have developed are unusually aggressive in collecting information about each request.

Without these features a typical storage administrator has no mechanism to determine which client-side applications are contributing to the load on their system. In contrast, by including the originating application for most requests, I can rank applications by their file system IOPS in each peak workload listed in Table 6.1.

Table 6.3 shows each peak load period in Table 6.1, and for each period, shows the 3 highest contributors to IOPS load in order, excluding requests that are serviced by the cache. I also excluded System and svchost because, while they are generally large contributors, they often aggregate requests on behalf of many different applications and offer few opportunities for insight or administrative remedy. In addition, these services are frequently background tasks that have less direct impact on user-perceived performance.
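Ranking applications in this way is a simple reduction once each request is annotated with its originating process. The sketch below is illustrative only; it assumes the trace has already been reduced to (period, application, served-from-cache) tuples, which is not the actual trace schema.

    # Illustrative sketch: top IOPS contributors per peak period.
    from collections import Counter

    EXCLUDED = {"system", "svchost"}   # aggregate requests for other processes

    def top_contributors(records, n=3):
        """records: iterable of (period_id, application, cached).
        Returns {period_id: [(application, iops), ...]} for the top-n apps."""
        per_period = {}
        for period, app, cached in records:
            if cached or app.lower() in EXCLUDED:
                continue                   # skip cache hits and excluded services
            per_period.setdefault(period, Counter())[app] += 1
        return {p: c.most_common(n) for p, c in per_period.items()}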
In every peak Firefox is a significant contributor, and in many cases writes appear to originate from the creation of small temporary internet files. Based on this insight, I discuss how these accesses could be eliminated or reduced in Section 7.2.

Table 6.3: Applications running in each of the peak load periods (top applications by IOPS, excluding System and svchost).
  1  Mon. 8:30am-8:45am   Search Indexer, Firefox, Sophos
  2  Fri. 8:30am-9:00am   Firefox, Search Indexer, Sophos
  3  Tue. 9:15am-9:30am   Defrag, Firefox, Search Indexer
  4  Mon. 2:00pm-2:30pm   Firefox, Pidgin, Sophos
  5  Tue. 2:40pm-3:00pm   Firefox, Defrag, Pidgin
  6  Wed. 4:00pm-4:15pm   Firefox, Pidgin, Sophos

The second most commonly observed source of application I/O requests is Sophos, which is typically accessing storage to inspect files. It would be useful to determine if these I/O requests could be mitigated by performing virus scans centrally on gold master images, rather than accessing the overlay of every virtual desktop. Virus scanning is notorious in VDI environments for causing I/O storms, where otherwise idle machines all awaken and begin I/O intensive processes at the same time. Other frequently active storage consumers include Thunderbird, Pidgin, and Microsoft Outlook.

Two surprising entries in the contributor chart are defragmentation and the search indexer. Both are commonly listed in virtualization "optimization" guides as candidates to be disabled. The defragmentor was accidentally left enabled in these images, a fact I was able to communicate to the VDI team. Defragmentation of a virtual volume is not thought to provide any value, since the use of disk overlays and massive centralization of storage on a large disk array greatly complicates the linearity-of-access assumptions that generally lead one to defragment. The search indexer in these disk images should have been disabled by default, but could still have been invoked manually on the few machines where I saw it running. Automatic search indexing in Windows is a useful feature, but can lead to I/O storms in a VDI cluster. In both cases these observations are obvious with tracing of the virtualized guest, but difficult to identify otherwise.

Multi-level tracing can also be deployed specifically to determine the cause of repeatable I/O bottlenecks. I have also used this tool to assist a colleague in identifying a bottleneck in the Windows boot process which was causing I/O storms 15 minutes after a Windows boot. The activity in this example was caused by SuperFetch [Cor14c], which must be disabled in the registry. This is straightforward to find when disk traces include the file and application associated with the request. Even though the I/O from each VM is issued at low priority, the collection of I/O activity from all the VMs proved problematic when attempting a 1000 VM boot up. Again, similar benefits of disk tracing with rich contextual information have been reported to me by others [Pra11] who have downloaded and deployed my tracing tool.
These successes of tracing while aggressively gathering contextual information for each request all speak to the ability to helpfully eliminate problematic workloads, and the degree to which such workloads are hidden by virtualization and the storage stack. However, with tracing one can often eliminate such waste with relative ease. Conventional tracing tools inform users as to when periods of load are happening, but do not list the files or applications responsible.

6.3 On-Disk Fragmentation

My third example of the benefits of workload introspection draws from file system analysis of enterprise workstations, instead of virtual desktop workloads. In this section I consider the linearity of on-disk data layouts.

The behaviour and characteristics of magnetic disks continue to be a dominant concern in storage system optimization. It has been shown that file system performance changes over time, largely due to fragmentation [SS97], and that these effects can have dramatic impacts on system performance. While there is little doubt that these findings were true when they were published in 1997, I can find no more recent investigation into the effect and no study into the degree to which it practically impacts live systems today. As a consequence, it is difficult to know how realistic most published measurements of storage system performance are, because these measurements generally use synthetic datasets that, to a greater or lesser degree, attempt to simulate this effect.

My investigation into the file system structure of Windows desktops at Microsoft in 2009 can be used to investigate whether this concern is significant. Overall, I find fragmentation to be quite rare, occurring in only 4% of files. This lack of fragmentation in Windows desktops is due to the fact that a large fraction of files are not written after they are created and due to the defragmentor, which runs weekly by default. This is true for all of my scans other than the 17 that came from machines running Windows Server 2003.

Figure 6.7: File Fragmentation. CDF and histogram of percentage of total fragments versus number of fragments per file.

I measure fragmentation in my dataset by recording the files' retrieval pointers, which point to NTFS's data blocks. Retrieval pointers that are non-linear indicate a fragmented file. Figure 6.7 shows a CDF of the total number of fragments, each measured as one of these non-linear breaks in a file's offset to Logical Block Address (LBA) mapping table.

The CDF shows that more than half of file fragments are in files that have exactly one fragment. Furthermore, even though fragmentation is not common, my results show that among files containing multiple fragments, fragments are very common. In fact, 25% of fragments are in files containing more than 170 fragments. The most highly fragmented files have a ".log" extension. From this I conjecture that these files are log files, which (if managed naively) may create a new fragment for each appending write. In this case, fragmentation is not likely to be impactful, because log files are not routinely read.

However, as I have shown in the prior chapter, file systems do contain many large disk image files which are internally complete file systems. This result says nothing about their internal structure. For these and other opaque file types fragmentation can occur at many different layers. If defragmenters are disabled, as is commonly recommended for virtual machines, virtual disks may have significant internal file system fragmentation. Furthermore, VHD formats themselves perform allocation from the files in linear 2 MB chunks. As disk formats become increasingly detached from the underlying media, the effect of fragmentation at these levels is poorly understood and warrants further study.
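For reference, the fragment counts behind Figure 6.7 amount to counting breaks in each file's offset-to-LBA mapping. The sketch below assumes each file's retrieval pointers have been exported as an ordered list of (file offset, LBA, length) extents; that export format is an assumption for illustration, and the study itself read retrieval pointers directly from NTFS during the scan.

    # Illustrative sketch: fragments counted as non-linear breaks in a file's
    # offset-to-LBA mapping, matching the definition used for Figure 6.7.
    def fragment_count(extents):
        """extents: list of (file_offset, lba, length) in file-offset order.
        A fully linear file reports 0 breaks and is considered unfragmented."""
        breaks = 0
        for (_, prev_lba, prev_len), (_, lba, _) in zip(extents, extents[1:]):
            if lba != prev_lba + prev_len:   # next extent does not follow on disk
                breaks += 1
        return breaks

    def fragmented_fraction(files):
        """files: iterable of per-file extent lists."""
        counts = [fragment_count(e) for e in files]
        return sum(1 for c in counts if c > 0) / len(counts) if counts else 0.0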
6.4 Summary

In a complex system, it is challenging to understand the behaviours of a workload and their effect on system performance. In this chapter I have investigated how efficiency lost as a result of this complexity can be better reclaimed through detailed tracing and analysis of system behaviour, and have described how designers and administrators can use tracing to identify workload problems in specific systems or challenge untested assumptions in storage systems generally. In VDI systems, I have shown specifically how a simple cache can significantly increase the performance of a storage system, and how applications running hidden within a VM container can lead to wasted system effort. I have also shown that fragmentation at the file system layer of desktop workstations is not a significant concern today, even as there may be reason for increased concern about fragmentation inside opaque file types. In every case, workload introspection makes solutions (or in the case of fragmentation, the efficacy of existing solutions) obvious, where previously they were hidden.

More recently, Le, Huang, and Wang have noted that the nesting of file systems common in VDI and other virtualized environments can lead to surprising levels of write amplification and degraded performance [LHW12]. Like the observations in this chapter, this finding is simple and impactful, but hidden by the layers of a complex storage system. Tools for analysis and tracing that make wasted effort or optimization opportunities obvious provide a mechanism for administrators to routinely investigate their own workloads to ensure their efficiency. Such tools also present a resource for designers by making knowledge of the behaviours of real systems more widely available.

Chapter 7

Namespace Modelling

Enterprise workstation file systems are typically organized in a hierarchical structure of subdirectories and files contained within directories, up to the file system root. Within that hierarchy, some directories, for example a user's home directory, have special meaning or application within the context of the larger operating system. However, in most storage systems there is no special attention paid to the organization and structure of the namespace. From the storage system's perspective the contents of a user directory are usually no different from data elsewhere in the system. In contrast, I argue that more detailed analytic study and modelling of namespaces reveals new opportunities to improve storage system design.

In this chapter I argue this position from two perspectives. First, I characterize the organization of desktop workstation namespaces generally and discuss how they have evolved since the year 2000. I argue that this is helpful to the design of new storage features by using my results to test three assumptions made of namespaces in research prior to this work. I then consider a novel application of leveraging a model of file system structure from within a storage system operating over live data.

The sections in this chapter are as follows:
• In Section 7.1 I present findings from my 2009 file system study to characterize the structure and contents of file system namespaces. I compare these results to assumptions made in prior research.

• In Section 7.2 I describe a technique for using knowledge of file system namespace topology to decrease the overheads associated with performing writeback in a storage system cache. I also demonstrate one application of including file system namespace information in workloads, by using my 2010 study of VDI workloads at UBC to validate this system.

• In Section 7.3 I summarize and discuss the contents of this chapter.

These examples both use tracing and analysis to improve our understanding of file system namespaces. In the former example, I apply knowledge of the namespaces to guide researchers and system designers. In the latter example I apply knowledge of the namespace to optimize the performance of a live system. These show how building models of the topology of file system namespaces is a particularly interesting subject for analysis, can be applied towards a better understanding of how our file systems are evolving, and can be leveraged as a tool to better optimize and tune our storage systems. I begin with a characterization of enterprise file system namespaces.

7.1 File System Namespace Characterization

The designers of file systems and file system features routinely make assumptions about the structure and organization of file system namespaces. Sometimes these assumptions are implicit and other times not. For example, Huang et al. have built a file system query mechanism that depends on sampling the namespace with a small number of depth-first traversals. The statistical soundness of this technique depends on whether that small number of traversals can capture a representative sample of the namespace as a whole [HZW+11]. If the namespace is very heterogeneous in its hierarchy, then the sampling method must be made correspondingly more complicated, which runs counter to the project's objective of sampling the namespace with very low overhead.

Other examples include Murphy and Seltzer, who have questioned the merits of traditional hierarchical file system presentations entirely [SM09], based in part on the argument that file system namespaces are growing more complex. In this work the authors argue that presenting a file system namespace as a hierarchy (irrespective of the underlying on-disk layout, which generally employs some hierarchy by necessity) is simply too confusing for end-users to manage. However, complexity in this case is difficult to define and measure accurately without quantitative data.

Finally, Peek has addressed storage system treatment of file extensions, assuming a very long tail distribution of supported file types [Pee10]. His work argues that the growing problem of supporting infrequently used file types demands automated file system support for extracting file type-specific metadata. However, the underlying assumption that there is a growing long tail of file types was tested only on a dataset of limited size.

Quantifying these assumptions against a general dataset of significant size, as I do in this section, provides valuable evidence in support of or contradictory to these claims and others, which will help designers build better, more useful features.
I report on file organization and type in the following two subsections, relating each to the research assumptions above, and then summarizing both at the end of the section.

7.1.1 Namespace Complexity

I begin by characterizing the namespace topology in order to address the concern of Murphy and Seltzer that namespaces are becoming more complex. There is no single widely accepted definition of namespace complexity. It could be seen, for example, as a larger number of files, or a less homogeneous topology, or some combination of these and other factors. In this section I enumerate a series of observations drawn from my dataset to help define and serve as evidence for or against namespace complexity.

First, I consider the organization of files within directories, starting with the number of files and directories. A CDF of total directory counts per file system is shown in Figure 7.1, and shows both an increase in the total number of directories and a widening of the distribution of directory counts. The observed mean was approximately 36,000 directories per file system. Similarly, a CDF of total file counts per file system is shown in Figure 7.2, which shows a very similar but slightly less dramatic trend. The mean file count was approximately 225,000 files, but a non-trivial fraction (roughly 1:20) of the systems I observed contained over a million files, and the most populated file system served 3.4 million files.

In both cases, it is clear that in addition to the increase in file sizes shown in Chapter 5, there are marked increases in the size of desktop workstation namespaces. This increase is largely unsurprising, but the measurement of the increase is nonetheless valuable, and places the remaining results of this section into better context.

Figure 7.1: CDF of file systems versus directories.

Figure 7.2: CDF of file systems versus files.

Next I consider the organization of files and directories within directories. Figure 7.3 shows the number of files per directory. While the change is small, it is clear that even as users in 2009 have many more files, they have fewer files per directory, with a mean of 6.25 files per directory.

Directories can also hold subdirectories. Figure 7.4 shows the distribution of subdirectories per directory. Since each directory is both a directory and a subdirectory, the mean number of subdirectories per directory is necessarily one.9 This means the fact that the distribution is slightly more skewed toward smaller sizes indicates that the directory structure is deeper with a smaller branching factor.

To add some context to this result, Figure 7.5 shows the histogram and CDF of files by directory depth for my on-disk data study. Similar results have not been published in any prior work to my knowledge, so this result cannot be compared to prior years. The histogram in this case is somewhat more informative than the CDF, as it shows a roughly normal distribution around a depth of 7 directories, with a spike at 3. The tail of files in deep directories is heavy, with 10% of files above depth 10 and some as deep as twice that. These findings further indicate that namespaces contain non-trivial amounts of data in a relatively small number of very deep paths.

9 Ignoring that the root directory is not a member of any directory.
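The statistics reported in this subsection reduce to a single pass over a metadata scan. The sketch below is illustrative and assumes the scan is available as (path, is-directory, size) rows with POSIX-style separators; the actual study used a Windows scanner with a richer schema.

    # Illustrative sketch: per-directory and per-depth namespace statistics.
    import posixpath
    from collections import Counter

    def namespace_stats(rows):
        """rows: iterable of (path, is_directory, size_bytes)."""
        files_per_dir, subdirs_per_dir = Counter(), Counter()
        files_by_depth, bytes_by_depth = Counter(), Counter()
        n_files = n_dirs = 0
        for path, is_dir, size in rows:
            parent = posixpath.dirname(path)
            depth = path.count("/")            # root-level entries have depth 1
            if is_dir:
                n_dirs += 1
                subdirs_per_dir[parent] += 1
            else:
                n_files += 1
                files_per_dir[parent] += 1
                files_by_depth[depth] += 1
                bytes_by_depth[depth] += size
        return {
            "files": n_files,
            "directories": n_dirs,
            "mean_files_per_dir": n_files / max(n_dirs, 1),
            "files_by_depth": dict(files_by_depth),
            "bytes_by_depth": dict(bytes_by_depth),
        }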
Figure 7.3: Histogram of number of files per directory.

Figure 7.4: Histogram of number of subdirectories per directory.

Figure 7.5: CDF of number of files versus depth in file system hierarchy.

Figure 7.6: CDF of bytes versus depth in file system hierarchy.

Finally, I consider how storage consumption is distributed among the storage hierarchy. Figure 7.6 shows the histogram and CDF of bytes by directory depth. This graph, combined with the previous result, shows that by capacity, larger files are higher in the namespace, with the many small files appearing at all depths of the namespace.

Overall, my dataset suggests that the namespace is changing in 4 ways:

1. There are significantly more files and directories overall.
2. The namespace branching factor is growing smaller.
3. The file system hierarchy is becoming deeper.
4. There are somewhat fewer files per directory.

These increases in namespace size and depth do lend some evidence to Murphy and Seltzer's argument that namespaces are growing more complex, which could argue for a change to file system organization, particularly if one assumes that the more rapid growth in directory counts is an attempt to manage the increasing file count.

It is tempting to conclude that the exponential increase in file count is causing an even larger increase in directory counts in an attempt to bring the burgeoning complexity back under control. This is a particularly attractive conclusion, because the effect appears to be uniformly distributed across the various depths of the file system namespace, which suggests that both users and software that autonomously populates the hierarchy are responding similarly. At the same time, fewer files per directory is most obviously a sign of decreased (or unchanging) complexity. I conclude that while much of the weight of evidence does suggest an increase in namespace complexity, my results are mixed overall.

In addition, my results suggest that real namespaces contain data in a relatively small number of very deep paths, which contradicts the assumption of Huang et al. of a nearly complete tree structure that can be sampled with a tractable number of depth-first traversals [HZW+11]. Figure 7.6 shows, similarly, that files beyond the average depth contain roughly 20% of the data in a typical file system.

7.1.2 File Types

I now turn my attention to the general popularity of file extensions, in an attempt to characterize this aspect of real namespaces and to assess Peek's assumption of a very long tail distribution of supported file types [Pee10].

Figure 7.7 shows a CDF of the number of files in the 2009 on-disk data study versus the number of unique extensions. Since 2000, there has been a steady trend towards more diversity in file extensions. Still, the most common file extensions are used by most files in the system. In 2009 half of files had one of 14 extensions, whereas in 2000 half of files had one of 10 extensions.

Figure 7.8 details the ten most common extensions and their contribution towards the total number of files in the study. It is primarily a testament to the large number of developer workstations in my dataset, but within that context it reveals some interesting trends.
GIF files (a web image format) and HTM files (the extension used for many web pages), particularly those created on Windows PCs, have both dropped dramatically in popularity in both the relative and the absolute sense. At the same time, XML (the structured document layout format) and MANIFEST files, which are a specific XML-based format, have risen to meet dll (dynamic loaded library) and h files (C/C++ programming headers) in popularity. Meanwhile, although cpp (C++ source) files remain as popular as ever, C programming source files have dropped off the list and been replaced with cs files (C# source files).

These findings, particularly the uniform increase in the rarity of file extensions shown in Figure 7.7, show that there is indeed a heavy tail of file extensions. They also show that the trend is becoming more pronounced over time. These findings together do seem to support Peek's motivation [Pee10] for addressing support for rare file types.

Figure 7.7: CDF of files versus extensions. Total file count versus number of file extensions for years 2009, 2004, and 2000 shown.

Figure 7.8: Percentage of files versus extension. Total file counts and file extensions for years 2009, 2004, and 2000 shown.

7.1.3 Discussion

My analysis of desktop workstation namespaces is a key resource for researchers who must frequently make assumptions based on characterizations of file system namespace structure. In a specific example, the long tail of directory depth relative to the average depth of files suggests that approaches to querying that use file system sampling [HZW+11] are likely to need very large sample sizes or must otherwise find ways to sample the deepest paths through the namespace. This result is also seen in other heavily skewed distributions, for example that of file size. I can confirm the assumption of Peek [Pee10] that there is an increasingly long tail of file types. Finally, though the evidence is mixed, there is good support for Murphy and Seltzer's [SM09] view that namespaces are becoming more complex.

I have quantified the precise ways in which storage system namespaces are becoming more complex. My findings are that there are more files and directories, deeper namespaces, and more diverse file types. Only the number of files and subdirectories per directory suggests any simplification in structure (or at least a lack of increase in complexity). This information is important for enterprise storage designers and administrators, who must both contend with the management and performance challenges that come from deep and complex hierarchies and a wider array of file types. If this trend continues, it seems likely that more support for understanding and searching file systems will be necessary. In the past, proposals such as Semantic File Systems [GJSO91] and query-based namespaces [GM99] have drawn research interest but little traction in general purpose file systems. It may be time to revisit some of these ideas in the context of modern workloads.

In the following section I consider another application of modelling namespaces in desktop workstations - that of using the namespace to prioritize data placement.

7.2 Leveraging Namespaces with Differential Durability

Live storage systems typically pay little or no regard to the topology of namespaces.
This section details my second application of modelling workstation namespaces, which is to use knowledge of the namespace to increase performance by applying different data reliability policies to different regions of the namespace. I call this approach Differential Durability, and have used it to reduce I/O load on a central storage system by up to 28.4% under simulation.

Differential Durability was inspired by an analysis of my 2010 study of VDI performance, which is presented below. I also describe how the feature was added to the Capo cache discussed in Chapter 6. Finally I use the results of that study as well as several new micro-benchmarks to evaluate the effectiveness of the feature.

Workload Characterization

Recall that during the trace collection of my 2010 study of VDI performance, I took the step to associate each disk-bound block-level request with the name and path of the accessed file. This multi-level analysis has proven extremely valuable in understanding VDI-based workloads. In Figure 7.9 I show aggregate read and write operation count and throughput for the entire trace, which is very nearly exactly 2:1 write heavy in operation count and 2:1 read heavy in throughput. I also show where those requests are located inside the file system namespace. I chose the categories shown to highlight different uses of the file system, under the presumption that users view files in their personal directories differently than they do, for example, the pagefile, or Program Files that are only written by an installer.

Figure 7.9: Size and amount of block-level writes by file system path. (Writes total 17M operations and 290.9GB; reads total 9M operations and 440.5GB; each bar is divided among the User, Windows, Program Files, Temp, pagefile.sys, Metadata, Other, and Unknown categories.)

The file-level information associated with the trace shows that metadata operations account for a large portion of the requests. My current trace driver is unable to determine where within the namespace these operations occur, but it can determine the namespace location of most file-level operations. Files within directories typically managed by the operating system, such as \Windows and \Program Files, are also very frequently accessed. There are fewer accesses to files in user directories and temporary files; most of the latter are to \Temporary Internet Files, as opposed to \Windows\Temp. These findings strongly contrast those of Vogels, whose older study showed that 93% of file-level modification occurred in \User directories [Vog99]. From this I conclude that while a wide range of the namespace is accessed, it is not accessed uniformly, and access to data directly managed by users is rare. These findings are important because they suggest that applying data retention policies specific to the type of data being accessed may provide a potential benefit, as I will show.

Namespace Divergence

Similar to the per-VM total disk divergence measurement in Figure 6.4, one might wonder if the shared file system namespace diverges from the base image at different rates in different regions of the namespace.

Figure 7.10 plots the cumulative divergence for each VM in the cluster under observation, and divides that total among various components of the namespace. One can observe that the pagefile diverges immediately, then remains a constant size over time, as does the system metadata. Both of these are bounded in size. Meanwhile, writes to \Windows and areas of the disk I cannot associate with any file continue to grow over the full week of the study. This finding supports my previous presumption that different locations within the namespace are accessed differently. I conclude that while writes occur everywhere in the namespace, they exhibit meaningful trends when categorized according to the destination.
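A minimal sketch of the categorization behind Figures 7.9 and 7.10 is shown below. It assumes each block request has already been annotated with the file path it touches (or no path at all when none could be associated); the category names follow the figures, but the prefix rules are my own illustration rather than the exact rules used, and categories such as Metadata come from the trace framework itself rather than from a path.

    # Illustrative sketch: bucketing annotated block requests by namespace region.
    CATEGORIES = [
        ("pagefile.sys",  ["\\pagefile.sys"]),
        ("Temp",          ["\\Temp\\", "\\Temporary Internet Files\\"]),
        ("Program Files", ["\\Program Files\\"]),
        ("Windows",       ["\\Windows\\"]),
        ("User",          ["\\Users\\"]),
    ]

    def categorize(path):
        if path is None:
            return "Unknown"
        for name, prefixes in CATEGORIES:
            if any(p.lower() in path.lower() for p in prefixes):
                return name
        return "Other"

    def bytes_by_category(requests):
        """requests: iterable of (path, is_write, length_bytes)."""
        totals = {}
        for path, is_write, length in requests:
            key = ("write" if is_write else "read", categorize(path))
            totals[key] = totals.get(key, 0) + length
        return totals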
Figure 7.10: Total divergence versus time for each namespace category. (Total system divergence in GB over the week of the trace, divided among the Unknown, Other, Metadata, Pagefile, Temp, Program Files, Windows, and User categories.)

7.2.1 Differential Durability

Drawing from my findings on the non-uniformity of namespace access above, I propose a new model for delaying file system writeback as a performance optimization. Under a Differential Durability writeback policy, file-level accesses are assigned a policy that operates over the location of the modification within the namespace. A cache, like the Capo system described in Section 6.1.5, then uses these policies to determine how long to delay writeback, in order to lower both the peak and average load on central storage.

Recall that all major VDI providers have adopted the software update strategy proposed in The Collective [CZSL05], where user directories are isolated from the rest of the file system. Meanwhile, modifications to the system image are performed on all VMs in one step by completely replacing the system images in the entire pool, leaving the user's data unmodified. This impacts durability: any writes to the system portion (e.g., by updating the registry or installing software) will be lost. Differential Durability extends this notion to optimize for a write-heavy workload by allowing different write-back policies for different locations in the file system namespace.

Reducing Write-Back Costs

As a thought experiment, consider a user who installs Microsoft Word and uses it to create a short document. The install process creates temporary .cab files, which are deleted after installation. The installation itself writes to the Windows Registry and Program Files folder. Finally, the user's document will be saved to their desktop.

One might ask: in the event of an unusual failure and crash, what level of loss is acceptable in exchange for efficiency? To which one would almost certainly answer: it depends. The temporary files by their very nature are ephemeral and could be lost with no penalty. Contrastingly, it is reasonable to assume that the user document represents original work and should never be lost. Between these two extremes, the installation itself represents some work on behalf of the user, but a perfunctory installation could be repeated if necessary. It might be acceptable if the loss of such effort was limited to, for example, an hour or even a day. I call these three classes of durability requirements: none, full, and partial.

Based on this example, consider that there are some system files that need not be durably stored at all. These include files that are discarded on system restarts or can easily be reconstructed if lost. Writes to the pagefile, for example, represent nearly a tenth of the total throughput to centralized storage in my workload, as shown in Figure 7.9.
These writes consume valuable storage and network bandwidth, but since the pagefile is discarded on system restart, durably storing this data provides no benefit. The additional durability obtained by transmitting these writes over a congested network to store them on highly redundant centralized storage provides no value because this data fate-shares with the local host machine and its disk. Many temporary files are used in the same way, requiring persistent storage only as long as the VM is running.

Differential Durability policies designate this data to local disk only, assigning it a write-back cache policy with an infinitely long write-back period. In the event of a hardware crash on a physical host, the VM will be forced to reboot, and the data can be discarded.

Other writes, such as the modifications to the Program Files directory, can benefit from delaying write-back, even for relatively short periods. These delayed writebacks can occur during periods of relative inactivity, while many re-writes can be absorbed entirely, as I showed in Figure 6.6. In the unlikely event of a local disk failure, a user's experience of losing a single day's modification to the Program Files folder looks no different from the weekly reset to the base image, and in practice it can eliminate a significant number of writes to central storage.

Differential Durability on User Files

Modifications made to files in the user directories must be durable; users depend on these changes. The Differential Durability policy therefore uses write-through caching on these accesses, propagating all changed blocks immediately to the centralized storage servers. However, this is not as simple as coarsely partitioning the file system at the root directory.

While much of the data on the User volume is important to the user and must have maximum durability, Windows, in particular, places some files containing system data in the User volume. Examples include log files, the user portion of the Windows registry, and the local and roaming profiles containing per-application configuration settings. Table 7.1 shows some paths on User volumes in Windows that can reasonably be cached with a write-back policy and a relatively long write-back period. Obviously, consideration must be used in selecting these policies so as to preserve user effort.

Design

Initially, I approached the problem of mapping these policies to write requests as one of request tagging, in which a driver installed on each virtual desktop would provide hints to the local cache about each write. While this approach is flexible and powerful, maintaining the correct consistency between file and file system metadata (much of which appears as opaque writes to the Master File Table in NTFS) under different policies is challenging. Furthermore, the tagging requires many file name lookups, which increases memory overhead for every open file. Instead, I have developed a simpler and better performing approach using existing file system features.

The path-based policies I use in my experiments can be seen in Table 7.1; naturally, these may be customized by an administrator. This list is provided (and used) as an example of conservative, well-reasoned policies. I did not attempt to tailor these policies to this specific workload; in fact, as the evaluation will show, several opportunities for improving performance further still exist.
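Table 7.1 implies a per-request, longest-prefix policy lookup of the following form. The sketch is an illustration of the policy semantics only; the abbreviated entries, the user name, and the matching rule are my own assumptions, and, as described next, the deployed design avoids per-request lookups entirely by splitting the volume with reparse points.

    # Illustrative sketch: longest-prefix match from path to durability policy.
    POLICIES = {
        "\\Program Files\\":                    "write-back",
        "\\Windows\\":                          "write-back",
        "\\Users\\$USER$\\AppData\\Local\\":    "write-back",
        "\\Users\\$USER$\\AppData\\Local\\Microsoft\\Windows\\Temporary Internet Files\\": "no-write-back",
        "\\pagefile.sys":                       "no-write-back",
        "\\Temp\\":                             "no-write-back",
    }
    DEFAULT = "write-through"   # user data and file system metadata

    def policy_for(path, user="alice"):
        # "alice" is a placeholder user name for illustration.
        expanded = {k.replace("$USER$", user).lower(): v for k, v in POLICIES.items()}
        matches = [(len(p), pol) for p, pol in expanded.items()
                   if path.lower().startswith(p)]
        return max(matches)[1] if matches else DEFAULT

    # policy_for("\\Windows\\System32\\drivers\\x.sys")            -> "write-back"
    # policy_for("\\Users\\alice\\Documents\\thesis.docx")         -> "write-through"
    # policy_for("\\Users\\alice\\AppData\\Local\\Microsoft\\Windows"
    #            "\\Temporary Internet Files\\x.tmp")              -> "no-write-back"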
I provide these policies to a disk optimization tool that I run when creating a virtual machine image. The optimization tool also takes a populated and configured base disk image. For each of the two less-durable policies, it takes the given path and moves the existing data to one of two newly-created NTFS file systems dedicated to that policy. It then replaces the path in the original file system with a reparse point (Windows' analogue of a symbolic link) to the migrated data. This transforms the single file system into three file systems with the same original logical view. Each of the three file systems is placed on a volume with the appropriate policy provided by the local disk cache. This technique is similar to the view synthesis in Ventana [PGR06], though I am the first to apply the technique with a local cache to optimize performance.

Table 7.1: Sample cache-coherency policies applied as part of the Differential Durability optimization.
  Path                                                                         Policy
  \Program Files\                                                              write-back
  \Windows\                                                                    write-back
  \Users\ProgramData\VMware\VDM\logs                                           write-back
  \Users\$USER$\ntuser.dat                                                     write-back
  \Users\$USER$\AppData\local                                                  write-back
  \Users\$USER$\AppData\roaming                                                write-back
  \pagefile.sys                                                                no-write-back
  \ProgramData\Sophos                                                          no-write-back
  \Temp\                                                                       no-write-back
  \Users\$USER$\AppData\Local\Microsoft\Windows\Temporary Internet Files       no-write-back
  Everything else, including user data and file system metadata               write-through

I appreciate that applying different consistency policies to files in a single logical file system may be controversial. The risk in doing so is that a crash or hardware failure results in a dependency between a file that is preserved and a file that is lost. Such a state could lead to instability; however, I am aware of no dependencies crossing from files with high durability requirements to those with lower durability requirements in practice. Further, I observe that this threat already exists in the production environment I studied, which overwrites system images with a common shared image on a weekly basis.

Evaluation

This section describes several micro-benchmarks that evaluate the effectiveness of Differential Durability in isolation from other features and provide a clearer mapping of end-user activity to observed writes. I applied the policies in Table 7.1 to several realistic desktop workloads. For each, I measured the portion of write requests that would fall under each policy category. I use synthetic user-generated workloads here to highlight both the scale of, and the reason for, Differential Durability's improvement in the overall workload. A breakdown of writes by their associated policy for each workload is shown in Figure 7.11. Differential Durability was also incorporated into Capo and evaluated against the full workload, which I discuss subsequently.

Web Workload

My web workload is intended to capture a short burst of web activity. The user is made to open Facebook10 with Microsoft Internet Explorer, log in, and post a brief message to their account. They then log off and close the browser.

10 http://www.facebook.com
Figure 7.11: Percentage of writes by cache-coherence policy. Writes in each of the three micro-benchmarks are organized by governing cache-coherence policy. (For each of the Facebook, Microsoft Outlook, and Microsoft Word workloads, writes are divided among the write-through, write-back, and no-write-back policies and broken down by the Unknown, Other, Metadata, User, Local Config., Windows, pagefile.sys, and Temp categories.)

The entire task lasts less than a minute. The workload consists of 8MB (43.6% by count) of writes and 25.3MB (56.4% by count) of reads. Recall that in Section 6.2 I showed that web browsing activity such as this is a significant contributor to the peak load periods I have been exploring. In this short workload only a small but non-trivial improvement can be made. Local configuration changes such as registry, temp file, and cache updates are buffered for a short time, removing or delaying just over 20% of the operations.

Email Workload

My email workload is based on Microsoft Outlook. The user sends emails to a server I have configured to automatically reply to every message by sending back an identical message. Ten emails are sent and received in succession before the test ends. The workload consists of 63MB (39% by count) of writes and 148MB (61% by count) of reads, matching well the characteristics observed in the trace.

Here the improvement is much more substantial. Although very few writes can be stored to local disk indefinitely, over half can be delayed in writing to centralized storage. This is due to Outlook's caching behaviour, which makes heavy use of the System and Application Data folders. User-visible email files are all stored in .pst files included in the user category and safely stored, which I was able to verify by forcing a shutdown before delayed write-back. It is worth noting that many files in the Windows and Application Data categories are obvious temp files, but did not match any of the policies in Table 7.1. With more careful tuning, the policies could be further optimized for this workload.

Application Workload

My application workload is intended to simulate a simple editing task. The user opens Microsoft Word and creates a new document. The user also opens Wikipedia11 in Microsoft Internet Explorer. She then proceeds to navigate to 10 random Wikipedia pages in turn, and copy the first paragraph of each into the Word document, saving the document each time. Finally, the user closes both programs. The workload consists of 120MB (20.0% by count) of writes and 406MB (80.0% by count) of reads. In addition to simulating a typical simple multi-tasking activity, this test is intended to dirty memory.

Viewing many small pages creates a large number of small writes to temporary files, and memory pressure12 increases the pagefile usage. Both programs write significantly to System folders, leaving less than 36% of the workload to be issued as write-through.

11 http://www.wikipedia.org
12 The guest was running Windows Vista with 1GB of RAM, 25% higher than the XenDesktop recommended minimum.

Trace-Driven Workload

I also modified the disk image used in the trace replay of the Capo system in Section 6.1.5. With Differential Durability, Capo was able to reduce both peak and total loads by half, to 50.1% and 47.6% of their original totals. This is a respective improvement of 2.4% and 28.4% over a version of Capo without Differential Durability.
A more detailed discussion is provided in the FAST 2011 Capo paper [SMW+10].

7.3 Summary and Discussion

The structure and organization of file system namespaces contain valuable information that is too often unavailable. From the perspective of designers and implementers who seek to understand file system structure, there are very few public repositories from which to model namespaces. Outside of user-interaction studies, which are far more limited in scope, the characterization in Section 7.1 of this chapter forms the only such resource made available in the past decade.

Applying models of desktop workstation namespaces to unchecked assumptions in prior research yields a range of results. In some cases, new features do appear to be designed on assumptions which my datasets support well, but others do not. Sourcing relevant results in the study of real namespaces would help guide researchers and designers towards solving real problems in realistic ways.

As an example, I have shown that it is possible to leverage knowledge of file system structure in a live system to make trade-offs between performance and durability. In this latter case, I was also able to evaluate the system's effectiveness on a live dataset because my 2010 trace of file system workload included multi-layer metadata, allowing me to correlate block-level access below the cache with file names and paths. Differential Durability is relatively easy to implement and simple to deploy once the insight to separate durability based on namespace is made clear by trace analysis. Further, it is quite effective at reducing the demands placed on storage systems, as measured by requests made to central storage. An evaluation based on my traced workload showed a decrease in total request load of 28.4% due to Differential Durability alone.

Chapter 8

Conclusion

In this thesis I have detailed several advances in the measurement of data storage systems. In Chapter 4 I described the architecture of my data collection and analysis framework, and described two case studies in which it was applied to gather information about large scale systems in live deployment. Based on these case studies I have presented findings in three areas of storage system management and development.

In Chapter 5 I described how tracing and analysis can be applied to manage storage system capacities more effectively, and characterized the duplication inherent in a general purpose dataset for the first time. Among my results, I find that while all forms of deduplication work, most of the advantage of performance-intensive deduplication can be had by simpler approaches. I also find that selectively targeting the specific files where complex approaches work more effectively is likely to close the gap even further.

In Chapter 6 I explored three applications of workload introspection, which I define as the ability to gain insight into workloads via tracing and analysis. I showed how tracing was instrumental in the design, motivation, and evaluation of the Capo cache. I also list a number of cases in which more detailed tracing than is typical has aided in the identification of problematic workloads.
Finally, I describe how I have used file system analysis to present the first empirical measurement of commercial defragmentation.

In Chapter 7 I described how modelling of file system namespaces can be used to better characterize our storage systems and their impact on existing research, and also how it can be leveraged to enable new alternatives in balancing performance and durability. With Differential Durability, writes are buffered in a cache for a controlled period of time in order to mitigate system load.

In addition to my analysis and datasets, the tools I have created are intended to be used by an organization to perform regular, persistent analysis of its workloads and systems, in order to facilitate simple and straightforward approaches to data management and to mitigate the effects of growing system size and feature complexity. The suite of tools also presents a useful starting point for other researchers wishing to perform their own studies of system behaviour, in order to corroborate my findings or, more usefully, extend the relatively small body of existing public datasets on storage system workloads, metadata, and content.

Bibliography

[AAADAD11] Nitin Agrawal, Leo Arulraj, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Emulating goliath storage systems with David. In Proceedings of the 9th USENIX Conference on File and Storage Technologies, FAST'11, Berkeley, CA, USA, February 2011. USENIX Association.
[AADAD09] Nitin Agrawal, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Generating realistic impressions for file-system benchmarking. In Proceedings of the 7th USENIX Conference on File and Storage Technologies, FAST'09, pages 125–138, Berkeley, CA, USA, February 2009. USENIX Association.
[ABDL07] Nitin Agrawal, William J. Bolosky, John R. Douceur, and Jacob R. Lorch. A five-year study of file-system metadata. In Proceedings of the 5th USENIX Conference on File and Storage Technologies, FAST'07, Berkeley, CA, USA, February 2007. USENIX Association.
[ALR+12] Cristina L. Abad, Huong Luu, Nathan Roberts, Kihwal Lee, Yi Lu, and Roy H. Campbell. Metadata traces and workload models for evaluating big storage systems. In Proceedings of the 2012 IEEE/ACM Fifth International Conference on Utility and Cloud Computing, UCC'12, pages 125–132, Washington, DC, USA, 2012. IEEE Computer Society.
[AMW+03] Marcos K. Aguilera, Jeffrey C. Mogul, Janet L. Wiener, Patrick Reynolds, and Athicha Muthitacharoen. Performance debugging for distributed systems of black boxes. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, SOSP'03, pages 74–89, New York, NY, USA, 2003. ACM.
[And09] Eric Anderson. Capture, conversion, and analysis of an intense NFS workload. In Proceedings of the 7th USENIX Conference on File and Storage Technologies, FAST'09, pages 139–152, Berkeley, CA, USA, February 2009. USENIX Association.
[Ass13] Storage Networking Industry Association. Traces, October 2013. http://iotta.snia.org/traces/.
[Ass14a] Storage Networking Industry Association. IOTTA repository, 2014. http://iotta.snia.org/traces.
[Ass14b] Storage Networking Industry Association. SSSI workload I/O capture program, 2014. http://snia.org/forums/sssi/wiocp.
[AWZ04] A. Aranya, C. P. Wright, and E. Zadok. Tracefs: A file system to trace them all. In Proceedings of the 3rd USENIX Conference on File and Storage Technologies, FAST'04, pages 129–143, Berkeley, CA, USA, March 2004. USENIX Association.
[BBU+09] Shivnath Babu, Nedyalko Borisov, Sandeep Uttamchandani, Ramani Routray, and Aameek Singh. DIADS: Addressing the "my-problem-or-yours" syndrome with integrated SAN and database diagnosis. In Proceedings of the 7th USENIX Conference on File and Storage Technologies, FAST'09, pages 57–70, Berkeley, CA, USA, February 2009. USENIX Association.
[BCG+07] Paramvir Bahl, Ranveer Chandra, Albert Greenberg, Srikanth Kandula, David A. Maltz, and Ming Zhang. Towards highly reliable enterprise network services via inference of multi-level dependencies. In Proceedings of the 2007 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, SIGCOMM'07, pages 13–24, New York, NY, USA, 2007. ACM.
[BCGD00] William J. Bolosky, Scott Corbin, David Goebel, and John R. Douceur. Single Instance Storage in Windows 2000. In Proceedings of the 4th Conference on USENIX Windows Systems Symposium, pages 2–2, Berkeley, CA, USA, 2000. USENIX Association.
[BDIM04] Paul Barham, Austin Donnelly, Rebecca Isaacs, and Richard Mortier. Using Magpie for request extraction and workload modelling. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation, OSDI'04, pages 18–18, Berkeley, CA, USA, 2004. USENIX Association.
[BELL09] Deepavali Bhagwat, Kave Eshghi, Darrell D. E. Long, and Mark Lillibridge. Extreme binning: Scalable, parallel deduplication for chunk-based file backup. In Proceedings of the 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS'09, pages 1–9, Washington, DC, USA, 2009. IEEE Computer Society.
[BGU+09] Medha Bhadkamkar, Jorge Guerra, Luis Useche, Sam Burnett, Jason Liptak, Raju Rangaswami, and Vagelis Hristidis. BORG: Block-ReORGanization for self-optimizing storage systems. In Proceedings of the 7th USENIX Conference on File and Storage Technologies, FAST'09, pages 183–196, Berkeley, CA, USA, February 2009. USENIX Association.
[BHK+91] Mary G. Baker, John H. Hartman, Michael D. Kupfer, Ken W. Shirriff, and John K. Ousterhout. Measurements of a distributed file system, 1991.
[BKL+10] Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, and Peter Vajgel. Finding a needle in Haystack: Facebook's photo storage. In Proceedings of the 9th Symposium on Operating Systems Design and Implementation, OSDI'10, pages 1–8, Berkeley, CA, USA, 2010. USENIX Association.
[Blo70] Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13:422–426, 1970.
[CAK+04] Mike Y. Chen, Anthony Accardi, Emre Kıcıman, Jim Lloyd, Dave Patterson, Armando Fox, and Eric Brewer. Path-based failure and evolution management. In Proceedings of the 1st USENIX Symposium on Networked Systems Design and Implementation, NSDI'04, pages 309–322, Berkeley, CA, USA, March 2004. USENIX Association.
[CAVL09] A. T. Clements, I. Ahmad, M. Vilayannur, and J. Li. Decentralized deduplication in SAN cluster file systems. In Proceedings of the USENIX Annual Technical Conference, ATEC'09, page 8, Berkeley, CA, USA, June 2009. USENIX Association.
[CCZ07] Anupam Chanda, Alan L. Cox, and Willy Zwaenepoel. Whodunit: Transactional profiling for multi-tier applications. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, EuroSys'07, pages 17–30, New York, NY, USA, 2007. ACM.
[CGC11] Cornel Constantinescu, Joseph S. Glider, and David D. Chambliss. Mixing deduplication and compression on active data sets. In Data Compression Conference, DCC'11, pages 393–402, 2011.
[Cor10a] Microsoft Corp. File systems, 2010.
[Cor10b] Microsoft Corp. Volume Shadow Copy Service, 2010.
[Cor13a] Microsoft Corp. Computer does not crash when..., October 2013. http://support.microsoft.com/kb/2492505.
[Cor13b] Microsoft Corporation. BackupRead function, November 2013. http://msdn.microsoft.com/en-us/library/aa362509(VS.85).aspx.
[Cor13c] Microsoft Corporation. Recommended backup schedule for System Center 2012 - Operations Manager, November 2013. http://technet.microsoft.com/en-us/library/hh278860.aspx.
[Cor13d] Symantec Corporation. Recommended basic backup configurations when protecting Enterprise Vault 9.0 using the NetBackup 7.0 or later Enterprise Vault Agent, September 2013. http://www.symantec.com/docs/TECH150237.
[Cor14a] Microsoft Corporation. Legal and corporate affairs, 2014. www.microsoft.com/en-us/legal/Default.aspx.
[Cor14b] Microsoft Corporation. Load order groups and altitudes for minifilter drivers, 2014. http://msdn.microsoft.com/en-us/library/windows/hardware/ff549689.aspx.
[Cor14c] Microsoft Corporation. Windows 7 & SSD: Defragmentation, SuperFetch, prefetch, 2014. http://support.microsoft.com/kb/2727880.
[CWM+14] Brendan Cully, Jake Wires, Dutch Meyer, Kevin Jamieson, Keir Fraser, Tim Deegan, Daniel Stodden, Geoffrey Lefebvre, Daniel Ferstay, and Andrew Warfield. Strata: High-performance scalable storage on virtualized non-volatile memory. In Proceedings of the 12th USENIX Conference on File and Storage Technologies, FAST'14, pages 17–31, Berkeley, CA, USA, February 2014. USENIX Association.
[CWO+11] Brad Calder, Ju Wang, Aaron Ogus, Niranjan Nilakantan, Arild Skjolsvold, Sam McKelvie, Yikang Xu, Shashwat Srivastav, Jiesheng Wu, Huseyin Simitci, Jaidev Haridas, Chakravarthy Uddaraju, Hemal Khatri, Andrew Edwards, Vaman Bedekar, Shane Mainali, Rafay Abbasi, Arpit Agarwal, Mian Fahim ul Haq, Muhammad Ikram ul Haq, Deepali Bhardwaj, Sowmya Dayanand, Anitha Adusumilli, Marvin McNett, Sriram Sankaran, Kavitha Manivannan, and Leonidas Rigas. Windows Azure Storage: A highly available cloud storage service with strong consistency. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles, SOSP'11, pages 143–157, New York, NY, USA, 2011. ACM.
[CZSL05] Ramesh Chandra, Nickolai Zeldovich, Constantine Sapuntzakis, and Monica S. Lam. The Collective: A cache-based system management architecture. In Proceedings of the 2nd USENIX Symposium on Networked Systems Design and Implementation, NSDI'05, pages 259–272, Berkeley, CA, USA, May 2005. USENIX Association.
[DB99] John R. Douceur and William J. Bolosky. A large-scale study of file-system contents. Proceedings of the 1999 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, 27:59–70, May 1999.
[DDL+11] Wei Dong, Fred Douglis, Kai Li, Hugo Patterson, Sazzala Reddy, and Philip Shilane. Tradeoffs in scalable data routing for deduplication clusters. In Proceedings of the 9th USENIX Conference on File and Storage Technologies, FAST'11, Berkeley, CA, USA, 2011. USENIX Association.
[DGH+09] Cezary Dubnicki, Leszek Gryz, Lukasz Heldt, Michal Kaczmarczyk, Wojciech Kilian, Przemyslaw Strzelczak, Jerzy Szczepkowski, Cristian Ungureanu, and Michal Welnicki. HYDRAstor: a scalable secondary storage. In Proceedings of the 7th USENIX Conference on File and Storage Technologies, FAST'09, pages 197–210, Berkeley, CA, USA, February 2009. USENIX Association.
[ELMS03] Daniel Ellard, Jonathan Ledlie, Pia Malkani, and Margo Seltzer. Passive NFS tracing of email and research workloads. In Proceedings of the 2nd USENIX Conference on File and Storage Technologies, FAST'03, pages 203–216, Berkeley, CA, USA, March 2003. USENIX Association.
[emIC+05] Michael Abd-El-Malek, William V. Courtright II, Chuck Cranor, Gregory R. Ganger, James Hendricks, Andrew J. Klosterman, Michael Mesnier, Manish Prasad, Brandon Salmon, Raja R. Sambasivan, Shafeeq Sinnamohideen, John D. Strunk, Eno Thereska, Matthew Wachs, and Jay J. Wylie. Ursa Minor: Versatile cluster-based storage. In Proceedings of the 4th USENIX Conference on File and Storage Technologies, FAST'05, pages 59–72, Berkeley, CA, USA, December 2005. USENIX Association.
[Fou13] The Apache Software Foundation. ZooKeeper does not recover from crash when disk was full, 2013. https://issues.apache.org/jira/browse/ZOOKEEPER-1621.
[GAM+07] Dennis Geels, Gautam Altekar, Petros Maniatis, Timothy Roscoe, and Ion Stoica. Friday: Global comprehension for distributed replay. In Proceedings of the 4th USENIX Symposium on Networked Systems Design and Implementation, NSDI'07, Berkeley, CA, USA, April 2007. USENIX Association.
[GGtL03] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, SOSP'03, pages 29–43, New York, NY, USA, 2003. ACM.
[GJSO91] David K. Gifford, Pierre Jouvelot, Mark A. Sheldon, and James W. O'Toole, Jr. Semantic file systems. In Proceedings of the 13th ACM Symposium on Operating Systems Principles, SOSP'91, pages 16–25, New York, NY, USA, 1991. ACM.
[GKA09] Ajay Gulati, Chethan Kumar, and Irfan Ahmad. Storage workload characterization and consolidation in virtualized environments. In 2nd International Workshop on Virtualization Performance: Analysis, Characterization, and Tools, VPACT'09, 2009.
[GM99] Burra Gopal and Udi Manber. Integrating content-based access mechanisms with hierarchical file systems. In Proceedings of the 3rd Symposium on Operating Systems Design and Implementation, OSDI'99, pages 265–278, Berkeley, CA, USA, 1999. USENIX Association.
[Gur14] Scott Guthrie. Windows Partner Conference (WPC). Keynote Presentation, July 2014.
[HBD+14] Tyler Harter, Dhruba Borthakur, Siying Dong, Amitanand Aiyer, Liyin Tang, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Analysis of HDFS under HBase: A Facebook messages case study. In Proceedings of the 12th USENIX Conference on File and Storage Technologies, FAST'14, pages 199–212, Berkeley, CA, USA, February 2014. USENIX Association.
[HDV+12] Tyler Harter, Chris Dragga, Michael Vaughn, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. A file is not a file: Understanding the I/O behavior of Apple desktop applications. ACM Transactions on Computer Systems (TOCS), 30(3):10:1–10:39, August 2012.
[HHS05] Hai Huang, Wanda Hung, and Kang G. Shin. FS2: Dynamic data replication in free disk space for improving disk performance and energy consumption. SIGOPS Operating Systems Review, 39(5):263–276, 2005.
[HMN+12] Danny Harnik, Oded Margalit, Dalit Naor, Dmitry Sotnikov, and Gil Vernik. Estimation of deduplication ratios in large data sets. In Proceedings of the 28th IEEE Conference on Mass Storage Systems and Technologies, MSST'12, pages 1–11, 2012.
[HS03] W. Hsu and A. J. Smith. Characteristics of I/O traffic in personal computer and server workloads. IBM Systems Journal, 42(2):347–372, April 2003.
[HSY05] Windsor W. Hsu, Alan Jay Smith, and Honesty C. Young. The automatic improvement of locality in storage systems. ACM Transactions on Computer Systems (TOCS), 23(4), November 2005.
[HZW+11] H. Howie Huang, Nan Zhang, Wei Wang, Gautam Das, and Alexander S. Szalay. Just-in-time analytics on large file systems. In Proceedings of the 9th USENIX Conference on File and Storage Technologies, FAST'11, pages 217–230, Berkeley, CA, USA, February 2011. USENIX Association.
[JHP+09] Weihang Jiang, Chongfeng Hu, Shankar Pasupathy, Arkady Kanevsky, Zhenmin Li, and Yuanyuan Zhou. Understanding customer problem troubleshooting from storage system logs. In Proceedings of the 7th USENIX Conference on File and Storage Technologies, FAST'09, pages 43–56, Berkeley, CA, USA, February 2009. USENIX Association.
[JM09] Keren Jin and Ethan L. Miller. The effectiveness of deduplication on virtual machine disk images. In Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference, SYSTOR'09, pages 7:1–7:12, New York, NY, USA, 2009. ACM.
[JWZ05] N. Joukov, T. Wong, and E. Zadok. Accurate and efficient replaying of file system traces. In Proceedings of the 4th USENIX Conference on File and Storage Technologies, FAST'05, pages 337–350, Berkeley, CA, USA, December 2005. USENIX Association.
[KDLT04] Purushottam Kulkarni, Fred Douglis, Jason LaVoie, and John M. Tracey. Redundancy elimination within large collections of files. In Proceedings of the USENIX Annual Technical Conference, ATEC'04, Berkeley, CA, USA, June 2004. USENIX Association.
[KJ08] Eric Koskinen and John Jannotti. BorderPatrol: Isolating events for black-box tracing. In Proceedings of the 3rd ACM SIGOPS/EuroSys European Conference on Computer Systems 2008, EuroSys'08, pages 191–203, New York, NY, USA, 2008. ACM.
[KLZ+07] Jonathan Koren, Andrew Leung, Yi Zhang, Carlos Maltzahn, Sasha Ames, and Ethan Miller. Searching and navigating petabyte-scale file systems based on facets. In Proceedings of the 2nd International Workshop on Petascale Data Storage, PDSW'07, pages 21–25, New York, NY, USA, 2007. ACM.
[KR10] Ricardo Koller and Raju Rangaswami. I/O deduplication: Utilizing content similarity to improve I/O performance. In Proceedings of the 8th USENIX Conference on File and Storage Technologies, FAST'10, Berkeley, CA, USA, February 2010. USENIX Association.
[KTGN10] Michael P. Kasick, Jiaqi Tan, Rajeev Gandhi, and Priya Narasimhan. Black-box problem diagnosis in parallel file systems. In Proceedings of the 8th USENIX Conference on File and Storage Technologies, FAST'10, pages 43–56, Berkeley, CA, USA, February 2010. USENIX Association.
[KU10] Erik Kruus and Cristian Ungureanu. Bimodal content defined chunking for backup streams. In Proceedings of the 8th USENIX Conference on File and Storage Technologies, FAST'10, Berkeley, CA, USA, February 2010. USENIX Association.
[LCGC12] Maohua Lu, David D. Chambliss, Joseph S. Glider, and Cornel Constantinescu. Insights for data reduction in primary storage: a practical analysis. In Proceedings of the 5th Annual International Systems and Storage Conference, SYSTOR'12, page 17, New York, NY, USA, June 2012. ACM.
[LCSZ04] Zhenmin Li, Zhifeng Chen, Sudarshan M. Srinivasan, and Yuanyuan Zhou. C-Miner: Mining block correlations in storage systems. In Proceedings of the 3rd USENIX Conference on File and Storage Technologies, FAST'04, Berkeley, CA, USA, March 2004. USENIX Association.
[LEB+09] Mark Lillibridge, Kave Eshghi, Deepavali Bhagwat, Vinay Deolalikar, Greg Trezise, and Peter Camble. Sparse indexing: Large scale, inline deduplication using sampling and locality. In Proceedings of the 7th USENIX Conference on File and Storage Technologies, FAST'09, pages 111–123, Berkeley, CA, USA, February 2009. USENIX Association.
[LEB13] Mark Lillibridge, Kave Eshghi, and Deepavali Bhagwat. Improving restore speed for backup systems that use inline chunk-based deduplication. In Proceedings of the 11th USENIX Conference on File and Storage Technologies, FAST'13, Berkeley, CA, USA, February 2013. USENIX Association.
[LGW+08] Xuezheng Liu, Zhenyu Guo, Xi Wang, Feibo Chen, Xiaochen Lian, Jian Tang, Ming Wu, M. Frans Kaashoek, and Zheng Zhang. D3S: Debugging deployed distributed systems. In Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation, NSDI'08, pages 423–437, Berkeley, CA, USA, April 2008. USENIX Association.
[LHW12] Duy Le, Hai Huang, and Haining Wang. Understanding performance implications of nested file systems in a virtualized environment. In Proceedings of the 10th USENIX Conference on File and Storage Technologies, FAST'12, page 8, Berkeley, CA, USA, February 2012. USENIX Association.
[LPGM08] Andrew W. Leung, Shankar Pasupathy, Garth Goodson, and Ethan L. Miller. Measurement and analysis of large-scale network file system workloads. In Proceedings of the USENIX Annual Technical Conference, ATEC'08, pages 213–226, Berkeley, CA, USA, June 2008. USENIX Association.
[LSAH11] Quoc M. Le, Kumar SathyanarayanaRaju, Ahmed Amer, and JoAnne Holliday. Workload impact on shingled write disks: All-writes can be alright. In Proceedings of the 2011 IEEE 19th Annual International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems, MASCOTS'11, pages 444–446, Washington, DC, USA, 2011. IEEE Computer Society.
[MAC+08] Dutch T. Meyer, Gitika Aggarwal, Brendan Cully, Geoffrey Lefebvre, Michael J. Feeley, Norman C. Hutchinson, and Andrew Warfield. Parallax: Virtual disks for virtual machines. In Proceedings of the 3rd ACM SIGOPS/EuroSys European Conference on Computer Systems 2008, EuroSys'08, New York, NY, USA, 2008. ACM.
[MB11] Dutch T. Meyer and William J. Bolosky. A study of practical deduplication. In Proceedings of the 9th USENIX Conference on File and Storage Technologies, FAST'11, Berkeley, CA, USA, February 2011. USENIX Association.
[McK03] Marshall Kirk McKusick. Enhancements to the fast filesystem to support multi-terabyte storage systems. In Proceedings of the BSD Conference 2003, BSDC'03, pages 9–9, Berkeley, CA, USA, 2003. USENIX Association.
[Mic09] Microsoft. Virtual hard disk image format specification, February 2009. http://technet.microsoft.com/en-us/virtualserver/bb676673.aspx.
[MJLF84] Marshall Kirk McKusick, William N. Joy, Samuel J. Leffler, and Robert S. Fabry. A fast file system for UNIX. ACM Transactions on Computer Systems (TOCS), 2:181–197, August 1984.
[NDR08] Dushyanth Narayanan, Austin Donnelly, and Antony Rowstron. Write off-loading: Practical power management for enterprise storage. In Proceedings of the 6th USENIX Conference on File and Storage Technologies, FAST'08, pages 17:1–17:15, Berkeley, CA, USA, February 2008. USENIX Association.
[NDT+08] D. Narayanan, A. Donnelly, E. Thereska, S. Elnikety, and A. Rowstron. Everest: Scaling down peak loads through I/O off-loading. In Proceedings of the 8th Symposium on Operating Systems Design and Implementation, OSDI'08, pages 15–28, Berkeley, CA, USA, 2008. USENIX Association.
[OCH+85] John Ousterhout, Herve Da Costa, David Harrison, John A. Kunze, Mike Kupfer, and James G. Thompson. A trace-driven analysis of the UNIX 4.2 BSD file system. In Proceedings of the 10th ACM Symposium on Operating Systems Principles, SOSP'85, pages 25–34, New York, NY, USA, 1985. ACM.
[Pee10] Daniel Peek. TrapperKeeper: The case for using virtualization to add type awareness to file systems. In Proceedings of the 2nd USENIX Conference on Hot Topics in Storage and File Systems, HotStorage'10, Berkeley, CA, USA, 2010. USENIX Association.
[PGR06] Ben Pfaff, Tal Garfinkel, and Mendel Rosenblum. Virtualization aware file systems: Getting beyond the limitations of virtual disks. In Proceedings of the 3rd USENIX Symposium on Networked Systems Design and Implementation, NSDI'06, pages 353–366, Berkeley, CA, USA, May 2006. USENIX Association.
[PL10] N. Park and D. J. Lilja. Characterizing datasets for data deduplication in backup applications. In Proceedings of the IEEE International Symposium on Workload Characterization, IISWC'10, 2010.
[PP04] Calicrates Policroniades and Ian Pratt. Alternatives for detecting redundancy in storage systems data. In Proceedings of the USENIX Annual Technical Conference, ATEC'04, Berkeley, CA, USA, June 2004. USENIX Association.
[Pra11] Ian Pratt. Co-founder of Bromium. Personal Communication, Nov 2011.
[QD02] Sean Quinlan and Sean Dorward. Venti: A new approach to archival data storage. In Proceedings of the 1st USENIX Conference on File and Storage Technologies, FAST'02, page 7, Berkeley, CA, USA, January 2002. USENIX Association.
[Rab81] Michael Rabin. Fingerprinting by random polynomials. Tech Report TR-CSE-03-01, Center for Research in Computing Technology, Harvard University, 1981.
[RBM13] Ohad Rodeh, Josef Bacik, and Chris Mason. BTRFS: The Linux B-tree filesystem. ACM Transactions on Storage (TOS), 9(3):9:1–9:32, August 2013.
[Riv92] R. Rivest. The MD5 message-digest algorithm, 1992. http://tools.ietf.org/rfc/rfc1321.txt.
[RKBH13] Kai Ren, YongChul Kwon, Magdalena Balazinska, and Bill Howe. Hadoop's adolescence: An analysis of Hadoop usage in scientific workloads. Proceedings of the VLDB Endowment, 6(10):853–864, August 2013.
[RLA00] Drew Roselli, Jacob R. Lorch, and Thomas E. Anderson. A comparison of file system workloads. In Proceedings of the USENIX Annual Technical Conference, ATEC'00, Berkeley, CA, USA, June 2000. USENIX Association.
[RSI12] Mark Russinovich, David Solomon, and Alex Ionescu. Developer Reference. Microsoft Press, 2012.
[SAF07] Ya-Yunn Su, Mona Attariyan, and Jason Flinn. AutoBash: Improving configuration management with operating system causality analysis. In Proceedings of the 21st ACM Symposium on Operating Systems Principles, SOSP'07, pages 237–250, New York, NY, USA, 2007. ACM.
[SAGL06] Mirko Steinle, Karl Aberer, Sarunas Girdzijauskas, and Christian Lovis. Mapping moving landscapes by mining mountains of logs: novel techniques for dependency model generation. In Proceedings of the 32nd International Conference on Very Large Data Bases, VLDB'06, pages 1093–1102. VLDB Endowment, 2006.
[Sat81] M. Satyanarayanan. A study of file sizes and functional lifetimes. In Proceedings of the 8th ACM Symposium on Operating Systems Principles, SOSP'81, pages 96–108, New York, NY, USA, 1981. ACM.
[Sch10] Erik Scholten. How to: Optimize guests for VMware View, July 2010. http://www.vmguru.nl/wordpress/2010/07/how-to-optimize-guests-for-vmware-view.
[SM09] Margo Seltzer and Nicholas Murphy. Hierarchical file systems are dead. In Proceedings of the 12th USENIX Conference on Hot Topics in Operating Systems, HotOS'09, Berkeley, CA, USA, 2009. USENIX Association.
[SMW+10] Mohammad Shamma, Dutch T. Meyer, Jake Wires, Maria Ivanova, Norman C. Hutchinson, and Andrew Warfield. Capo: Recapitulating storage for virtual desktops. In Proceedings of the 8th USENIX Conference on File and Storage Technologies, FAST'10, Berkeley, CA, USA, February 2010. USENIX Association.
[SPGR08] Mark A. Smith, Jan Pieper, Daniel Gruhl, and Lucas Villa Real. IZO: Applications of large-window compression to virtual machine management. In Proceedings of the 22nd Conference on Large Installation System Administration Conference, LISA'08, pages 121–132, Berkeley, CA, USA, 2008. USENIX Association.
[SS97] Keith A. Smith and Margo I. Seltzer. File system aging: increasing the relevance of file system benchmarks. Proceedings of the 1997 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, 25:203–213, June 1997.
[SZDR+11] Raja R. Sambasivan, Alice X. Zheng, Michael De Rosa, Elie Krevat, Spencer Whitman, Michael Stroucken, William Wang, Lianghong Xu, and Gregory R. Ganger. Diagnosing performance changes by comparing request flows. In Proceedings of the 8th USENIX Symposium on Networked Systems Design and Implementation, NSDI'11, Berkeley, CA, USA, 2011. USENIX Association.
[TBZS11] Vasily Tarasov, Saumitra Bhanage, Erez Zadok, and Margo Seltzer. Benchmarking file system benchmarking: It *is* rocket science. In Proceedings of the 13th USENIX Conference on Hot Topics in Operating Systems, HotOS'11, pages 9–9, Berkeley, CA, USA, 2011. USENIX Association.
[THKZ13] Vasily Tarasov, Dean Hildebrand, Geoff Kuenning, and Erez Zadok. Virtual machine workloads: The case for new benchmarks for NAS. In Proceedings of the 11th USENIX Conference on File and Storage Technologies, FAST'13, Berkeley, CA, USA, February 2013. USENIX Association.
[TJWZ] A. Traeger, N. Joukov, C. P. Wright, and E. Zadok. A nine year study of file system and storage benchmarking. ACM Transactions on Storage (TOS), 4(2):25–80.
[TKM+12] V. Tarasov, K. S. Kumar, J. Ma, D. Hildebrand, A. Povzner, G. Kuenning, and E. Zadok. Extracting flexible, replayable models from large block traces. In Proceedings of the 10th USENIX Conference on File and Storage Technologies, FAST'12, Berkeley, CA, USA, February 2012. USENIX Association.
[TMB+12] V. Tarasov, A. Mudrankit, W. Buik, P. Shilane, G. Kuenning, and E. Zadok. Generating realistic datasets for deduplication analysis. In Proceedings of the USENIX Annual Technical Conference, ATEC'12, Berkeley, CA, USA, June 2012. USENIX Association.
[UAA+10] Cristian Ungureanu, Benjamin Atkin, Akshat Aranya, Salil Gokhale, Stephen Rago, Grzegorz Cakowski, Cezary Dubnicki, and Aniruddha Bohra. HydraFS: A high-throughput file system for the HYDRAstor content-addressable storage system. In Proceedings of the 8th USENIX Conference on File and Storage Technologies, FAST'10, Berkeley, CA, USA, February 2010. USENIX Association.
[VKUR10] Akshat Verma, Ricardo Koller, Luis Useche, and Raju Rangaswami. SRCMap: Energy proportional storage using dynamic consolidation. In Proceedings of the 8th USENIX Conference on File and Storage Technologies, FAST'10, Berkeley, CA, USA, February 2010. USENIX Association.
[VMW10] VMWare. Virtual machine disk format (VMDK), July 2010. http://www.vmware.com/technical-resources/interfaces/vmdk.html.
[Vog99] Werner Vogels. File system usage in Windows NT 4.0. In Proceedings of the 17th ACM Symposium on Operating Systems Principles, SOSP'99, pages 93–109, New York, NY, USA, 1999. ACM.
[WA93] Randolph Y. Wang and Thomas E. Anderson. xFS: A wide area mass storage file system. Technical report, Berkeley, CA, USA, 1993.
[WDQ+12] Grant Wallace, Fred Douglis, Hangwei Qian, Philip Shilane, Stephen Smaldone, Mark Chamness, and Windsor Hsu. Characteristics of backup workloads in production systems. In Proceedings of the 10th USENIX Conference on File and Storage Technologies, FAST'12, Berkeley, CA, USA, February 2012. USENIX Association.
[WHADAD13] Zev Weiss, Tyler Harter, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. ROOT: Replaying multithreaded traces with resource-oriented ordering. In Proceedings of the 24th ACM Symposium on Operating Systems Principles, SOSP'13, pages 373–387, New York, NY, USA, 2013. ACM.
[WJK+05] C. P. Wright, N. Joukov, D. Kulkarni, Y. Miretskiy, and E. Zadok. Auto-pilot: A platform for system software benchmarking. In Proceedings of the USENIX Annual Technical Conference, ATEC'05, pages 175–187, Berkeley, CA, USA, April 2005. USENIX Association.
[WXH+04] Feng Wang, Qin Xin, Bo Hong, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, and Tyce T. Mclarty. File system workload analysis for large scale scientific computing applications. In Proceedings of the 21st IEEE / 12th NASA Goddard Conference on Mass Storage Systems and Technologies, MSST'04, pages 139–152, 2004.
[YBG+10] Neeraja J. Yadwadkar, Chiranjib Bhattacharyya, K. Gopinath, Thirumale Niranjan, and Sai Susarla. Discovery of application workloads from network file traces. In Proceedings of the 8th USENIX Conference on File and Storage Technologies, FAST'10, Berkeley, CA, USA, February 2010. USENIX Association.
[YMX+10] Ding Yuan, Haohui Mai, Weiwei Xiong, Lin Tan, Yuanyuan Zhou, and Shankar Pasupathy. SherLog: Error diagnosis by connecting clues from run-time logs. In Proceedings of the 15th Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems, ASPLOS'10, pages 143–154, New York, NY, USA, 2010. ACM.
[YZP+11] Ding Yuan, Jing Zheng, Soyeon Park, Yuanyuan Zhou, and Stefan Savage. Improving software diagnosability via log enhancement. In Proceedings of the 16th Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems, ASPLOS'11, pages 3–14, New York, NY, USA, 2011. ACM.
[ZLP08] Benjamin Zhu, Kai Li, and Hugo Patterson. Avoiding the disk bottleneck in the Data Domain deduplication file system. In Proceedings of the 6th USENIX Conference on File and Storage Technologies, FAST'08, pages 18:1–18:14, Berkeley, CA, USA, February 2008. USENIX Association.
[ZS99] Min Zhou and Alan Jay Smith. Analysis of personal computer workloads. In Proceedings of the 7th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS'99, pages 208–, Washington, DC, USA, 1999. IEEE Computer Society.

Appendix A

Selected Database SQL Queries

A.1 Mean File System Fullness

select 100 - (SUM(percentfree) / COUNT(*)) from scans

A.2 File Systems by Count of Files

select FileCount/1000, COUNT(*)
from scans group by FileCount/1000
order by FileCount/1000

A.3 Whole File Duplicates by File Extension

select top 50
a.extension, SUM(a.size), COUNT(*), AVG(a.size),
AVG(cast(b.CountOfDuplicates as bigint))
from files as a inner join
fileswithwholefileduplicates as b
on a.FileSerial = b.FileSerial
group by extension
order by SUM(a.size) desc
Appendix B

Scanner Output Format

-=-Per-System Data-=-
[Standard Stream] or [Backup Stream]
Username (hashed)
Hostname (hashed)
System Directory Location
Current Time
Operating System
Processor Architecture
Processor Level
Processor Revision
Number of Processors
Total Physical Pages
Size of Page File
Total Virtual Memory
Volume name
Volume serial number
Volume max component length
File system flags from GetVolumeInformation()
File system format from GetVolumeInformation()
Sectors per cluster
Bytes per sector
Free clusters
Total clusters
Contents of NTFS_VOLUME_DATA_BUFFER
File system Size
File System Version

-=-Per-File Data-=-
Directory name:length (hashed)
File name:length (hashed)
extension:0 hashed (hashed)
namespace depth
file size
file attributes flag (in hex)
file id
number of hard links to file
the reparse flag status (in hex)
Creation time
last access time
last modification time
{ If the file is non-linear on disk
SV:X => X=Location of start Vcn
V:X:L:Y => X=Location of next Vcn, Y=Lcn
...
V:X:L:Y => X=Location of next Vcn, Y=Lcn
}
{ If the file is sparse
A:X:Y => X=sparse allocation offset, Y=length
...
A:X:Y => X=sparse allocation offset, Y=length
}
Hash of data:chunk size
...
Hash of data:chunk size

-=-Log footer format-=-
LOGCOMPLETE
Completion time from time()
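To make the per-file record easier to work with programmatically, the sketch below transcribes the listing above into a typed structure. This is an illustrative reading of the format only: the field names, the Python types, and the hash_name helper are inferred from the listing rather than taken from the scanner's source, and the choice of MD5 in the helper is an arbitrary stand-in for whatever hashing the scanner actually applies to names.

import hashlib
from dataclasses import dataclass, field
from typing import List, Tuple

def hash_name(name: str) -> str:
    # Stand-in for the scanner's irreversible hashing of names and extensions;
    # the real algorithm is not specified by the listing above.
    return hashlib.md5(name.encode("utf-8")).hexdigest()

@dataclass
class PerFileRecord:
    # Fields mirror the "Per-File Data" listing, in order.
    directory_name_hash: str            # Directory name:length (hashed)
    directory_name_length: int
    file_name_hash: str                 # File name:length (hashed)
    file_name_length: int
    extension_hash: str                 # extension:0 hashed (hashed)
    namespace_depth: int
    file_size: int
    attribute_flags: int                # recorded in hex by the scanner
    file_id: int
    hard_link_count: int
    reparse_flags: int                  # recorded in hex by the scanner
    creation_time: int
    last_access_time: int
    last_modification_time: int
    retrieval_pointers: List[Tuple[int, int]] = field(default_factory=list)  # (Vcn, Lcn) pairs for non-linear files
    sparse_ranges: List[Tuple[int, int]] = field(default_factory=list)       # (offset, length) for sparse files
    chunk_hashes: List[Tuple[str, int]] = field(default_factory=list)        # (hash of data, chunk size)

if __name__ == "__main__":
    record = PerFileRecord(
        directory_name_hash=hash_name("\\Users\\example\\Documents"),
        directory_name_length=24,
        file_name_hash=hash_name("report.docx"),
        file_name_length=11,
        extension_hash=hash_name(".docx"),
        namespace_depth=3,
        file_size=24576,
        attribute_flags=0x20,
        file_id=1234567,
        hard_link_count=1,
        reparse_flags=0x0,
        creation_time=1280000000,
        last_access_time=1280001000,
        last_modification_time=1280000500,
    )
    print(record.file_size, record.file_name_hash[:8])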
