UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Storage system design for fast nonvolatile memories Wires, Jacob Taylor 2017

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


24-ubc_2017_november_wires_jacob.pdf [ 2.06MB ]
JSON: 24-1.0357126.json
JSON-LD: 24-1.0357126-ld.json
RDF/XML (Pretty): 24-1.0357126-rdf.xml
RDF/JSON: 24-1.0357126-rdf.json
Turtle: 24-1.0357126-turtle.txt
N-Triples: 24-1.0357126-rdf-ntriples.txt
Original Record: 24-1.0357126-source.json
Full Text

Full Text

Storage System Design for Fast Nonvolatile MemoriesbyJacob Taylor WiresM.Sc., Computer Science, The University of British Columbia, 2006B.Sc., Computer Engineering, The University of California at Santa Barbara, 2004A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFDoctor of PhilosophyinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Computer Science)The University of British Columbia(Vancouver)October 2017c© Jacob Taylor Wires, 2017AbstractNonvolatile memories are transforming the data center. Over the past decade, enter-prise flash has evolved to provide a thousand times more random-access throughputthan mechanical disks, with a thousand times lower latency and ten times more ca-pacity. These remarkable improvements completely reshape software concerns,allowing storage systems to take a more central role in dynamic resource manage-ment, but demanding that they do so with significantly lower overheads.This thesis presents several novel software techniques for managing high-densitystorage systems. In particular, it describes a probabilistic approach to workloadmodeling that provides guaranteed error bounds while dramatically reducing mem-ory overheads relative to existing state-of-the-art algorithms. It also documentsthe design and implementation of a storage controller that leverages dynamic con-straint satisfaction techniques to continually optimize data and network flow place-ment for performance, efficiency, and scale.These advances are presented within a broader design framework that provides aflexible and robust platform for managing all aspects of storage resource alloca-tion. Informed by experiences and insights gained over six years of building anenterprise scale-out storage appliance, it is based on three key ideas: light-weightabstraction to decouple logical resources from physical hardware, online analysisto capture workload requirements, and dynamic actuation to adjust allocations asrequirements change. Together, these capabilities allow storage software to dy-namically adapt to changing workload behavior and allow stored data to play amore active role in data center computing.iiLay SummaryMost data center storage systems were originally designed to manage mechanicaldisks, which are some of the slowest hardware components in general comput-ing. Enterprise flash devices and other nonvolatile memories have emerged overthe past decade that are so much faster than disks that existing storage softwaresimply cannot keep up. These devices call for new design approaches that provideefficient request processing to avoid costly performance penalties while also sup-porting dynamic resource management to ensure high hardware utilization. Thisthesis describes a system architecture and several novel software techniques thattogether provide this efficiency and dynamism, allowing application software tofully leverage the impressive capabilities of these new devices.iiiPrefaceChapters 3, 4, and 5 are versions of papers published at peer-reviewed academicconferences. They have been lightly edited for formatting.Chapter 3A version of Chapter 3 was published at FAST, the Usenix Conference on File andStorage Technologies, in 2014 [34]. This was a joint work with several authors. Asthe second author, I made significant contributions in building the system, evaluat-ing the results, and composing the manuscript.Chapter 4A version of Chapter 4 was published at OSDI, the Usenix Conference on Op-erating Systems Design and Implementation, also in 2014 [122]. As the primaryauthor, I set the research agenda, contributed to the implementation, evaluated theresults, and presented the work. My coauthors contributed to the implementationand manuscript composition.Chapter 5A version of Chapter 5 was published at FAST in 2017 [120]. I was the primaryauthor and researcher, responsible for all aspects of implementation and evaluation.ivTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiLay Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xList of Listings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiGlossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiiAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xviDedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 Nonvolatile Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 Strata: Scalable High-Performance Storage on Virtualized Non-VolatileMemory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14v3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143.2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173.2.1 Scope of this Work . . . . . . . . . . . . . . . . . . . . . 193.3 Data Paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203.3.1 The Virtual Address Map . . . . . . . . . . . . . . . . . . 223.3.2 Dispatch . . . . . . . . . . . . . . . . . . . . . . . . . . 233.3.3 Coherence . . . . . . . . . . . . . . . . . . . . . . . . . 243.4 Network Attached Disks . . . . . . . . . . . . . . . . . . . . . . 253.4.1 Network Integration . . . . . . . . . . . . . . . . . . . . 263.5 Online Reconfiguration . . . . . . . . . . . . . . . . . . . . . . . 263.5.1 Object Reconfiguration . . . . . . . . . . . . . . . . . . . 273.5.2 System Reconfiguration . . . . . . . . . . . . . . . . . . 293.6 Storage Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . 323.6.1 Scalable NFS . . . . . . . . . . . . . . . . . . . . . . . . 323.6.2 SDN Protocol Scaling . . . . . . . . . . . . . . . . . . . . 333.7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343.7.1 Test Environment . . . . . . . . . . . . . . . . . . . . . . 343.7.2 Baseline Performance . . . . . . . . . . . . . . . . . . . 353.7.3 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . 353.7.4 Node Failure . . . . . . . . . . . . . . . . . . . . . . . . 383.7.5 Protocol Overhead . . . . . . . . . . . . . . . . . . . . . 393.7.6 Effect of CPU on Performance . . . . . . . . . . . . . . . 403.8 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424 Characterizing Storage Workloads with Counter Stacks . . . . . . . 444.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464.3 Counter Stacks . . . . . . . . . . . . . . . . . . . . . . . . . . . 484.3.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . 494.3.2 LRU Stack Distances . . . . . . . . . . . . . . . . . . . . 504.4 Practical Counter Stacks . . . . . . . . . . . . . . . . . . . . . . 524.4.1 Downsampling . . . . . . . . . . . . . . . . . . . . . . . 53vi4.4.2 Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . 534.4.3 Probabilistic Counters . . . . . . . . . . . . . . . . . . . 544.4.4 LRU Stack Distances . . . . . . . . . . . . . . . . . . . . 554.5 The Counter Stack API . . . . . . . . . . . . . . . . . . . . . . . 564.5.1 On-disk Streams . . . . . . . . . . . . . . . . . . . . . . 564.5.2 Compute Queries . . . . . . . . . . . . . . . . . . . . . . 564.5.3 Time Slicing and Shifting . . . . . . . . . . . . . . . . . 574.5.4 Joining . . . . . . . . . . . . . . . . . . . . . . . . . . . 584.6 Error and Uncertainty . . . . . . . . . . . . . . . . . . . . . . . . 604.6.1 Counter Error . . . . . . . . . . . . . . . . . . . . . . . . 604.6.2 Downsampling Uncertainty . . . . . . . . . . . . . . . . 614.7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 624.7.1 Performance . . . . . . . . . . . . . . . . . . . . . . . . 624.7.2 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . 644.8 Workload Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 654.8.1 Combined Workloads . . . . . . . . . . . . . . . . . . . . 654.8.2 Erratic Workloads . . . . . . . . . . . . . . . . . . . . . 684.8.3 Conflicting Workloads . . . . . . . . . . . . . . . . . . . 684.8.4 Periodic Workloads . . . . . . . . . . . . . . . . . . . . . 694.8.5 Zipfian Workloads . . . . . . . . . . . . . . . . . . . . . 714.9 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 724.10 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 735 Mirador: An Active Control Plane for Datacenter Storage . . . . . . 755.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 755.2 A Control Plane for Datacenter Storage . . . . . . . . . . . . . . 775.3 Mirador . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 805.3.1 Observation . . . . . . . . . . . . . . . . . . . . . . . . . 815.3.2 Optimization . . . . . . . . . . . . . . . . . . . . . . . . 815.3.3 Actuation . . . . . . . . . . . . . . . . . . . . . . . . . . 885.3.4 Platform Support . . . . . . . . . . . . . . . . . . . . . . 885.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 905.4.1 Optimization . . . . . . . . . . . . . . . . . . . . . . . . 91vii5.4.2 Actuation . . . . . . . . . . . . . . . . . . . . . . . . . . 915.4.3 Resource Objectives . . . . . . . . . . . . . . . . . . . . 925.4.4 Workload Objectives . . . . . . . . . . . . . . . . . . . . 965.5 Experience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1025.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1055.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1076 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113viiiList of TablesTable 3.1 Random IO performance on Strata versus KNFS . . . . . . . . 35Table 3.2 Random IO performance with various CPU models . . . . . . . 40Table 4.1 Counter stack resource requirements . . . . . . . . . . . . . . 63Table 4.2 Modelling hit rates . . . . . . . . . . . . . . . . . . . . . . . . 68Table 5.1 Mirador objective functions . . . . . . . . . . . . . . . . . . . 89Table 5.2 Greedy solver runtime . . . . . . . . . . . . . . . . . . . . . . 90ixList of FiguresFigure 1.1 Schematic overview . . . . . . . . . . . . . . . . . . . . . . 8Figure 3.1 Strata network storage architecture . . . . . . . . . . . . . . . 15Figure 3.2 Hardware view of a Strata deployment . . . . . . . . . . . . . 19Figure 3.3 Virtual object to physical object range mapping . . . . . . . . 22Figure 3.4 IOPS over time, read-only workload . . . . . . . . . . . . . . 36Figure 3.5 IOPS over time, 80/20 R/W workload . . . . . . . . . . . . . 37Figure 3.6 IOPS over time, random placement . . . . . . . . . . . . . . . 38Figure 3.7 Aggregate bandwidth during failover and recovery . . . . . . 39Figure 4.1 The counter stack library architecture . . . . . . . . . . . . . 57Figure 4.2 The counter stack join operation . . . . . . . . . . . . . . . . 59Figure 4.3 Computing stack distances . . . . . . . . . . . . . . . . . . . 62Figure 4.4 Sample miss ratio curves . . . . . . . . . . . . . . . . . . . . 66Figure 4.5 Combined miss ratio curves . . . . . . . . . . . . . . . . . . 67Figure 4.6 Counter stack fidelity . . . . . . . . . . . . . . . . . . . . . . 67Figure 4.7 Time-sliced miss ratio curves . . . . . . . . . . . . . . . . . . 69Figure 4.8 Modelling shared caches . . . . . . . . . . . . . . . . . . . . 70Figure 4.9 Modelling workload footprints . . . . . . . . . . . . . . . . . 71Figure 4.10 Synthetic miss ratio curves . . . . . . . . . . . . . . . . . . . 72Figure 5.1 The Mirador system architecture and rebalance pipeline . . . 79Figure 5.2 Rebuilding replicas after a device failure . . . . . . . . . . . . . 93Figure 5.3 Performance under three different placement policies . . . . . 94xFigure 5.4 Performance during cluster reconfiguration . . . . . . . . . . 95Figure 5.5 Footprint-aware placement . . . . . . . . . . . . . . . . . . . 98Figure 5.6 Noisy neighbor isolation . . . . . . . . . . . . . . . . . . . . 100Figure 5.7 Workload co-scheduling . . . . . . . . . . . . . . . . . . . . 101Figure 5.8 Optimization time versus objects inspected . . . . . . . . . . . . 102Figure 5.9 Violations observed versus objects inspected . . . . . . . . . 103xiList of ListingsListing 5.1 Load balancing rule . . . . . . . . . . . . . . . . . . . . . . 84Listing 5.2 Hardware redundancy rule . . . . . . . . . . . . . . . . . . . 84Listing 5.3 Greedy solver implementation . . . . . . . . . . . . . . . . . 87xiiGlossaryAPI application programming interfaceCPU central processing unitCRUD create, read, update, deleteDIMM dual inline memory moduleFCoE fibre channel over ethernetFIO flexible IO testerHLL hyperloglog cardinality sketchIO input/outputIOPS IO operations per secondIP internet protocolIPMI intelligent platform management interfaceiSCSI internet SCSIJVM java virtual machineLRU least-recently used cache replacement algorithmLSN log serial numberxiiiM.2 specification for internally mounted expansion cards and connectorsMRC miss ratio curveNAD network attached diskNAND negative-and logic gateNFS network file system protocolNIC network interface controllerNUMA non-uniform memory accessNVDIMM nonvolatile DIMMNVMe nonvolatile memory expressOSD object storage devicePCIe peripheral component interconnect expressPCM phase change memorypNFS parallel network file system protocolRAM random access memoryRPC remote procedure callRTT round-trip timeSAN storage area networkSAS serial attached SCSISATA serial AT attachmentSCSI small computer systems interfaceSDN software-defined networkingSLA service level agreementxivSMART self-monitoring, analysis, and reporting technologySMB server message block, aka common internet file systemSSD solid state driveTCP transmission control protocolVLAN virtual local area networkVM virtual machineVMDK virtual machine diskVMM virtual machine monitorxvAcknowledgmentsNone of this work would have been possible without the help of my advisor, An-drew Warfield, who has taught me a great many things about technology, business,and life. His generosity, optimism, and enthusiasm are a wonderful inspiration,and I look forward to many more years of collaboration together.I owe many thanks to my committee members Norm Hutchinson and Bill Aiello,who offered invaluable advice and support throughout my studies.Finally, I would like to thank Stephen Ingram, Nick Harvey, Daniel Stodden, KalanMacrow, Mihir Nanavati, and all my past and present colleagues at Coho Data. Thelaughter and discoveries shared over the years working with this team have madeall the effort worthwhile.xviDedicationTo my family, for all their love and support.xviiChapter 1IntroductionThis thesis documents more than five years of experience building an enterprisestorage appliance. The period it describes was one of remarkable hardware innova-tion, in which nonvolatile memories improved on the performance of mechanicaldisks by more than three orders of magnitude. It is a rare and exciting thing forcomputer scientists to witness such a dramatic transformation in such a short time.In this case, however, the advances came with a new set of problems, as existingsoftware techniques proved ill-equipped to fully leverage new hardware capabili-ties. This work details some of the key challenges we faced in building a storagesystem designed specifically for fast, nonvolatile memories. In particular, it de-scribes a system architecture suitable for new low-latency devices, and it presentsseveral novel software techniques for obtaining high utility from these devices. Itadditionally presents a model for system design that is broadly applicable to manyservices within the data center. This model can be summarized by three key capa-bilities: light-weight abstraction of hardware to provide programmatic control ofresource allocation, online analysis of workload behavior to provide insight intoperformance requirements, and dynamic actuation to adjust resource allocationsas requirements change. Together, these capabilities yield robust, flexible systems.The chapters that follow, which are edited versions of published conference papers,detail how these capabilities are implemented in our production storage system.1For much of the history of computing, persistent storage has been provided by me-chanical devices. From index cards to magnetic tape and finally rotating platters,the operating constraints of persistent media are limited by the physics of mo-tion in a way that transistor-based technologies like RAM and CPUs are not. Thedifference is significant: while it takes roughly 100 nanoseconds to read a bytefrom main memory in a modern system, it takes nearly 10 milliseconds (i.e., about100,000 times longer) to read the same byte from an enterprise-class disk. Thisvast disparity is commonly known as the IO gap, and it underpins many of thedesign assumptions in modern systems, affecting everything from the caching andprefetching strategies of operating systems to the on-disk layouts of file systems.In the context of data center storage systems, this disparity has motivated aggre-gated designs that place many disks behind a single network-accessible controller.While this approach is suitable for spinning disks, it is a poor fit for new nonvolatilememories. Enterprise NVMe flash drives today can serve sequential read workloadsat rates of 2.8 GiB per second, and they will only grow faster as hardware paral-lelism continues to increase. At these rates, the PCIe channel capacity provided bymodern CPUs becomes a limiting factor, making single-head controllers impracti-cal. At the same time, commodity NVMe drives exhibit access latencies as low as20 microseconds, and NVDIMM-based alternatives like Intel’s 3DXPoint reduce la-tencies even further. This exposes significant software inefficiencies as overheadsthat were once negligible compared to the millisecond response times of mechani-cal disks dominate overall request processing times on modern hardware.Chapter 3 details our solution to these problems. Strata is a network-attached stor-age system that eliminates network and controller bottlenecks by presenting flashdevices directly over the network and distributing controller logic across multiplecompute nodes. Strata employs three key techniques to make this possible. First,it virtualizes storage devices, providing an idealized software interface that ex-poses sparse, virtual address spaces, allowing safe sharing among multiple clients.Second, it cleanly separates addressing from placement. Address resolution isdelegated to clients via a lightweight library that provides direct access to virtualaddress spaces, while a centralized placement service controls the assignment ofvirtual address spaces to physical devices. This provides a decentralized, low-2overhead data path while allowing coordinated responses to load imbalances andhardware configuration changes. Finally, Strata leverages software defined net-working to provide a scalable protocol presentation layer in support of legacy datacenter clients.Strata provides the infrastructure required to support well-balanced deployments ofstorage, network, and compute, which is key to preventing any one of these com-ponents from becoming a bottleneck. It also augments native hardware interfaceswith just enough software functionality to support safe multiplexing and dynamicresource allocation without introducing prohibitive overheads, much in the sameway that CPU virtualization makes it possible to efficiently share expensive proces-sors among multiple independent compute tasks. As a result, Strata performancescales linearly with the number of available devices. However, the decentralizedarchitecture required to achieve this scalability adds significant engineering com-plexity, introducing the need for robust consensus and fencing mechanisms, amongother things. Building this functionality into an enterprise storage product was aconsiderable undertaking that took hundreds of developer-years to complete.By design, Strata eliminates performance bottlenecks through balanced hardwareprovisioning. The monetary cost of this provisioning, however, is beyond Strata’scontrol, and the market price of NVMe devices is such that a single card can costas much as the combined network and compute resources with which it is pack-aged. In other words, because the manufacturing processes of different componentsadvance at different rates, provisioning systems with balanced performance capa-bilities can lead to significant imbalances in component costs. The best way tomitigate the effect of this imbalance on the overall cost of the system is to avoidover-provisioning expensive devices and to ensure that they achieve high utiliza-tion. This turns out to be a challenging problem for storage systems because uti-lization must be measured across many orthogonal dimensions, including storageand network capacity, processing power, and device queue depths.To take just one example, storage workloads frequently exhibit access patterns thatare heavily skewed towards a small proportion of their overall data sets. Serv-ing such workloads entirely from expensive solid state devices incurs a high op-3portunity cost, because much of the devices’ capabilities may be wasted storingidle data. In these cases, it may be preferable to place cold, infrequently-accesseddata on cheaper, more capacious media in order to make space for hot data fromother workloads. The ability to split workloads across heterogeneous devices inthis manner makes it possible to allocate performance and capacity resources in-dependently, which in turn presents opportunities for improving the utilization ofhigh-performance media. In fact, these opportunities exist across the entire mem-ory hierarchy, which will continue to combine devices with dramatically differentperformance, capacity, and cost characteristics long after mechanical disks becomea thing of the past. For example, technologies like NVDIMM, 3D XPoint, and phasechange memory present trade-offs in price, latency, and capacity that are almost assubstantial as those between SSDs and disks.Determining the optimal allocation for a particular workload is not trivial, however.To begin with, it is difficult to arrive at a wholly satisfying definition of ‘optimal’in this context. But setting that aside for the moment, even simply identifying hotdata can be computationally expensive. Given modern hardware, a single stor-age workload may be capable of generating hundreds of thousands of requests persecond across billions of unique addresses – and it may remain active for weeks,months, or even years on end. The effort involved in analyzing such voluminousrequest streams can be immense. Indeed, shortly after we began investigating howwe might improve flash utilization in Strata, we ran into exactly this problem: ap-plying classical stack distance analysis techniques to a week-long storage trace ofjust a handful of machines required roughly an hour of compute time and 92 GiBof RAM. This was inconvenient for our research, but downright prohibitive for usein production.Chapter 4 presents a novel locality analysis technique we developed that is compu-tationally tractable even for very large workloads. The technique leverages prob-abilistic data structures called counter stacks to enable approximate LRU stackdistance analysis with sublinear memory overheads. This represents a significantimprovement over the previous state of the art, allowing us to analyze the above-mentioned week-long trace in under twenty minutes with just 80 MiB of RAM (lessthan a tenth of a percent of what was previously required). Counter stacks improve4on existing analysis techniques in two additional ways. First, they make it possi-ble to analyze how workload locality changes over time, revealing phase changesand other temporal patterns that can be exploited to improve performance and ef-ficiency. Second, they allow us to model how workloads interact with each otheron shared storage. For example, we can quickly calculate the degree to whichunrelated workloads would interfere with each other if placed on the same de-vice, allowing us to make informed decisions when distributing data across storagenodes.Counter stacks get their name from the cardinality counters they rely on to trackdata references over time. By combining knowledge of the unique addresses ac-cessed over various time windows with a record of the total number of requestsover the same windows, counter stacks give an indication of a workload’s temporallocality. For the sake of practicality, counter stacks use approximate counters thatbelong to a class of streaming algorithms and data structures designed for process-ing very large data sets. Streaming algorithms often trade accuracy for efficiencyand are well-suited for scenarios where imperfect results can be tolerated. Forexample, they enable efficient estimation of join sizes when optimizing databasequeries, and they support anomaly detection of traffic patterns in large networks.We suspect that with a bit of creativity, they will prove useful for analyzing andtuning many other aspects of system performance in unforeseen ways; indeed,cardinality sketches form the basis of another data structure we recently devel-oped for measuring implicit sharing among copy-on-write snapshots. Investigat-ing further opportunities for integrating streaming algorithms into high-throughput,performance-sensitive systems is a promising direction for future research.Counter stacks enable a degree of continuous workload analysis that was previ-ously impractical. The visibility they provide into capacity and performance re-quirements help to inform decisions about how to distribute data across heteroge-neous devices. However, these are just two of a large number of criteria that mustbe considered when allocating system resources. Other salient examples includethe need to maintain hardware redundancy for replicated data and the desire to bal-ance network load across available links. In fact, in a disaggregated, heterogeneoussystem like Strata, deciding how best to accommodate all of these objectives is in5itself a challenging problem, especially since, in some circumstances, one objectivemay directly contradict another. This is an important problem, however, becausepoor decisions can have calamitous effects on system performance, reliability, andefficiency.Chapter 5 describes Mirador, a dynamic resource management service designedfor network-attached storage systems. Informed by the detailed profiling and anal-ysis made possible by counter stacks and leveraging the device and protocol virtu-alization provided by Strata, Mirador strives to achieve high hardware utilizationand system availability by dynamically migrating workloads in response to chang-ing requirements. In many ways, it can be likened to the centralized controllersof traditional aggregated storage systems: it maintains a global view of resourceutilization and workload behavior and it provides a unified control path for manag-ing resource allocation. It is more sophisticated than typical controllers, however,because its purview includes the entire back-end storage network: it controls howclient connections are routed to storage servers as well as how data is placed onavailable devices. Moreover, it takes an unconventionally proactive approach toresource management, continually seeking adjustments that might improve per-formance and utility. This approach is made feasible by the high random-accessbandwidth of solid state devices, which dramatically lowers the performance costof migrating data relative to spinning disks.Managing the many moving parts of a large system like Strata is complicated. Evensimply formulating an allocation policy that is suitable for all possible contingen-cies is a challenging task. The configuration space is large and multidimensional,and attempting to anticipate every potential corner case is time-consuming anderror-prone. Mirador addresses this complexity by providing a framework for cod-ifying policies as a collection of simple, independent objective functions, each ofwhich describes an allocation strategy for a single resource. Objective functionsare assigned numerical costs that induce a priority ordering for situations wherenot all goals can be met. Mirador combines these objective functions with es-tablished optimization techniques to efficiently search the configuration space forpreferable alternatives while maintaining the invariants necessary to guarantee re-siliency. This approach has a number of appealing properties. It allows domain ex-6perts to define specific allocation goals without prescribing how the system shouldbehave as a whole. It naturally supports incremental updates to allocation policiesso that new classes of workloads and hardware can be more easily accommodated.And it makes the system more robust to workload hot spots and hardware faults byfacilitating continuous, dynamic optimization.Strata’s design eliminates network and controller hardware as performance bottle-necks, but it cannot eliminate the more general problem of resource scarcity. Forexample, as deployments expand across racks, top-of-rack switching becomes alimited resource that must be allocated frugally. Mirador addresses this particularproblem by leveraging its knowledge of the relative availability of local and remotebandwidth (as codified by policy objective functions) to coordinate the migrationof data and client connections in order to avoid cross-rack traffic. But more im-portant than the specific balance that Strata and Mirador strike with the currentgeneration of hardware is the support they provide for dynamically responding toresource scarcity and workload imbalances in general.Figure 1.1 presents a schematic overview of how the three components describedin this thesis work together to achieve this level of dynamism. By providingcarefully-considered software abstractions – both in virtualizing hardware to de-couple logical resources from the underlying physical devices, and in cleanly sepa-rating control- and data-path logic – Strata provides flexible, programmatic controlof storage and network resources. By enabling efficient, accurate working set anal-ysis techniques, counter stacks provide insight into the performance and capacityrequirements of client workloads in live systems. And by leveraging these capabil-ities to actuate system-level responses to shifting resource consumption, Miradoris able to continuously optimize for performance, efficiency, and reliability.Indeed, the central claim of this thesis is that data center storage systems – andmost data center services in general – should be carefully designed to enable flex-ible, robust, and dynamic responses to changes in both workload behavior andhardware capabilities. In a large, multi-tenant environment like a data center, di-verse and varying workload behavior is inevitable. And as evidenced by the revo-lutionary advances of storage devices over the past few years, even long-standing7DATA PATH CONTROL PATHStrata (FAST '14) - decentralized architecture - client-side addressing - coordinated placement - scalable protocol layerCounter Stacks (OSDI '14) - light-weight profiling - efficient workload modelling - accurate, compact record    of working set size over timeMirador (FAST '17) - storage network controller - programmatic, multi-   dimensional policy goals - continuous optimization via   dynamic searchStrata (FAST '14) - virtualized NVMe - isolated address spacesDecibel (NSDI '17) - baseline 20s write latency - performance SLOsFigure 1.1: Schematic overview of system contributions (see § 1.1 and Chapter 6 formore details about Decibel)assumptions about the relative performance of hardware components must be re-considered from time to time. Consequently, robust systems should be capable ofautomatically adapting to changes across all of these dimensions, both at the scaleof scheduling epochs and hardware life cycles. We demonstrate in the follow-ing chapters how abstraction, analysis, and actuation can be combined to providethis responsiveness in a decentralized storage system with exacting performancerequirements. More generally, we believe that these capabilities provide a soundframework for a broad class of data center services as workloads and hardwarecontinue to evolve.1.1 PublicationsThe work presented in this thesis is based on a selection of three closely-relatedpublications. Below I present the complete list of research papers to which I con-tributed over the course of my studies.8Strata is our scale-out storage architecture for fast nonvolatile memories [34].Mirador codifies allocation policies as individual objective functions and uses es-tablished constraint satisfaction techniques to continually optimize the placementof data and network flows in Strata [120].Decibel extends the work of Strata to present a new volume abstraction that man-ages compute and network resources to provide contention-free request processingfor ultra low-latency devices [85].Counter Stacks are a novel probabilistic data structure for recording working setsover time. They enable the calculation of miss ratio curves with sublinear memoryoverheads, a dramatic improvement over the previous state of the art [122].Approximating Hit Rate Curves Using Streaming Algorithms is a companionpaper that presents a thorough analysis of the accuracy and computational com-plexity of counter stacks [41].Ownership Sketches are a novel data structure, inspired by counter stacks, thatenable efficient tracking of implicit sharing in copy on write snapshots [123].MapFS explores the possibility of exposing file system address space mappingsto userspace by providing efficient splice operations on file data [121]. This workhelped motivate the flexible addressing schemes provided by Strata.Capo demonstrates how local disks can be used as client-side caches to reduceload on shared storage servers [101]. It also stands as an early example of thedata-driven design approach that ultimately led to counter stacks.Dovetail presents a framework for safely upgrading on-disk data structures in cloudstorage platforms while minimizing the impact on client workloads [82]. Its sup-port for non-disruptive system reconfiguration was a precursor to the dynamic re-source management provided by Mirador.Block Mason is a virtual block device framework that supports modular, stackableuserspace implementations for enhanced flexibility and portability [81]. Its com-posable data path served as a model for Strata’s request dispatching architecture.9Chapter 2Nonvolatile MemoryThe reign of spinning disks as the predominant technology in enterprise storage iscoming to an end. While hard drives have stagnated because of physical limitationson rotational speed, nonvolatile technologies like NAND flash and phase changememory (PCM) have flourished, bridging the gap between RAM and disk. NANDflash, long common in cameras and mobile phones, has recently become a viablealternative to magnetic disks thanks to dramatic improvements in performance, re-liability, and affordability: enterprise flash devices today provide random-accessthroughput that is a thousand times greater than mechanical disks at latencies thatare a thousand times lower, while remaining cost-competitive with their rotatingcounterparts. Emerging technologies like phase change and spin-transfer torquememories avoid transistor scaling difficulties by using entirely different physical-chemical mechanisms to provide bit storage, promising additional performanceand endurance improvements. The impact of these innovations can be broadly cat-egorized according to three trends, each of which is changing the data center inimportant ways. First, increased performance density has effectively inverted theIO gap, violating many of the assumptions behind conventional storage designs.Second, increased capacity density is placing new demands on device connectiv-ity and raising serious concerns about failure recovery times. And third, reducedpower and space requirements are altering the physical and logistical constraints10of hardware deployment. These advances solve many long-standing problems instorage design, but they also present new challenges.Perhaps the most remarkable characteristic of nonvolatile memories is their radicalperformance density. The difference relative to magnetic disks in this regard is re-ally one of kind rather than degree. By eliminating the mechanical component ofstorage hardware, solid state technologies remove the single largest contributor tothe IO gap. This reduces access times in absolute terms, but, more importantly, itdoes away with the armature movement and rotational latency that together imposeadditive, millisecond-granularity penalties under random-access workloads. Thismakes spatial locality much less important than it used to be, allowing nonvolatilememory to be virtualized without significant performance degradation. Indeed,flash firmware does just this in the translation layers that manage erase cycles andprovide wear levelling, as does storage software in providing deduplication to in-crease effective capacity. High random-access throughput also makes it feasibleto migrate data to balance load and improve efficiency without affecting primaryworkloads, allowing stored data to play a much more active role in computation.This increase in performance density has been enabled in part by the migration ofstorage devices onto wider, faster interfaces as nonvolatile memories have movedfrom SAS and SATA buses to PCIe and DIMM alternatives that provide higher through-put at much lower latency. In fact, the latency of modern PCIe flash devices is solow that avoiding processing overhead has become a significant challenge. A simi-lar trend has emerged in networking software as commodity Ethernet transmissionrates have increased by a thousandfold, from 100 megabits per second in 1995 to100 gigabits per second today. But the transformation has been more profound forstorage systems, which have become a million times faster over the same period.This has important consequences for storage software, which can no longer as-sume that processing is effectively free but must instead contend with nanosecondrequest deadlines.In addition to offering lower latency, the increased parallelism of PCIe devicesmakes it easier to share hardware among many workloads. The NVMe specifica-tion allows up to 65,000 queues per device, and while few vendors actually provide11anywhere near this many, most offer enough to make it possible to completely parti-tion data structures, interrupt handling, and request processing across cores withoutany need for thread synchronization. Exposing this parallelism directly to softwarelayers inverts the traditional model of a single elevator scheduler mediating accessto multiple spindles, but it is crucial for achieving full device saturation.The increasing capacity density of nonvolatile memories, while perhaps less spec-tacular than the coincident performance improvements, is also changing the datacenter in important ways. Although magnetic disk capacity has plateaued at about10 TB, with shingled magnetic recording offering further marginal improvementssuitable primarily for read-mostly workloads, nonvolatile memory capacity is in-creasing at a steady pace. Transistor-based media like NAND flash continue to ben-efit from die process improvements that lead to smaller, denser memories at a rateroughly in line with Moore’s Law, even as processors have already pushed thesegains close to the limit. Additional innovations like multi-level cells on NAND flashand the three-dimensional circuit layouts of 3D XPoint lead to even further capac-ity gains, although often at the expense of increased bit error rates and reducedwrite performance. Thanks to these advances, the single-device capacity of non-volatile memories will soon surpass that of spinning disks by more than an orderof magnitude.This capacity density poses new challenges, however. Modern enterprise CPU mi-croarchitectures like Intel’s Skylake series provide around 40 PCIe lanes per con-troller hub. In a typical network-facing server, roughly half of these might bededicated to NICs, leaving only twenty lanes to connect to storage hardware. Thismeans that PCIe switching is needed to host even a moderate number of NVMedevices (which consume four lanes each) in a single machine. Under these con-straints, PCIe throughput quickly becomes an issue as device capacity scales. Giventheir compact size, it is not unreasonable to imagine hosting upwards of 32 flashcards in a single 1U server; at 128 TB per device, this would mean exposing 4 PBof storage over PCIe lanes with a combined throughput of just 200 GB/sec, severelylimiting overall data access rates. This problem extends beyond individual hostsas well: links between top-of-rack switches are typically oversubscribed at a ratioof three or four to one, presenting another potential bottleneck. Capacious devices12also present new difficulties in maintaining data resiliency. Repairing redundancywhen multi-terabyte devices fail can be time-consuming, increasing exposure topermanent data loss. These factors put new demands on storage software, whichmust place data carefully to mitigate transport bottlenecks and arrange for fast re-covery times.These performance and capacity gains come with significant reductions in powerconsumption and physical size. Because nonvolatile memories contain no mov-ing parts, they consume only around half a watt when idle, and twice that underload. Rotating disks, on the other hand, consume around 5 watts per spindle whenidle, and twice that again under load. Furthermore, opportunistic efforts to reducepower consumption by powering down idle disks are generally impractical withoutadvanced knowledge of workload patterns, because disks take seconds to spin backup. Combined with the fact that large arrays of spindles are typically required toachieve even moderate random-access performance, this places tremendous bur-dens on data center power and cooling systems. At the same time, nonvolatilememories can be packaged much more compactly than spinning platters. For ex-ample, thanks to new form factors like M.2, it is reasonable to imagine a single 2Uenclosure providing adequate flash storage for an entire rack of machines, replacingmany hundreds of disks in so doing. Along with the performance and capacity den-sity described above, this physical compactness makes disaggregated architecturesa much more natural fit for nonvolatile memories than traditional SAN designs.Nonvolatile memories provide orders of magnitude more performance and capac-ity than mechanical disks while requiring substantially less power and physicalspace. In so doing, they completely reshape storage software concerns. Ratherthan batching requests to avoid seeks, software must restrict processing times tomicroseconds or less. Rather than aggregating disk arrays to increase parallelism,systems need to expose individual device queues with minimal cross-core syn-chronization. Rather than uniformly distributing data across spindles, controllersshould consider migrating data in response to load imbalances and hot spots. Inshort, storage software needs to become significantly more efficient, flexible, anddynamic if it is to fully realize the potential of these exciting new hardware tech-nologies.13Chapter 3Strata: ScalableHigh-Performance Storage onVirtualized Non-Volatile MemoryA version of this chapter was published at the 12th USENIX Conference on Fileand Storage Technologies in 2014 [34].3.1 IntroductionFlash-based storage devices are fast, expensive, and demanding: a single deviceis capable of saturating a 10Gb/s network link (even for random IO), consumingsignificant CPU resources in the process. That same device may cost as much as(or more than) the server in which it is installed1. The cost and performance char-acteristics of fast, non-volatile media have changed the calculus of storage systemdesign and present new challenges for building efficient and high-performance dat-acenter storage.1Enterprise-class PCIe flash drives in the 1TB capacity range currently carry list prices in therange of $3-5K USD. Large-capacity, high-performance cards are available for list prices of up to$160K.14This chapter describes the architecture of a commercial flash-based network-attachedstorage system, built using commodity hardware. In designing the system aroundPCIe flash, we begin with two observations about the effects of high-performancedrives on large-scale storage systems. First, these devices are fast enough that inmost environments, many concurrent workloads are needed to fully saturate them,and even a small degree of processing overhead will prevent full utilization. Thus,we must change our approach to the media from aggregation to virtualization. Sec-ond, aggregation is still necessary to achieve properties such as redundancy andscale. However, it must avoid the performance bottleneck that would result fromthe monolithic controller approach of a traditional storage array, which is designedaround the obsolete assumption that media is the slowest component in the system.Further, to be practical in existing datacenter environments, we must remain com-patible with existing client-side storage interfaces and support standard enterprisefeatures like snapshots and deduplication.Device Virtualization Layer (§4)Network Attached Disks (NADs) Responsibility: Virtualize a PCIe ash device into multiple address spaces and allow direct client access with controlled sharing. Protocol Virtualization Layer (§6)Scalable Protocol Presentation Responsibility: Allow the transparently scalable implementation of traditional IP- and Ethernet-based storage protocols.  Scalable NFSv3 Presents a single external NFS IP address, integrates with SDNswitch to transparently scale and manage connections acrosscontroller instances hosted on each microArray.CLOS (Coho Log-structured Object Store) Implements a !at object store, virtualizing the PCIe !ash device’s address space and presents an OSD-like interface toclients.libDataPath NFSv3 instance on each microarray links as a dispatch library.Data path descriptions are read from a cluster-wide registryand instantiated as dispatch state machines.  NFS forwards requests through these SMs, interacting directly with NADs.Central services update data paths in the face of failure, etc.Global Address Space Virtualization Layer (§3,5)Delegated Data Paths Responsibility: Compose device level objects into richer storage primitives.  Allow clients to dispatch requests directly to NADs while preserving centralized control over placement, recon!guration, and failure recovery. Layer name, core abstraction, and responsibility: Implementation in Strata: Figure 3.1: Strata network storage architectureIn this chapter we explore the implications of these two observations on the designof a scalable, high-performance NFSv3 implementation for the storage of virtualmachine images. Our system is based on the building blocks of PCIe flash in com-modity x86 servers connected by 10 gigabit switched Ethernet. We describe twobroad technical contributions that form the basis of our design:1. A delegated mapping and request dispatch interface from client data to phys-ical resources through global data address virtualization, which allows clients15to directly address data while still providing the coordination required foronline data movement (e.g., in response to failures or for load balancing).2. SDN-assisted storage protocol virtualization that allows clients to address asingle virtual protocol gateway (e.g., NFS server) that is transparently scaledout across multiple real servers. We have built a scalable NFS server usingthis technique, but it applies to other protocols (such as iSCSI, SMB, andFCoE) as well.At its core, Strata uses device-level object storage and dynamic, global address-space virtualization to achieve a clean and efficient separation between controland data paths in the storage system. Flash devices are split into virtual addressspaces using an object storage-style interface, and clients are then allowed to di-rectly communicate with these address spaces in a safe, low-overhead manner. Inorder to compose richer storage abstractions, a global address space virtualizationlayer allows clients to aggregate multiple per-device address spaces with mappingsthat achieve properties such as striping and replication. These delegated addressspace mappings are coordinated in a way that preserves direct client communi-cations with storage devices, while still allowing dynamic and centralized controlover data placement, migration, scale, and failure response.Serving this storage over traditional protocols like NFS imposes a second scalabil-ity problem: clients of these protocols typically expect a single server IP address,which must be dynamically balanced over multiple servers to avoid being a perfor-mance bottleneck. In order to both scale request processing and to take advantageof full switch bandwidth between clients and storage resources, we developed ascalable protocol presentation layer that acts as a client to the lower layers of ourarchitecture, and that interacts with a software-defined network switch to scale theimplementation of the protocol component of a storage controller across arbitrar-ily many physical servers. By building protocol gateways as clients of the addressvirtualization layer, we preserve the ability to delegate scale-out access to devicestorage without requiring interface changes on the end hosts that consume the stor-age.163.2 ArchitectureThe performance characteristics of emerging storage hardware demand that wecompletely reconsider storage architecture in order to build scalable, low-latencyshared persistent memory. The reality of deployed applications is that interfacesmust stay exactly the same in order for a storage system to have relevance. Strata’sarchitecture aims to take a step toward the first of these goals, while keeping apragmatic focus on the second.Figure 3.1 characterizes the three layers of Strata’s architecture. The goals andabstractions of each layer of the system are on the left-hand column, and the con-crete embodiment of these goals in our implementation is on the right. At the base,we make devices accessible over an object storage interface, which is responsiblefor virtualizing the device’s address space and allowing clients to interact with in-dividual virtual devices. This approach reflects our view that system design forthese storage devices today is similar to that of CPU virtualization ten years ago:devices provide greater performance than is required by most individual work-loads and so require a lightweight interface for controlled sharing in order to allowmulti-tenancy. We implement a per-device object store that allows a device to bevirtualized into an address space of 2128 sparse objects, each of which may be upto 264 bytes in size. Our implementation is similar in intention to the OSD specifi-cation, itself motivated by network attached secure disks [50]. While not broadlydeployed to date, device-level object storage is receiving renewed attention todaythrough pNFS’s use of OSD as a backend, the NVMe namespace abstraction, andin emerging hardware such as Seagate’s Kinetic drives [99]. Our object storageinterface as a whole is not a significant technical contribution, but it does havesome notable interface customizations described in § 3.4. We refer to this layer asa Network Attached Disk, or NAD.The middle layer of our architecture provides a global address space that supportsthe efficient composition of IO processors that translate client requests on a vir-tual object into operations on a set of NAD-level physical objects. We refer to thegraph of IO processors for a particular virtual object as its data path, and we main-tain the description of the data path for every object in a global virtual address17map. Clients use a dispatch library to instantiate the processing graph describedby each data path and perform direct IO on the physical objects at the leaves ofthe graph. The virtual address map is accessed through a coherence protocol thatallows central services to update the data paths for virtual objects while they arein active use by clients. More concretely, data paths allow physical objects to becomposed into richer storage primitives, providing properties such as striping andreplication. The goal of this layer is to strike a balance between scalability andefficiency: it supports direct client access to device-level objects, without sacrific-ing central management of data placement, failure recovery, and more advancedstorage features such as deduplication and snapshots.Finally, the top layer performs protocol virtualization to allow clients to accessstorage over standard protocols (such as NFS) without losing the scalability of di-rect requests from clients to NADs. The presentation layer is tightly integrated witha 10Gb software-defined Ethernet switching fabric, allowing external clients theillusion of connecting to a single TCP endpoint, while transparently and dynami-cally balancing traffic to that single IP address across protocol instances on all ofthe NADs. Each protocol instance is a thin client of the layer below, which maycommunicate with other protocol instances to perform any additional synchroniza-tion required by the protocol (e.g., to maintain NFS namespace consistency).The mapping of these layers onto the hardware that our system uses is shown inFigure 3.2. Requests travel from clients into Strata through an OpenFlow-enabledswitch, which dispatches them according to load to the appropriate protocol han-dler running on a MicroArray (µArray) — a small host configured with flash de-vices and enough network and CPU to saturate them, containing the software stackrepresenting a single NAD. For performance, each of the layers is implemented asa library, allowing a single process to handle the flow of requests from client tomedia. The NFSv3 implementation acts as a client of the underlying dispatch layer,which transforms requests on virtual objects into one or more requests on physi-cal objects, issued through function calls to local physical objects and by RPC toremote objects. While the focus of the rest of this chapter is on this concrete imple-mentation of scale-out NFS, it is worth noting that the design is intended to allowapplications the opportunity to link directly against the same data path library that18VMware ESX HostVMware ESX HostVMware ESX HostVirtual NFS server Protocol Virtualizaiton(Scalable NFSv3) Arrows show NFS connections andassociated requests.Middle host connectionomited for clarity.Global Address Space Virtualization (libDataDispatch)Device Virtualization(CLOS)microArrayNFS InstancelibDataPathCLOSmicroArrayNFS InstancelibDataPathCLOSmicroArrayNFS InstancelibDataPathCLOS10Gb SDN SwitchFigure 3.2: Hardware view of a Strata deploymentthe NFS implementation uses, resulting in a multi-tenant, multi-presentation stor-age system with a minimum of network and device-level overhead.3.2.1 Scope of this WorkThere are three aspects of our design that are not considered in detail within thispresentation. First, we only discuss NFS as a concrete implementation of protocolvirtualization. Strata has been designed to host and support multiple protocols andtenants, but our initial product release is specifically NFSv3 for VMware clients,so we focus on this type of deployment in describing the implementation. Second,Strata was initially designed to be a software layer that is co-located on the samephysical servers that host virtual machines. We have moved to a separate physical19hosting model where we directly build on dedicated hardware, but there is nothingthat prevents the system from being deployed in a more co-located (or “converged”)manner. Finally, our full implementation incorporates a tier of spinning disks oneach of the storage nodes to allow cold data to be stored more economically behindthe flash layer. However, in this chapter we configure and describe a single-tier, all-flash system to simplify the exposition.In the next sections we discuss three relevant aspects of Strata—address spacevirtualization, dynamic reconfiguration, and scalable protocol support—in moredetail. We then describe some specifics of how these three components interact inour NFSv3 implementation for VM image storage before providing a performanceevaluation of the system as a whole.3.3 Data PathsStrata provides a common library interface to data that underlies the higher-level,client-specific protocols described in § 3.6. This library presents a notion of vir-tual objects, which are available cluster-wide and may comprise multiple physicalobjects bundled together for parallel data access, fault tolerance, or other reasons(e.g., data deduplication). The library provides a superset of the object storageinterface provided by the NADs (§ 3.4), with additional interfaces to manage theplacement of objects (and ranges within objects) across NADs, to maintain datainvariants (e.g., replication levels and consistent updates) when object ranges arereplicated or striped, and to coordinate both concurrent access to data and concur-rent manipulation of the virtual address maps describing their layout.To avoid IO bottlenecks, users of the data path interface (which may be nativeclients or protocol gateways such as our NFS server) access data directly. To do so,they map requests from virtual objects to physical objects using the virtual addressmap. This is not simply a pointer from a virtual object (id, range) pair to a setof physical object (id, range) pairs. Rather, each virtual range is associated witha particular processor for that range, along with processor-specific context. Stratauses a dispatch-oriented programming model in which a pipeline of operations is20performed on requests as they are passed from an originating client, through a setof transformations, and eventually to the appropriate storage device(s). Our modelborrows ideas from packet processing systems such as X-Kernel [60], Scout [84],and Click [69], but adapts them to a storage context, in which modules along thepipeline perform translations through a set of layered address spaces, and may forkand/or collect requests and responses as they are passed.The dispatch library provides a collection of request processors, which can standalone or be combined with other processors. Each processor takes a storage re-quest (e.g., a read or write request) as input and produces one or more requeststo its children. NADs expose isolated sparse objects; processors perform transla-tions that allow multiple objects to be combined for some functional purpose, andpresent them as a single object, which may in turn be used by other processors. Theidea of request-based address translation to build storage features has been used inother systems [74, 75, 80], often as the basis for volume management; Strata disen-tangles it from the underlying storage system and treats it as a first-class dispatchabstraction.The composition of dispatch modules bears similarity to Click [69], but the ap-plication in a storage domain carries a number of differences. First, requests aregenerally acknowledged at the point that they reach a storage device, and so as aresult they differ from packet forwarding logic in that they travel both down andthen back up through a dispatch stack; processors contain logic to handle both re-quests and responses. Second, it is common for requests to be split or merged asthey traverse a processor — for example, a replication processor may duplicate arequest and issue it to multiple nodes, and then collect all responses before pass-ing a single response back up to its parent. Finally, while processors describe fast,library-based request dispatching logic, they typically depend on additional facil-ities from the system. Strata allows processor implementations access to APIs forshared, cluster-wide state which may be used on a control path to, for instance,store replica configuration. It additionally provides facilities for background func-tionality such as NAD failure detection and response. The intention of the processororganization is to allow dispatch decisions to be pushed out to client implementa-tions and be made with minimal performance impact, while still benefiting from21common system-wide infrastructure for maintaining the system and responding tofailures. The responsibilities of the dispatch library are described in more detail inthe following subsections.3.3.1 The Virtual Address Map/objects/112:type=regular dispatch={object=111type=dispatch}/objects/111:type=dispatchstripe={stripecount=8 chunksize=5242880={object=103 type=dispatch}1={object=104 type=dispatch}}/objects/103:type=dispatchrpl={policy=mirror storecount=2{storeid=a98f2... state=in-sync}{storeid=fc89f... state=in-sync}}Figure 3.3: Virtual object to physical object range mappingFigure 3.3 shows the relevant information stored in the virtual address map for atypical object. Each object has an identifier, a type, some type-specific context, andmay contain other metadata such as cached size or modification time information(which is not canonical, for reasons discussed below).The entry point into the virtual address map is a regular object. This contains nolocation information on its own, but delegates to a top-level dispatch object. InFigure 3.3, object 112 is a regular object that delegates to a dispatch processorwhose context is identified by object 111 (the IDs are in reverse order here becausethe dispatch graph is created from the bottom up, but traversed from the top down).Thus when a client opens file 112, it instantiates a dispatcher using the data inobject 111 as context. This context informs the dispatcher that it will be delegatingIO through a striped processor, using 2 stripes for the object and a stripe width of512K. The dispatcher in turn instantiates 8 processors (one for each stripe), eachconfigured with the information stored in the object associated with each stripe22(e.g., stripe 0 uses object 103). Finally, when the stripe dispatcher performs IO onstripe 0, it will use the context in the object descriptor for object 103 to instantiatea replicated processor, which mirrors writes to the NADs listed in its replica set,and issues reads to the nearest in sync replica (where distance is currently simplylocal or remote).In addition to the striping and mirroring processors described here, the map cansupport other more advanced processors, such as erasure coding, or byte-rangemappings to arbitrary objects (which supports among other things data deduplica-tion).3.3.2 DispatchIO requests are handled by a chain of dispatchers, each of which has some com-mon functionality. Dispatchers may have to fragment requests into pieces if theyspan the ranges covered by different subprocessors, or clone requests into multiplesubrequests (e.g., for replication), and they must collect the results of subrequestsand deal with partial failures.The replication and striping modules included in the standard library are represen-tative of the ways processors transform requests as they traverse a dispatch stack.The replication processor allows a request to be split and issued concurrently toa set of replica objects. The request address remains unchanged within each ob-ject, and responses are not returned until all replicas have acknowledged a requestas complete. The processor prioritizes reading from local replicas, but forwardsrequests to remote replicas in the event of a failure (either an error response ora timeout). It imposes a global ordering on write requests and streams them toall replicas in parallel. It also periodically commits a light-weight checkpoint toeach replica’s log to maintain a persistent record of synchronization points; thesecheckpoints are used for crash recovery (§ 3.5.1).The striping processor distributes data across a collection of sparse objects. It is pa-rameterized to take a stripe size (in bytes) and a list of objects to act as the orderedstripe set. In the event that a request crosses a stripe boundary, the processor splits23that request into a set of per-stripe requests and issues those asynchronously, col-lecting the responses before returning. Static, address-based striping is a relativelysimple load balancing and data distribution mechanism as compared to placementschemes such as consistent hashing [64]. Our experience has been that the ap-proach is effective, because data placement tends to be reasonably uniform withinan object address space, and because using a reasonably large stripe size (we de-fault to 512KB) preserves locality well enough to keep request fragmentation over-head low in normal operation.3.3.3 CoherenceStrata clients also participate in a simple coordination protocol in order to allowthe virtual address map for a virtual object to be updated even while that objectis in use. Online reconfiguration provides a means for recovering from failures,responding to capacity changes, and even moving objects in response to observedor predicted load (on a device basis — this is distinct from client load balancing,which we also support through a switch-based protocol described in § 3.6.2).The virtual address maps are stored in a distributed, synchronized configurationdatabase implemented over Apache Zookeeper, which is also available for anylow-bandwidth synchronization required by services elsewhere in the softwarestack. The coherence protocol is built on top of the configuration database. Itis currently optimized for a single writer per object, and works as follows: when aclient wishes to write to a virtual object, it first claims a lock for it in the configu-ration database. If the object is already locked, the client requests that the holderrelease it so that the client can claim it. If the holder does not voluntarily releaseit within a reasonable time, the holder is considered unresponsive and fenced fromthe system using the mechanism described in § 3.6.2. This is enough to allowmovement of objects, by first creating new, out of sync physical objects at the de-sired location, then requesting a release of the object’s lock holder if there is one.The user of the object will reacquire the lock on the next write, and in the processdiscover the new out of sync replica and initiate resynchronization. When the newreplica is in sync, the same process may be repeated to delete replicas that are at24undesirable locations.3.4 Network Attached DisksThe unit of storage in Strata is a Network Attached Disk (NAD), consisting of abalanced combination of CPU, network and storage components. In our currenthardware, each NAD has two 10 gigabit Ethernet ports, two PCIe flash cards capa-ble of 10 gigabits of throughput each, and a pair of Xeon processors that can keepup with request load and host additional services alongside the data path. EachNAD provides two distinct services. First, it efficiently multiplexes the raw storagehardware across multiple concurrent users, using an object storage protocol. Sec-ond, it hosts applications that provide higher level services over the cluster. Objectrebalancing (§ 3.5.2) and the NFS protocol interface (§ 3.6.1) are examples of theseservices.At the device level, we multiplex the underlying storage into objects, named by128-bit identifiers and consisting of sparse 264 byte data address spaces. Theseaddress spaces are currently backed by a garbage-collected log-structured objectstore, but the implementation of the object store is opaque to the layers aboveand could be replaced if newer storage technologies made different access patternsmore efficient. We also provide increased capacity by allowing each object to flushlow priority or infrequently used data to disk, but this is again hidden behind theobject interface. The details of disk tiering, garbage collection, and the layout ofthe file system are beyond the scope of this chapter.The physical object interface is for the most part a traditional object-based storagedevice [98, 99] with a CRUD interface for sparse objects, as well as a few exten-sions to assist with our clustering protocol (§ 3.5.1). It is significantly simplerthan existing block device interfaces, such as the SCSI command set, but is also in-tended to be more direct and general purpose than even narrower interfaces such asthose of a key-value store. Providing a low-level hardware abstraction layer allowsthe implementation to be customized to accommodate best practices of individualflash implementations, and also allows more dramatic design changes at the media25interface level as new technologies become available.3.4.1 Network IntegrationAs with any distributed system, we must deal with misbehaving nodes. We addressthis problem by tightly coupling with managed Ethernet switches, which we dis-cuss at more length in § 3.6.2. This approach borrows ideas from systems such asSane [26] and Ethane [27], in which a managed network is used to enforce isolationbetween independent endpoints. The system integrates with both OpenFlow-basedswitches and software switching at the VMM to ensure that Strata objects are onlyaddressable by their authorized clients.Our initial implementation used Ethernet VLANs, because this form of hardware-supported isolation is in common use in enterprise environments. In the currentimplementation, we have moved to OpenFlow, which provides a more flexible tun-neling abstraction for traffic isolation.We also expose an isolated private virtual network for out-of-band control andmanagement operations internal to the cluster. This allows NADs themselves toaccess remote objects for peer-wise resynchronization and reorganization underthe control of a cluster monitor.3.5 Online ReconfigurationThere are two broad categories of events to which Strata must respond in orderto maintain its performance and reliability properties. The first category includesfaults that occur directly on the data path. The dispatch library recovers from suchfaults immediately and automatically by reconfiguring the affected virtual objectson behalf of the client. The second category includes events such as device fail-ures and load imbalance. These are handled by a dedicated cluster monitor whichperforms large-scale reconfiguration tasks to maintain the health of the system asa whole. In all cases, reconfiguration is performed online and has minimal impacton client availability.263.5.1 Object ReconfigurationA number of error recovery mechanisms are built directly into the dispatch library.These mechanisms allow clients to quickly recover from failures by reconfiguringindividual virtual objects on the data path.IO ErrorsThe replication IO processor responds to read errors in the obvious way: by im-mediately resubmitting failed requests to different replicas. In addition, clientsmaintain per-device error counts; if the aggregated error count for a device exceedsa configurable threshold, a background task takes the device offline and coordinatesa system-wide reconfiguration (§ 3.5.2).IO processors respond to write errors by synchronously reconfiguring virtual ob-jects at the time of the failure. This involves three steps. First, the affected replicais marked out of sync in the configuration database. This serves as a global, persis-tent indication that the replica may not be used to serve reads because it containspotentially stale data. Second, a best-effort attempt is made to inform the NADof the error so that it can initiate a background task to resynchronize the affectedreplica. This allows the system to recover from transient failures almost immedi-ately. Finally, the IO processor allocates a special patch object on a separate deviceand adds this to the replica set. Once a replica has been marked out of sync, no fur-ther writes are issued to it until it has been resynchronized; patches prevent devicefailures from impeding progress by providing a temporary buffer to absorb writesunder these degraded conditions. With the patch object allocated, the IO processorcan continue to meet the replication requirements for new writes while out of syncreplicas are repaired in the background. A replica set remains available as long asan in sync replica or an out of sync replica and all of its patches are available.27ResynchronizationIn addition to providing clients direct access to devices via virtual address maps,Strata provides a number of background services to maintain the health of individ-ual virtual objects and the system as a whole. The most fundamental of these is theresync service, which provides a background task that can resynchronize objectsreplicated across multiple devices.Resync is built on top of a special NAD resync API that exposes the underlyinglog structure of the object stores. NADs maintain a Log Serial Number (LSN) withevery physical object in their stores; when a record is appended to an object’s log,its LSN is monotonically incremented. The IO processor uses these LSNs to imposea global ordering on the changes made to physical objects that are replicated acrossstores and to verify that all replicas have received all updates.If a write failure causes a replica to go out of sync, the client can request the systemto resynchronize the replica. It does this by invoking the resync RPC on the NADwhich hosts the out of sync replica. The server then starts a background task whichstreams the missing log records from an in sync replica and applies them to thelocal out of sync copy, using the LSN to identify which records the local copy ismissing.During resync, the background task has exclusive write access to the out of syncreplica because all clients have been reconfigured to use patches. Thus the resynctask can chase the tail of the in sync object’s log while clients continue to write.When the bulk of the data has been copied, the resync task enters a final stop-and-copy phase in which it acquires exclusive write access to all replicas in the replicaset, finalizes the resync, applies any client writes received in the interim, marks thereplica as in sync in the configuration database, and removes the patch.It is important to ensure that resync makes timely progress to limit vulnerability todata loss. Very heavy client write loads may interfere with resync tasks and, in theworst case, result in unbounded transfer times. For this reason, when an object isunder resync, client writes are throttled and resync requests are prioritized.28Crash RecoverySpecial care must be taken in the event of an unclean shutdown. On a cleanshutdown, all objects are released by removing their locks from the configurationdatabase. Crashes are detected when replica sets are discovered with stale locks(i.e., locks identifying unresponsive IO processors). When this happens, it is notsafe to assume that replicas marked in sync in the configuration database are trulyin sync, because a crash might have occured midway through a the configurationdatabase update; instead, all the replicas in the set must be queried directly to de-termine their states.In the common case, the IO processor retrieves the LSN for every replica in theset and determines which replicas, if any, are out of sync. If all replicas have thesame LSN, then no resynchronization is required. If different LSNs are discovered,then the replica with the highest LSN is designated as the authoritative copy, andall other replicas are marked out of sync and resync tasks are initiated.If a replica cannot be queried during the recovery procedure, it is marked as di-verged in the configuration database and the replica with the highest LSN from theremaining available replicas is chosen as the authoritative copy. In this case, writesmay have been committed to the diverged replica that were not committed to anyothers. If the diverged replica becomes available again some time in the future,these extra writes must be discarded. This is achieved by rolling the replica backto its last checkpoint and starting a resync from that point in its log. Consistencyin the face of such rollbacks is guaranteed by ensuring that objects are successfullymarked out of sync in the configuration database before writes are acknowledgedto clients. Thus write failures are guaranteed to either mark replicas out of sync inthe configuration database (and create corresponding patches) or propagate back tothe client.3.5.2 System ReconfigurationStrata also provides a highly-available monitoring service that watches over thehealth of the system and coordinates system-wide recovery procedures as neces-29sary. Monitors collect information from clients, SMART diagnostic tools, and NADRPCs to gauge the status of the system. Monitors build on the per-object recon-figuration mechanisms described above to respond to events that individual clientsdon’t address, such as load imbalance across the system, stores nearing capacity,and device failures.RebalanceStrata provides a rebalance facility which is capable of performing system-widereconfiguration to repair broken replicas, prevent NADs from filling to capacity,and improve load distribution across NADs. This facility is in turn used to recoverfrom device failures and expand onto new hardware.Rebalance proceeds in two stages. In the first stage, the monitor retrieves the cur-rent system configuration, including the status of all NADs and virtual address mapof every virtual object. It then constructs a new layout for the replicas accordingto a customizable placement policy. This process is scriptable and can be easilytailored to suit specific performance and durability requirements for individual de-ployments (see § 3.7.3 for some analysis of the effects of different placement poli-cies). The default policy uses a greedy algorithm that considers a number of criteriadesigned to ensure that replicated physical objects do not share fault domains, ca-pacity imbalances are avoided as much as possible, and migration overheads arekept reasonably low. The new layout is formulated as a rebalance plan describingwhat changes need to be applied to individual replica sets to achieve the desiredconfiguration.In the second stage, the monitor coordinates the execution of the rebalance plan byinitiating resync tasks on individual NADs to effect the necessary data migration.When replicas need to be moved, the migration is performed in three steps:1. A new replica is added to the destination NAD2. A resync task is performed to transfer the data3. The old replica is removed from the source NAD30This requires two reconfiguration events for the replica set, the first to extend it toinclude the new replica, and the second to prune the original after the resync hascompleted. The monitor coordinates this procedure across all NADs and clients forall modified virtual objects.Device FailureStrata determines that a NAD has failed either when it receives a hardware failurenotification from a responsive NAD (such as a failed flash device or excessive errorcount) or when it observes that a NAD has stopped responding to requests for morethan a configurable timeout. In either case, the monitor responds by taking the NADoffline and initiating a system-wide reconfiguration to repair redundancy.The first thing the monitor does when taking a NAD offline is to disconnect it fromthe data path VLAN. This is a strong benefit of integrating directly against anEthernet switch in our environment: prior to taking corrective action, the NAD issynchronously disconnected from the network for all request traffic, avoiding thedistributed systems complexities that stem from things such as overloaded com-ponents appearing to fail and then returning long after a timeout in an inconsis-tent state. Rather than attempting to use completely end-host mechanisms suchas watchdogs to trigger reboots, or agreement protocols to inform all clients of aNAD’s failure, Strata disables the VLAN and requires that the failed NAD reconnecton the (separate) control VLAN in the event that it returns to life in the future.From this point, the recovery logic is straight forward. The NAD is marked as failedin the configuration database and a rebalance job is initiated to repair any replicasets containing replicas on the failed NAD.Elastic Scale OutStrata responds to the introduction of new hardware much in the same way thatit responds to failures. When the monitor observes that new hardware has beeninstalled, it uses the rebalance facility to generate a layout that incorporates the31new devices. Because replication is generally configured underneath striping, wecan migrate virtual objects at the granularity of individual stripes, allowing a sin-gle striped file to exploit the aggregated performance of many devices. Objects,whether whole files or individual stripes, can be moved to another NAD even whilethe file is online, using the existing resync mechanism. New NADs are populatedin a controlled manner to limit the impact of background IO on active client work-loads.3.6 Storage ProtocolsStrata supports legacy protocols by providing an execution runtime for hostingprotocol servers. Protocols are built as thin presentation layers on top of the dis-patch interfaces; multiple protocol instances can operate side by side. Implementa-tions can also leverage SDN-based protocol scaling to transparently spread multipleclients across the distributed runtime environment.3.6.1 Scalable NFSStrata is designed so that application developers can focus primarily on implement-ing protocol specifications without worrying much about how to organize data ondisk. We expect that many storage protocols can be implemented as thin wrappersaround the provided dispatch library. Our NFS implementation, for example, mapsvery cleanly onto the high-level dispatch APIs, providing only protocol-specific ex-tensions like RPC marshalling and NFS-style access control. It takes advantage ofthe configuration database to store mappings between the NFS namespace and thebackend objects, and it relies exclusively on the striping and replication processorsto implement the data path. Moreover, Strata allows NFS servers to be instantiatedacross multiple backend nodes, automatically distributing the additional processingoverhead across backend compute resources.323.6.2 SDN Protocol ScalingScaling legacy storage protocols can be challenging, especially when the protocolswere not originally designed for a distributed back end. Protocol scalability limi-tations may not pose significant problems for traditional arrays, which already sitbehind relatively narrow network interfaces, but they can become a performancebottleneck in Strata’s distributed architecture.A core property that limits scale of access bandwidth of conventional IP storageprotocols is the presentation of storage servers behind a single IP address. For-tunately, emerging “software defined” network (SDN) switches provide interfacesthat allow applications to take more precise control over packet forwarding throughEthernet switches than has traditionally been possible.Using the OpenFlow protocol, a software controller is able to interact with theswitch by pushing flow-specific rules onto the switch’s forwarding path. Open-Flow rules are effectively wild-carded packet filters and associated actions that tella switch what to do when a matching packet is identified. SDN switches (our imple-mentation currently uses an Arista Networks 7050T-52) interpret these flow rulesand push them down onto the switch’s TCAM or L2/L3 forwarding tables.By manipulating traffic through the switch at the granularity of individual flows,Strata protocol implementations are able to present a single logical IP address tomultiple clients. Rules are installed on the switch to trigger a fault event whenevera new NFS session is opened, and the resulting exception path determines whichprotocol instance to forward that session to initially. A service monitors networkactivity and migrates client connections as necessary to maintain an even workloaddistribution.The protocol scaling API wraps and extends the conventional socket API, allowinga protocol implementation to bind to and listen on a shared IP address across allof its instances. The client load balancer then monitors the traffic demands acrossall of these connections and initiates flow migration in response to overload on anyindividual physical connection.33In its simplest form, client migration is handled entirely at the transport layer.When the protocol load balancer observes that a specific NAD is overloaded, it up-dates the routing tables to redirect the busiest client workload to a different NAD.Once the client’s traffic is diverted, it receives a TCP RST from the new NAD andestablishes a new connection, thereby transparently migrating traffic to the newNAD.Strata also provides hooks for situations where application layer coordination isrequired to make migration safe. For example, our NFS implementation registers apre-migration routine with the load balancer, which allows the source NFS serverto flush any pending, non-idempotent requests (such as create or remove) beforethe connection is redirected to the destination server.3.7 EvaluationIn this section we evaluate our system both in terms of effective use of flash re-sources, and as a scalable, reliable provider of storage for NFS clients. First, weestablish baseline performance over a traditional NFS server on the same hardware.Then we evaluate how performance scales as nodes are added and removed fromthe system, using VM-based workloads over the legacy NFS interface, which isoblivious to cluster changes. In addition, we compare the effects of load balancingand object placement policy on performance. We then test reliability in the faceof node failure, which is a crucial feature of any distributed storage system. Wealso examine the relation between CPU power and performance in our system as ademonstration of the need to balance node power between flash, network and CPU.3.7.1 Test EnvironmentEvaluation was performed on a cluster of the maximum size allowed by our 48-portswitch: 12 NADs, each of which has two 10 gigabit Ethernet ports, two 800 GB In-tel 910 PCIe flash cards, 6 3 TB SATA drives, 64 GB of RAM, and 2 Xen E5-2620processors at 2 GHz with 6 cores/12 threads each, and 12 clients, in the form of34Server Read IOPS Write IOPSStrata 40287 9960KNFS 23377 5796Table 3.1: Random IO performance on Strata versus KNFSDell PowerEdge R420 servers running ESXi 5.0, with two 10 gigabit ports each,64 GB of RAM, and 2 Xeon E5-2470 processors at 2.3 GHz with 8 cores/16 threadseach. We configured the deployment to maintain two replicas of every stored ob-ject, without striping (since it unnecessarily complicates placement comparisonsand has little benefit for symmetric workloads). Garbage collection is active, andthe deployment is in its standard configuration with a disk tier enabled, but theworkloads have been configured to fit entirely within flash, as the effects of cachemisses to magnetic media are not relevant to this chapter.3.7.2 Baseline PerformanceTo provide some performance context for our architecture versus a typical NFSimplementation, we compare two minimal deployments of NFS over flash. We setStrata to serve a single flash card, with no replication or striping, and mounted itloopback. We ran a FIO [14] workload with a 4K IO size 80/20 read-write mix ata queue depth of 128 against a fully allocated file. We then formatted the flashcard with ext4, exported it with the linux kernel NFS server, and ran the same test.The results are in Table 3.1. As the table shows, we offer good NFS performanceat the level of individual devices. In the following section we proceed to evaluatescalability.3.7.3 ScalabilityIn this section we evaluate how well performance scales as we add NADs to thecluster. We begin each test by deploying 96 VMs (8 per client) into a cluster of 2NADs. We choose this number of VMs because ESXi limits the queue depth for aVM to 32 outstanding requests, but we do not see maximum performance until a35Seconds0 420 840 1260 1680 2100 2520 2940 3360 3780 4200 4620 5040 5460 5880 6300 6720 7140IOPS     010000020000030000040000050000060000070000080000090000010000001100000Figure 3.4: IOPS over time, read-only workloadqueue depth of 128 per flash card. The VMs are each configured to run the sameFIO workload for a given test. In Figure 3.4, FIO generates 4K random reads tofocus on IOPS scalability. In Figure 3.5, FIO generates an 80/20 mix of reads andwrites at 128K block size in a Pareto distribution such that 80% of requests go to20% of the data. This is meant to be more representative of real VM workloads, butwith enough offered load to completely saturate the cluster.As the tests run, we periodically add NADs, two at a time, up to a maximum oftwelve2. When each pair of NADs comes online, a rebalancing process automat-ically begins to move data across the cluster so that the amount of data on eachNAD is balanced. When it completes, we run in a steady state for two minutesand then add the next pair. In both figures, the periods where rebalancing is inprogress are reflected by a temporary drop in performance (as the rebalance pro-cess competes with client workloads for resources), followed by a rapid increasein overall performance when the new nodes are marked available, triggering the2ten for the read/write test due to an unfortunate test harness problem36Seconds0 360 720 1080 1440 1800 2160 2520 2880 3240 3600 3960 4320 4680 5040 5400 5760 6120 6480 6840IOPS     0 10000 20000 30000 40000 50000 60000 70000Figure 3.5: IOPS over time, 80/20 R/W workloadswitch to load-balance clients to them. A cluster of 12 NADs achieves over 1 mil-lion IOPS in the IOPS test, and 10 NADs achieve 70,000 IOPS (representing morethan 9 gigabytes/second of throughput) in the 80/20 test.We also test the effect of placement and load balancing on overall performance.If the location of a workload source is unpredictable (as in a VM data center withvirtual machine migration enabled), we need to be able to migrate clients quicklyin response to load. However, if the configuration is more static or can be predictedin advance, we may benefit from attempting to place clients and data together toreduce the network overhead incurred by remote IO requests. As discussed in§ 3.5.2, the load-balancing and data migration features of Strata make both ap-proaches possible. Figure 3.4 is the result of an aggressive local placement policy,in which data is placed on the same NAD as its clients, and both are moved asthe number of devices changes. This achieves the best possible performance at thecost of considerable data movement. In contrast, Figure 3.6 shows the performanceof an otherwise identical test configuration when data is placed randomly (while37Seconds0 420 840 1260 1680 2100 2520 2940 3360 3780 4200 4620 5040 5460 5880 6300 6720 7140 7560IOPS     0100000200000300000400000Figure 3.6: IOPS over time, read-only workload with random placementstill satisfying fault tolerance and even distribution constraints), rather than beingmoved according to client requests. The pareto workload (Figure 3.5) is also con-figured with the default random placement policy, which is the main reason that itdoes not scale linearly: as the number of nodes increases, so does the probabilitythat a request will need to be forwarded to a remote NAD.3.7.4 Node FailureAs a counterpoint to the scalability tests run in the previous section, we also testedthe behaviour of the cluster when a node is lost. We configured a 10 NAD clusterwith 10 clients hosting 4 VMs each, running the 80/20 Pareto workload describedearlier. Figure 3.7 shows the behaviour of the system during this experiment. Afterthe VMs had been running for a short time, we powered off one of the NADs byIPMI, waited 60 seconds, then powered it back on. During the node outage, thesystem continued to run uninterrupted but with lower throughput. When the node38came back up, it spent some time resynchronizing its objects to restore full repli-cation to the system, and then rejoined the cluster. The client load balancer shiftedclients onto it and throughput was restored (within the variance resulting from theclient load balancer’s placement decisions).Seconds0 60 120 180 240 300 360 420GB/s0123456789101112Figure 3.7: Aggregate bandwidth for 80/20 clients during failover and recovery3.7.5 Protocol OverheadThe benchmarks up to this point have all been run inside VMs whose storage isprovided by a virtual disk that Strata exports by NFS to ESXi. This configurationrequires no changes on the part of the clients to scale across a cluster, but does im-pose overheads. To quantify these overheads we wrote a custom FIO engine that iscapable of performing IO directly against our native dispatch interface (that is, theAPI by which our NFS protocol gateway interacts with the NADs). We then com-pared the performance of a single VM running a random 4k read FIO workload (formaximum possible IOPS) against a VMDK exported by NFS to the same workloadrun against our native dispatch engine. In this experiment, the VMDK-based exper-39CPU IOPS Freq (Cores) PriceE5-2620 127K 2 GHz (6) $406E5-2640 153K (+20%) 2.5 GHz (6) $885E5-2650v2 188K (+48%) 2.6 GHz (8) $1166E5-2660v2 183K (+44%) 2.2 GHz (10) $1389Table 3.2: Achieved IOPS on an 80/20 random 4K workload across 2 NADsiment produced an average of 50240 IOPS, whereas direct access achieved 54060IOPS, for an improvement of roughly 8%.3.7.6 Effect of CPU on PerformanceA workload running at full throttle with small requests completely saturates theCPU. This remains true despite significant development effort in performance de-bugging, and a great many improvements to minimize data movement and con-tention. In this section we report the performance improvements resulting fromfaster CPUs. These results are from random 4K NFS requests in an 80/20 read-write mix at 128 queue depth over four 10Gb links to a cluster of two NADs, eachequipped with 2 physical CPUs.Table 3.2 shows the results of these tests. In short, it is possible to “buy” ad-ditional storage performance under full load by upgrading the CPUs into a more“balanced” configuration. The wins are significant and carry a non-trivial increasein the system cost. As a result of this experimentation, we elected to use a higherperformance CPU in the shipping version of the product.3.8 Related WorkStrata applies principles from prior work in server virtualization, both in the form ofhypervisor [18, 118] and lib-OS [45] architectures, to solve the problem of sharingand scaling access to fast non-volatile memories among a heterogeneous set ofclients. Our contributions build upon the efforts of existing research in severalareas.40Recently, researchers have begin to investigate a broad range of system perfor-mance problems posed by storage class memory in single servers [15], includingcurrent PCIe flash devices [113], next generation PCM [5], and byte addressabil-ity [33]. Moneta [28] proposed solutions to an extensive set of performance bottle-necks over the PCIe bus interface to storage, and others have investigated improvingthe performance of storage class memory through polling [127], and avoiding sys-tem call overheads altogether [29]. We draw from this body of work to optimizethe performance of our dispatch library, and use this baseline to deliver a highperformance scale-out network storage service. In many cases, we would benefitfurther from these efforts—for example, our implementation could be optimizedto offload per-object access control checks, as in Moneta-D [29]. There is alsoa body of work on efficiently using flash as a caching layer for slower, cheaperstorage in the context of large file hosting. For example, S-CAVE [76] optimizescache utilization on flash for multiple virtual machines on a single VMware hostby running as a hypervisor module. This work is largely complementary to ours;we support using flash as a caching layer and would benefit from more effectivecache management strategies.Prior research into scale-out storage systems, such as FAWN [9], and Corfu [16]has considered the impact of a range of NV memory devices on cluster storage per-formance. However, to date these systems have been designed towards lightweightprocessors paired with simple flash devices. It is not clear that this balance isthe correct one, as evidenced by the tendency to evaluate these same designs onsignificantly more powerful hardware platforms than they are intended to oper-ate [16]. Strata is explicitly designed for dense virtualized server clusters backedby performance-dense PCIe-based nonvolatile memory. In addition, like older com-modity disk-oriented systems including Petal [71, 112] and FAB [95], prior storagesystems have tended to focus on building aggregation features at the lowest level oftheir designs, and then adding a single presentation layer on top. Strata in contrastsisolates shares each powerful PCIe-based storage class memory as its underlyingprimitive. This has allowed us to present a scalable runtime environment in whichmultiple protocols can coexist as peers without sacrificing the raw performance thattoday’s high performance memory can provide. Many scale-out storage systems,41including NV-Heaps [32], Ceph/RADOS [115, 117], and even pNFS [56] are un-able to support the legacy formats in enterprise environments. Our agnosticism toany particular protocol is similar to approach used by Ursa Minor [1], which alsoboasted a versatile client library protocol to share access to a cluster of magneticdisks.Strata does not attempt to provide storage for datacenter-scale environments, un-like systems including Azure [25], FDS [88], or Bigtable [30]. Storage systems inthis space differ significantly in their intended workload, as they emphasize highthroughput linear operations. Strata’s managed network would also need to beextended to support datacenter-sized scale out. We also differ from in-RAM ap-proaches such a RAMCloud [91] and memcached [46], which offer a different classof durability guarantee and cost.3.9 ConclusionStorage system design faces a sea change resulting from the dramatic increase inthe performance density of its component media. Distributed storage systems com-posed of even a small number of network-attached flash devices are now capable ofmatching the offered load of traditional systems that would have required multipleracks of spinning disks.Strata is an enterprise storage architecture that responds to the performance char-acteristics of PCIe storage devices. Using building blocks of well-balanced flash,compute, and network resources and then pairing the design with the integrationof SDN-based Ethernet switches, Strata provides an incrementally deployable, dy-namically scalable storage system.Strata’s initial design is specifically targeted at enterprise deployments of VMwareESX, which is one of the dominant drivers of new storage deployments in enter-prise environments today. The system achieves high performance and scalabil-ity for this specific NFS environment while allowing applications to interact di-rectly with virtualized, network-attached flash hardware over new protocols. Thisis achieved by cleanly partitioning our storage implementation into an underly-42ing, low-overhead virtualization layer and a scalable framework for implementingstorage protocols. Over the next year, we intend to extend the system to providegeneral-purpose NFS support by layering a scalable and distributed metadata ser-vice and small object support above the base layer of coarse-grained storage prim-itives.43Chapter 4Characterizing StorageWorkloads with Counter StacksA version of this chapter was published at the 11th USENIX Conference on Oper-ating Systems Design and Implementation in 2014 [122].4.1 IntroductionCaching is poorly understood. Despite being a pervasive element of computer sys-tem design – one that spans processor, storage system, operating system, and evenapplication architecture – the effective sizing of memory tiers and the design ofalgorithms that place data within them remains an art of characterizing and ap-proximating common case behaviors.The design of hierarchical memories is complicated by two factors: First, the col-lection of live workload-specific data that might be analyzed to make “applicationaware” decisions is generally too expensive to be worthwhile. Approaches thatmodel workloads to make placement decisions risk consuming the computationaland memory resources that they are trying to preserve. As a result, systems inmany domains have tended to use simple, general purpose algorithms such as LRU44to manage cache placement. Second, attempting to perform offline analysis of ac-cess patterns suffers from the performance overheads imposed in trace collection,and the practical challenges of both privacy and sheer volume, in sharing and ana-lyzing access traces.Today, these problems are especially pronounced in designing enterprise storagesystems. Flash memories are now available in three considerably different formfactors: as SAS or SATA-attached solid state disks, as NVMe devices connectedover the PCIe bus, and finally as flash-backed nonvolatile RAM, accessible overa DIMM interface. These three connectivity models all use the same underlyingflash memory, but present performance and pricing that are pairwise 1-2 ordersof magnitude apart. Further, in addition to solid-state memories, spinning disksremain an economical option for the storage of cold data.This chapter describes an approach to modeling, analyzing, and reasoning aboutmemory access patterns that has been motivated through our experience in de-signing a hierarchical storage system [34] that combines these varying classes ofstorage media. The system is a scalable, network-attached storage system that canbenefit from workload awareness in two ways: First, the system can manage allo-cation of the memory hierarchy in response to workload characteristics. Second,the capacity at each level of the hierarchy can be independently expanded to sat-isfy application demands, by adding additional hardware. Both of these propertiesrequire a more precise ability to understand and characterize individual storageworkloads, and in particular their working set sizes over time.Miss ratio curves (MRCs) are an effective tool for assessing working set sizes, butthe space and time required to generate them make them impractical for large-scalestorage workloads. We present a new data structure, the counter stack, which cangenerate approximate LRU MRCs in sublinear space, for the first time making thistype of analysis feasible in the storage domain.Counter stacks use probabilistic counters [47] to estimate LRU MRCs. The origi-nal approach to generating MRCs is based on the observation that a block’s ‘stackdistance’ (also known as its ‘reuse distance’) gives the capacity needed to cache it,and this distance is exactly the number of unique blocks accessed since the previ-45ous request for the block. The key idea behind counter stacks is that probabilisticcounters can be used to efficiently estimate stack distances, allowing us to computeapproximate MRCs at a fraction of the cost of traditional techniques.Counter stacks are fast. Our Java implementation can process a week-long trace of13 enterprise servers in 17 minutes using just 80 MB of RAM; at a rate of 2.3 millionrequests per second, the approach is practical for online analysis in production sys-tems. By comparison, a recent C implementation of a tree-based optimization [89]of Mattson’s original stack algorithm [78] takes roughly an hour and 92 GB ofRAM to process the same trace.Our contributions in this chapter are threefold. First, we introduce a novel tech-nique for estimating miss ratio curves using counter stacks, and we evaluate theperformance and accuracy of this technique. Second, we show how counter stackscan be periodically checkpointed and streamed to disk to provide a highly com-pressed representation of storage workloads. Counter stack streams capture im-portant details that are discarded by statistical aggregation while at the same timerequiring orders of magnitude less storage and processing overhead than full re-quest traces; a counter stack stream of the compressed 2.9 GB trace mentionedabove consumes just 11 MB. Third, we present techniques for working with mul-tiple independent counter stacks to estimate miss ratio curves for new workloadcombinations. Our library implements slice, shift, and join operations, enablingthe nearly-instantaneous computation of MRCs for arbitrary workload combina-tions over arbitrary windows in time. These capabilities extend the functionality ofMRC analysis and provide valuable insight into live workloads, as we demonstratewith a number of case studies.4.2 BackgroundThe many reporting facilities embedded in the modern Linux storage stack [21, 23,61, 83] are testament to the importance of being able to accurately characterize liveworkloads. Common characterizations typically fall into one of two categories:coarse-grain aggregate statistics and full request traces. While these representa-46tions have their uses, they can be problematic for a number of reasons: averagesand histograms discard key temporal information; sampling is vulnerable to theoften bursty and irregular nature of storage workloads; and full traces impose im-practical storage and processing overheads. New representations are needed whichpreserve the important features of full traces while remaining manageable to col-lect, store, and query.Working set theory [36] provides a useful abstraction for describing workloadsmore concisely, particularly with respect to how they will behave in hierarchicalmemory systems. In the original formulation, working sets were defined as the setof all pages accessed by a process over a given epoch. This was later refined byusing LRU modelling to derive an MRC for a given workload and restricting theworking set to only those pages that exhibit strong locality. Characterizing work-loads in terms of the unique, ‘hot’ pages they access makes it easier to understandtheir individual hardware requirements, and has proven useful in CPU cache man-agement for many years [68, 93, 109]. These concepts hold for storage workloadsas well, but their application in this domain is challenging for two reasons.First, until now it has been prohibitively expensive to calculate the working set ofstorage workloads due to their large sizes. Mattson’s original stack algorithm [78]required O(NM) time and O(M) space for a trace of N requests and M uniqueelements. An optimization using a balanced tree to maintain stack distances [7] re-duces the time complexity to O(N logM), and recent approximation techniques [38,126] reduce the time complexity even further, but they still have O(M) space over-heads, making them impractical for storage workloads that may contain billions ofunique blocks.Second, the extended duration of storage workloads leads to subtleties when rea-soning about their working sets. CPU workloads are relatively short-lived, and inmany cases it is sufficient to consider their working sets over small time intervals(e.g., a scheduling quantum) [132]. Storage workloads, on the other hand, can spanweeks or months and can change dramatically over time. MRCs at this scale can betricky: if they include too little history they may fail to capture important recurringpatterns, but if they include too much history they can significantly misrepresent47recent behavior.This phenomenon is further exacerbated by the fact that storage workloads alreadysit behind a file system cache and thus typically exhibit longer reuse distancesthan CPU workloads [133]. Consequently, cache misses in storage workloads mayhave a more pronounced effect on miss ratios than CPU cache misses, becausesubsequent re-accesses are likely to be absorbed by the file system cache ratherthan contributing to hits at the storage layer.One implication of this is that MRC analysis needs to be performed over varioustime intervals to be effective in the storage domain. A workload’s MRC over thepast hour may differ dramatically from its MRC over the past day; both data pointsare useful, but neither provides a complete picture on its own.This leads naturally to the notion of a history of locality: a workload represen-tation which characterizes working sets as they change over time. Ideally, thisrepresentation contains enough information to produce MRCs over arbitrary rangesin time, in much the same way that full traces support statistical aggregation overarbitrary intervals. A naïve implementation could produce this representation byperiodically instantiating new Mattson stacks at fixed intervals of a trace, therebymodelling independent LRU caches with various amounts of history, but such anapproach would be impractical for real-world workloads.In the following section we describe a novel technique for computing stack dis-tances (and by extension, MRCs), from an inefficient, idealized form of counterstacks. § 4.4 explains several optimizations which allow a practical counter stackimplementation that requires sublinear space, and § 4.5 presents the additional op-erations that counter stacks support, such as slicing and joining.4.3 Counter StacksCounter stacks capture locality properties of a sequence of accesses within an ad-dress space. In the context of a storage system, accesses are typically read or writerequests to physical disks, logical volumes, or individual files. A counter stack can48process a sequence of requests as they occur in a live storage system, or it can pro-cess, in a single pass, a trace of a storage workload. The purpose of a counter stackis to represent specific characteristics of the stream of requests in a form that is ef-ficient to compute and store, and that preserves enough information to characterizeaspects of the workload, such as cache behaviour.Rather than representing a trace as a sequence of requests for specific addresses,counter stacks maintain a list of counters, which are periodically instantiated whileprocessing the trace. Each counter records the number of unique trace elementsobserved since the inception of that counter; this captures the size of the workingset over the corresponding portion of the trace. Computing and storing samples ofworking set size, rather than a complete access trace, yields a very compact repre-sentation of the trace that nevertheless reveals several useful properties, such as thenumber of unique blocks requested, or the stack distances of all requests, or phasechanges in the working set. These properties enable computation of MRCs overarbitrary portions of the trace. Furthermore, this approach supports compositionand extraction operations, such as joining together multiple traces or slicing tracesby time, while examining only the compact representation, not the original traces.4.3.1 DefinitionA counter stack is an in-memory data structure that is updated while processinga trace. At each time step, the counter stack can report a list of values giving thenumbers of distinct blocks that were requested between the current time and allprevious points in time. This data structure evolves over time, and it is convenientto display its history as a matrix, in which each column records the values reportedby the counter stack at some point in time.Formally, given a trace sequence (e1 . . .eN), where ei is the ith trace element, con-sider an N×N matrix C whose entry in the ith row and jth column is the numberof distinct elements in the set{ei . . .e j}. For example, the trace (a,b,c,a) yieldsthe following matrix.49( a, b, c, a, )1 2 3 31 2 31 21The jth column of this matrix gives the values reported by the counter stack at timestep j, i.e., the numbers of distinct blocks that were requested between that timeand all previous times. The ith row of the matrix can be viewed as the sequence ofvalues produced by the counter that was instantiated at time step i.The in-memory counter stack only stores enough information to produce, at anypoint in time, a single column of the matrix. To compute our desired propertiesover arbitrary portions of the trace, we need to store the entire history of the datastructure, i.e., the entire matrix. However, the history does not need be stored inmemory. Instead, at each time step we write to disk the current column of valuesreported by the counter stack. This can be viewed as checkpointing, or incremen-tally updating, the on-disk representation of the matrix. This on-disk representationis called a counter stack stream; for conciseness we will typically refer to it simplyas a stream.4.3.2 LRU Stack DistancesStack distances and MRCs have numerous applications in cache sizing [78], mem-ory partitioning between processes or VMs [62, 107, 109, 132], garbage collectionfrequency [128], program analysis [38, 131], workload phase detection [102], etc.A significant obstacle to the widespread use of MRCs is the cost of computing them,particularly the high storage cost [20, 89, 103, 106, 129] – all existing methods re-quire linear space. Counter stacks eliminate this obstacle by providinge extremelyefficient MRC computation while using sublinear space.In this subsection we explain how stack distances, and hence MRCs, can be derivedfrom counter stack streams. Recall that the stack distance of a given request isthe number of distinct elements observed since the last reference to the requestedelement. Because a counter stack stores information about distinct elements, de-50termining the stack distance is straightforward. At time step j one must find thelast position in the trace, i, of the requested element, then examine entry Ci j of thematrix to determine the number of distinct elements requested between times i andj. For example, let us consider the matrix given in § 4.3.1. To determine the stackdistance for the second reference to trace element a at position 4, whose previousreference was at position 1, we look up the value C1,4 and get a stack distance of 3.This straightforward method ignores a subtlety: how can one find the last positionin the trace of the requested element? It turns out that this information is implicitlycontained in the counter stack. To explain this, suppose that the counter that wasinstantiated at time i does not increase during the processing of element e j. Sincethis counter reports the number of distinct elements that it has seen, we can inferthat this counter has already seen element e j. On the other hand, if the counterinstantiated at time i+1 does increase while processing e j, then we can infer thatthis counter has not yet seen element e j. Combining those inferences, we canconclude that i is the position of last reference.These observations lead to a finite-differencing scheme that can pinpoint the po-sitions of last reference. At each time step, we must determine how much eachcounter increases during the processing of the current element of the trace. This iscalled the intra-counter change, and it is defined to be∆xi j = Ci, j−Ci, j−1To pinpoint the position of last reference, we must find the newest counter that doesnot increase. This can be done by comparing the intra-counter change of adjacentcounters. This difference is called the inter-counter change, and it is defined to be∆yi j =∆xi+1, j−∆xi, j if i< j0 if i = jLet us illustrate these definitions with an example. Restricting our focus to the firstfour elements of the example trace from § 4.3.1, the matrices ∆x and ∆y are51{ a, b, c, a }1 1 1 01 1 11 11∆x{ a, b, c, a }0 0 0 10 0 00 00∆yEvery column of ∆y either contains only zeros, or contains a single 1. The formercase occurs when the element requested in this column has never been requestedbefore. In the latter case, if the single 1 appears in row i, then the last request forthat element was at time i. For example, because ∆y14 = 1, the last request forelement a before time 4 was at time 1.Determining the stack distance is now simple, as before. While processing columnj of the stream, we infer that the last request for the element e j occurred at timei by observing that ∆yi j = 1. The stack distance for the jth request is the numberof distinct elements that were requested between time i and time j, which is Ci j.Recall that the MRC at cache size x is the fraction of requests with stack distanceexceeding x. Therefore given all the stack distances, we can easily compute theMRC.4.4 Practical Counter StacksThe idealized counter stack stream defined in § 4.3 stores the entire matrix C, soit requires space that is quadratic in the length of the trace. This is actually moreexpensive than storing the original trace. In this section we introduce several ideasthat allow us to dramatically reduce the space of counter stacks and streams.§ 4.4.1 discusses the natural idea of decreasing the time resolution, i.e., keepingonly every dth row and column of the matrix C. § 4.4.2 discusses the idea ofpruning: eventually a counter may have observed the same set of elements as itsadjacent counter, at which point maintaining both of them becomes unnecessary.Finally, § 4.4.3 introduces the crucial idea of using probabilistic counters to effi-ciently and compactly estimate the number of distinct elements seen in the trace.524.4.1 DownsamplingThe simplest way to improve the space used by counter stacks and streams is todecrease the time resolution. This idea is not novel, and similar techniques havebeen used in previous work [42].In our context, decreasing the time resolution amounts to keeping only a smallsubmatrix of C that provides enough data, and of sufficient accuracy, to be usefulfor applications. For example, one could start a new counter only at every dthposition in the trace; this amounts to keeping only every dth row of the matrix C.Next, one could update the counters only at every dth position in the trace; thisamounts to keeping only every dth column of the matrix C. We call this processdownsampling.Adjacent entries in the original matrix C can differ only by 1, so adjacent entries inthe downsampled matrix can differ only by d. Thus, any entry that is missing fromthe downsampled matrix can be estimated using nearby entries that are present, upto additive error d. For large-scale workloads with billions of distinct elements,even choosing a very large value of d has negligible impact on the estimated stackdistances and MRCs.Our implementation uses a slightly more elaborate form of downsampling becausewe wish to combine traces that may have activity bursts in disjoint time intervalsand avoid writing columns during idle periods. As well as starting a new counterand updating the old counters after every dth request, we also start a new counterand update the old counters every s seconds with one exception: we do not outputa column if the previous s seconds contain no activity. Our experiments reportedin § 4.7 pick d = 106 and s ∈ {60,3600}.4.4.2 PruningRecall that every row of the matrix contains a sequence of values reported by somecounter. For any two adjacent counters, the older one (the upper row) will alwaysemit values larger than or equal to the younger one (the lower row). Let us consider53the difference of these counters. Initially, at the time the younger one is created,their difference is simply the number of distinct elements seen by the older counterso far. If any of these elements reappears in the trace, the older counter will notincrease (as it has seen this element before), but the younger counter will increase,so the difference of the counters shrinks.If at some point the younger counter has seen every element seen by the oldercounter, then their difference becomes zero and will remain zero forever. In thiscase, the younger counter provides no additional information, so it can be deleted.An extension of this idea is that, when the difference between the counters becomessufficiently small, the younger counter provides negligible additional information.In this case, the younger counter can again be deleted, and its value can be approx-imated by referring to the older counter. We call this process pruning.The simplest pruning strategy is to delete the younger counter whenever its valuediffers from its older neighbor by at most p. This strategy ensures that the numberof active counters at any point in time is at most M/p. (Recall that M is the numberof distinct blocks in the entire trace.) In our current implementation, in order to fixa set of parameters that work well across many workloads of varying sizes, weinstead delete the younger counter whenever its value is at least (1− δ ) times theolder counter’s value. This ensures that the number of active counters is at mostO(log(M)/δ ). Our experiments reported in § 4.7 pick δ ∈ {0.1,0.02}.4.4.3 Probabilistic CountersCounter stack streams contain the number of distinct blocks seen in the trace be-tween any two points in time (neglecting the effects of downsampling and pruning).The on-disk stream only needs to store this matrix of counts, as the examples in§ 4.3 suggested. The in-memory counter stack has a more difficult job – it must beable to update these counts while processing the trace, so each counter must keepan internal representation of the set of blocks it has seen.The naïve approach is for each counter to represent this set explicitly, but this wouldrequire quadratic memory usage (again, neglecting downsampling and pruning).54A slight improvement can be obtained through the use of Bloom filters [22], butfor an acceptable error tolerance, the space would still be prohibitively large. Ourapproach is to use a tool, called a probabilistic counter or cardinality estimator, thatwas developed over the past thirty years in the streaming algorithms and databasecommunities.Probabilistic counters consume extremely little space and have guaranteed accu-racy. The most practical of these is the HyperLogLog counter [47], which weuse in our implementation. Each count appearing in our on-disk stream is not thetrue count of distinct blocks, but rather an estimate produced by a HyperLogLogcounter which is correct up to multiplicative factor 1+ ε . The memory usage ofeach HyperLogLog counter is roughly logarithmic in M, with more accurate coun-ters requiring more space. More concretely, our evaluation discussed in § 4.7 usesas little as 53 MB of memory to process traces containing over a hundred millionrequests and distinct blocks.4.4.4 LRU Stack DistancesThe technique in § 4.3.2 for computing stack distances and MRCs using idealizedcounter stacks can be adapted to use practical counter stacks. The matrices ∆x and∆y are defined as before, but are now based on the downsampled, pruned matrixcontaining probabilistic counts. Previously we asserted that every column of ∆y iseither all zeros or contains a single 1. This is no longer true. The entry ∆yi j nowreports the number of requests since the counters were last updated whose stackdistance was approximately Ci j.To approximate the stack distances of all requests, we process all columns of thestream. As there may be many non-zero entries in the jth column of ∆y, we record∆yi j occurrences of stack distance Ci j for every i. As before, given all stack dis-tances we can compute the MRC.An online version of this approach which does not emit streams can produce anMRC of guaranteed accuracy using provably sublinear memory. In a companionpaper [41] we prove the following theorem. The key point is that the space depends55polynomially on ` and ε , the parameters controlling the precision of the MRC, butonly logarithmically on N, the length of the trace.Theorem 1 The online algorithm produces an estimated MRC that is correct towithin additive error ε at cache sizes 1`M,2`M,3`M, . . . ,M using onlyO(`2 log(M) log2(N)/ε2) bits of space, with high probability.4.5 The Counter Stack APIThe previous two sections have given an abstract view of counter stacks. In this sec-tion we describe the system that we have implemented based on those ideas. Thesystem is a flexible, memory-efficient library that can be used to process traces,produce counter stack streams, and perform queries on those streams. The work-flow of applications that use this library is illustrated in Figure On-disk StreamsThe on-disk streams output by the library are produced by periodically outputtinga new column of the matrix. As discussed in § 4.4, a new column is produced ifeither d requests have been observed in the trace or s seconds have elapsed (in thetrace’s time) since the last column was produced, except for idle periods, whichare elided. Each column is written to disk in a sparse format to incorporate the factthat pruning may cause numerous entries to be missing.In addition, the on-disk matrix C includes an extra row, called row R, which recordsthe raw number of requests observed in the stream. That is, CR j contains the totalnumber of requests processed at the time that the jth column is output. Finally, theon-disk stream also records the trace’s time of the current request.4.5.2 Compute QueriesThe counter stack library supports three computational queries on streams: RequestCount, Unique Request Count and MRC.56I/O Trace(per-device, volume, or object)CStackStreamWriterCS1 ReaderReaderReaderRequest CountMiss Ratio CurveUnique Request CountCS2CSmsliceshift...joinspecify computeCounter Stack Creation Query ExecutionCS1 CS2CSmCS3Figure 4.1: The counter stack library architectureThe first two query operations are straightforward but useful, as we will show in§ 4.8.4. The Request Count query simply asks for the total number of requeststhat occur in the stream, which is CR j where j is the index of the last column. TheUnique Request Count query is similar except that it asks for the total number ofunique requests, which is C1 j.The most complicated stream operation is the MRC query, which asks for the missratio curve of the given stream. This query is processed using the method describedin § Time Slicing and ShiftingIt is often useful to analyze only a subset of a given trace within a specific timeinterval. We refer to this time-based selection as slicing. It is similarly useful whenjoining traces to alter the time signature by a constant time interval. We refer to57this alteration as shifting.The counter stack library supports slicing and shifting as specification operations.Given a stream containing a matrix C, the stream for the time slice between timestep i and j is the submatrix with corners at Cii and C j j. Likewise, to obtain thestream for the trace shifted forward/backward s time units, we simply add/subtracts to each of the time indices associated with the rows and columns of the matrix.4.5.4 JoiningGiven two or more workloads, it is often useful to understand the behavior thatwould result if they were combined into a single workload. For example, if eachworkload is an IO trace of a different process, one may want to investigate the cacheperformance of those processes with a shared LRU cache.Counter stacks enable such analyses through the join operation. Given two counterstack streams, the desired output of the join operation is what one would obtainby merging the original two traces according to the traces’ times, then producing anew counter stack stream from that merged trace. Our library can produce this newstream using only the two given streams, without examining the original traces.The only assumption we require is that the two streams must access disjoint sets ofblocks.The join process would be simple if, for every i, the time of the ith request werethe same in both traces; in this case, we could simply add the matrices stored inthe two streams. Unfortunately that assumption is implausible, so more effort isrequired. The main ideas are to:• Expand the two matrices so that each has a row and column for every timethat appears in either trace.• Interpolate to fill in the new matrix entries.• Add the resulting matrices together.Let us illustrate this process with an example. Consider a trace A that requests58time 1:00 1:02 1:05 1:14 1:17A a b bCA 1 1 2 2 20 1 1 11 1 10 11B d dCB 0 1 1 1 11 1 1 10 1 11 10merge a d b d bCA +CB 1 2 3 3 31 2 2 21 2 21 21Figure 4.2: An example illustrating the join operationblocks (a,b,b) at times 1:00, 1:05, 1:17, and a trace B requests blocks (d,d) attimes 1:02 and 1:14. The merge of the two traces is as follows:time 1:00 1:02 1:05 1:14 1:17A a b bB d dmerge a d b d bTo join these streams, we must expand the matrices in the two streams so thateach has five rows and columns, corresponding to the five times that appear in thetraces. After this expansion, each matrix is missing entries corresponding to timesthat were missing in its trace. We fill in those missing entries by an interpolationprocess: a missing row is filled by copying the nearest row beneath it, and a missingcolumn is filled by copying the nearest column to the left of it. Figure 4.2 showsthe resulting matrices; interpolated values are shown in bold blue.Pruned counters can sometimes create negative values in ∆x. For example, afterpruning a counter in row j at time t, the interpolated value of the pruned counter at59t +1 is set to the nearest row beneath it, representing a younger counter. Often, thislower counter has a smaller value than the pruned counter. The interpolated value att+1 will then be less than its previous value at t, producing a negative intra-counterchange. We can avoid introducing negative values in ∆x by replacing any negativevalues in ∆x by the nearest nonnegative value beneath it. This replacement has thesame effect of changing the value of the pruned counter to the lower counter incolumn t prior to calculating the intra-counter change for the column representingt +1.4.6 Error and UncertaintyWhile each of the optimizations described in § 4.4 dramatically reduce the storagerequirements of counter stacks, they may also introduce uncertainty and error intothe final calculations. In this section, we discuss potential sources of error, as wellas how to modify the different operations described in § 4.3 to compute lower andupper bounds on the stack distances.4.6.1 Counter ErrorHyperLogLog counters introduce error in two ways: count estimation and simulta-neous register updates. HyperLogLog counters report a count of distinct elementsthat is only correct up to multiplicative factor ε , which is determined by a preci-sion parameter. This uncertainty produces deviation from the true MRC and can becontrolled by increasing the precision of the HyperLogLog counters, at the cost ofa greater memory requirement.Simultaneous register updates introduce a subtler form of error. A HyperLogLogcounter estimates unique counts by taking the harmonic mean of a set of internalvariables called registers. Due to the design of HLLs, sometimes a register updatemight cause the older counter to increase in value more than the younger counter.This phonemoneon leads to negative updates in ∆y, because older counters areexpected to change more slowly than younger counters. Theorem 1 implies that60the negative entries in the ∆y matrix introduced by simultaneous register updatesare offset by corresponding over-estimates when register modifications betweencounters are not simultaneous.In some cases, the histogram of stack distances may accumulate enough negativeentries that there are bins with negative counts. The cumulative sum of such a his-togram will result in a non-monotonic MRC. We can enforce a monotonic MRC byaccumulating any negative histogram bins in a separate counter, carrying the differ-ence forward in the cumulative sum and discounting positive bins by the negativecount. In practice, negative histogram entries make up less then one percent of thereported stack distances, with little to no visible effect on the accumulated MRC.4.6.2 Downsampling UncertaintyWhereas the scheme of § 4.3.2 computes stack distances exactly, the modifiedscheme of § 4.4.4 only computes approximations. This uncertainty in the stackdistances is caused by downsampling, pruning and use of probabilistic counters.To illustrate this, consider the example shown in Figure 4.3, and for simplicity letus ignore pruning and any probabilistic error.At every time step j, the finite differencing scheme uses the matrix ∆y to help esti-mate the stack distances for all requests that occurred since time step j−1. Moreconcretely, if such a request increases the (i+1)th counter but does not increase theith counter, then we know that the most recent occurrence of the requested blocklies somewhere between time step i and time step i+1. Since there may have beenmany requests between time i and time i+ 1, we do not have enough informationto determine the stack distance exactly, but we estimate it up to additive error d(the downsampling factor). A careful analysis can show that the request must havestack distance at least Ci+1, j−1 +1 and at most Ci j.61C 10 20 5015 5040R 100 200 300→∆x 10 10 3015 3540∆R 100 100 100→∆y 90 (1,10) 5 (1,20) 5 (16,50)85 (1,15) 5 (1,50)60 (1,40)Figure 4.3: An example of computing stack distances using a downsampled matrix.The entries of ∆y show the number of requests and the parenthesized valuesshow the bounds on the stack distances that we can infer for those requests.4.7 EvaluationIn this section we empirically validate two claims: (1) the time and space require-ments of counter stack processing are sufficiently low that it can be used for onlineanalysis of real storage workloads, and (2) the technique produces accurate, mean-ingful results.We use a well-studied collection of storage traces released by Microsoft Researchin Cambridge (MSR) [86] for much of our evaluation. The MSR traces record thedisk activity (captured beneath the file system cache) of 13 servers with a combinedtotal of 36 volumes. Notable workloads include a web proxy (prxy), a filer servingproject directories (proj), a pair of source control servers (src1 and src2), and aweb server (web). The raw traces comprise 417 million records and consume justover 5 GB in compressed CSV format.We compare our technique to the ‘ground truth’ obtained from full trace analysis(using trace trees, the tree-based optimization of Mattson’s algorithm [78, 89]),and, where applicable, to a recent approximation technique [125] which derivesestimated MRCs from average footprints (see § 4.9 for more details). For fairness,we modify the original implementation [37] by using a sparse dictionary to reducememory overhead.4.7.1 PerformanceThe following experiments were conducted on a Dell PowerEdge R720 with twosix-core Intel Xeon processors and 96 GB of RAM. Traces were read from high-62Fidelity Time Memory Throughput Storagelow 17.10 m 78.5 MB 2.31M reqs/sec 747 KBhigh 17.24 m 80.6 MB 2.29M reqs/sec 11 MBTable 4.1: The resources required to create low and high fidelity counter stacks forthe combined MSR workload (64 MB heap)performance flash to eliminate disk IO bottlenecks.Throughout this section we present figures for both ‘low’ and ‘high’ fidelity streams.We control the fidelity by adjusting the number of counters maintained in eachstream; the parameters used in these experiments represent just two points of awide spectrum, and were chosen in part to illustrate how accuracy can be tradedfor performance to meet individual needs.We first report the resources required to convert a raw storage trace to a counterstack stream. The memory footprint for the conversion process is quite modest:converting the entire set of MSR traces to high-fidelity counter stacks can be donewith about 80 MB of RAM 1. The processing time is low as well: our Java im-plementation can convert a trace to a high-fidelity stream at a rate of 2.3 millionrequests per second with a 64 MB heap and 2.7 million requests per second with a256 MB heap.The size of counter stack streams can also be controlled by adjusting fidelity. Ig-noring write requests, the full MSR workload consumes 2.9 GB in a compressed,binary format. We can reduce this to 854 MB by discarding latency values andcapping timestamp resolutions at one second, and we can shave off another 50 MBthrough domain-specific compaction techniques like delta-encoding time and offsetvalues. But as Table 4.1 shows, this is more than 70 times larger than a high-fidelitycounter stack representation.The compression achieved by counter stack streams is workload-dependent. High-1This is not a lower bound. Additional reductions can be achieved at the expense of increasedgarbage collection activity in the JVM; for example, enforcing a heap limit of 32 MB increasesprocessing time for the high-fidelity counter stack by about 30% and results in a peak resident setsize of 53 MB.63fidelity streams of the MSR workloads are anywhere from 12 (hm) to 1,024 (prxy)times smaller than their compressed binary counterparts, with larger traces tendingto compress better. A stream of the combined traces consumes just over 1.5 MBper day, meaning that weeks or even months of workload history can be retained atvery reasonable storage costs.Once a trace has been converted to a counter stack stream, performing queriesis very quick. For example, an MRC for the entire week-long MSR trace can becomputed from the counter stack stream in just seconds, with negligible memoryoverheads. By comparison, computing the same MRC using a trace tree takes aboutan hour and reaches a peak memory consumption of 92 GB, while the averagefootprint technique requires 8 and a half minutes and 23 GB of RAM.4.7.2 AccuracyFigure 4.4 shows miss ratio curves for each of the individual workloads containedin the MSR traces as well as the combined master trace; superimposed on thebaseline curves (showing the exact MRCs) are the curves computed using footprintaverages and counter stacks. Some of the workloads feature MRCs that are notablydifferent from the convex functions assumed in the past [109]. The web workloadis the most obvious example of this, and it is also the workload which causes themost trouble for the average footprint technique.Figure 4.5 shows three examples of MRCs produced by joining individual counterstacks. The choice of workloads is somewhat arbitrary; we elected to join work-loads of commensurate size so that each would contribute equally to the resultingmerged MRC. As described in § 4.5.4, the join operation can introduce additionaluncertainty due to the need to infer the values of missing counters, but the effectsare not prominent with the high-fidelity counter stacks used in these examples.We performed an analysis of curve errors at different fidelities, with verylow (δ =0.46, d = 19M, s = 32K) at one extreme and high (δ = 0.01, d = 1M, s = 60) atthe other. To measure curve error, we use the Mean Absolute Error (MAE) betweena given curve and its ground-truth counterpart. The MAE is defined as the average64absolute difference between two series mrc and mrc′, or 1N ∑ |mrc(x)−mrc′(x)|.Because MRCs range between 0 and 1, the MAEs are also confined to the samerange, where a value of 0 implies perfectly corresponding curves. At the otherextreme, it is difficult to know what constitutes a “bad” MAE because it is unlikelyto be close to 1 except in singular cases. For example, the MAE between the hmand the ts Mattson curves is only 0.15. For the high fidelity counter stacks, weobserve MAEs between 0.002 and 0.02, and for the average footprint algorithm,we observe MAEs between 0.001 and 0.04.We find that curve error under compression is highly workload-dependent. Weobserved the largest errors on “jagged” workloads with sharp discontinuities, suchas src1 and web, while workloads with “flatter” MRCs such as stg and usr arealmost invariant to compression. Figure 4.6 summarizes our findings on two suchworkloads. On the left, we illustrate the difference in the change in error as fidelitydecreases for a jagged workload, src1, and a flat workload, usr. On the right, weshow the smoothing effect of decreasing the counter stack fidelity by comparingthe verylow and high fidelity curves against Mattson on src1.4.8 Workload AnalysisWe have shown that counter stacks can be used to produce accurate MRC estima-tions in a fraction of the time and space used by existing techniques. We nowdemonstrate some of the capabilities of the counter stack query interface through aseries of case studies of the MSR traces.4.8.1 Combined WorkloadsHit rates are often used to gauge the health of a storage system: high hit rates areconsidered a sign that a system is functioning properly, while poor hit rates suggestthat tuning or configuration changes may be required. One problem with this sim-plistic view is that the combined hit rates of multiple independent workloads canbe dominated by a single workload, thereby hiding potential problems.65hm mds prnproj prxy rsrchsrc1 src2 stgts usr wdevweb master0.000.250.500.751. 0.5 1.0 1.5 2.0 0 25 50 75 0 20 40 60 800 400 800 1200 0.0 0.5 1.0 1.5 2.0 0.0 0.2 0.4 0.60 50 100 150 200 250 0 10 20 30 40 0 25 50 750.0 0.1 0.2 0.3 0.4 0.5 0 250 500 750 1000 0.00 0.05 0.10 0.15 0.200 20 40 60 80 0 1000 2000Cache Size (GB)Miss RatioAlgorithmavgfpcs−highcs−lowmattsonFigure 4.4: MSR miss ratio curves66hm−rsrch−merged src2−prn−merged stg−web−merged0.000.250.500.751.000 1 2 0 30 60 90 120 0 50 100 150Cache Size (GB)Miss RatioAlgorithm cs mattsonFigure 4.5: MRCs for various combinations of MSR workloads (produced by the joinoperation)verylowCounter Stack Error Trends0. 20 40 60Counter Stack Size (KB)Mean Aboslute ErrorVMsrc1usrsrc10.250.500.751.000 50 100 150 200 250Cache Size (GB)Miss Rate QualityhighmattsonverylowFigure 4.6: The qualitative effect of counter stack fidelity is workload-dependent.On the left, we show the curve error and file sizes of different fidelities. Theusr workload is robust to compression to very low fidelity, while the src1workload degrades progressively. On the right, we show the visual outcome ofcompression to both high and verylow fidelity on src1.We find this is indeed the case for the MSR traces. The prxy workload features asmall working set and a high activity rate – it accesses only 2 GB of unique dataover the entire week but issues 15% of all read requests in the combined trace.Table 4.2 puts this in perspective: the combined workload achieves a hit rate of50% with a 550 GB cache; more than 250 GB of additional cache capacity wouldbe required to achieve this same hit rate without the prxyworkload. This illustrateswhy combined hit rate is not an adequate metric of system behavior. Diagnostictools which present hit rates as an indicator of storage well-being should be carefulto consider workloads independently as well as in combination.67Desired Hit Rate Required Cache SizeWith prxy Without prxy30% 2.5 GB 21.6 GB40% 19.2 GB 525.5 GB50% 566.6 GB 816.0 GBTable 4.2: Cache sizes required to obtain desired hit rates for combined MSR work-loads with and without prxy4.8.2 Erratic WorkloadsMRCs can be very sensitive to anomalous events. A one-off bulk read in the middleof an otherwise cache-friendly workload can produce an MRC with high miss rates,arguably mischaracterizing the workload. We wrote a simple script that identifieserratic workloads by searching for hour-long slices with unusually high miss ratios.The script found several workloads, including mds, stg, ts, and prn, whose week-long MRCs are dominated by just a few hours of intense activity.Figure 4.7 shows the effect these bursts can have on workload performance. Thefull-week MRC for prn (Figure 4.4) shows a maximum achievable hit rate of 60%at a cache size of 83 GB. The workload features a two-hour read burst starting 102hours into the trace which accounts for 29% of the total requests and 69% of theunique blocks. Time-sliced MRCs before and after this burst feature hit rates of60% at cache sizes of 10 GB and 12 GB, respectively. This is a clear exampleof how anomalous events can significantly distort MRCs, and it shows why it isimportant to consider MRCs over various intervals in time, especially for long-livedworkloads.4.8.3 Conflicting WorkloadsMany real-world workloads exhibit pronounced diurnal patterns: interactive work-loads typically reflect natural trends in business hours, while automatic workloadsare often scheduled at regular intervals throughout the day [43, 72, 101]. When68hours 0 − 101 hours 101 − 103 hours 103 − 1680.000.250.500.751.000 5 10 15 0 20 40 0 5 10 15 20Cache Size (GB)Miss RatioFigure 4.7: Time-sliced prn workloadsuch workloads are served by the same shared storage, it makes sense to try tolimit the degree to which they interfere with one another.The time-shifting functionality of counter stacks provides a powerful tool for ex-ploring coarse-grain scheduling of workloads. To demonstrate this, we wrote ascript which computes the MRCs of the combined MSR trace (excluding prxy) inwhich the start times of a few of the larger workloads (proj, src1, and usr) areshifted by up to six hours. Figure 4.8 plots the best and worst MRCs computed bythis script. As is evident, workload scheduling can significantly affect hit rates. Inthis case, shifting workloads by just a few hours changes the capacity needed for a50% hit rate by almost 50%.4.8.4 Periodic WorkloadsMRCs are good at characterizing the raw capacity needed to accommodate a givenworking set, but they provide very little information about how that capacity isused over time. In environments where many workloads share a common cache,this lack of temporal information can be problematic. For example, as Figure 4.4shows, the entire working set of web is less than 80 GB, and it exhibits a hit rateof 75% with a dedicated cache at this size. However, as shown in Figure 4.9, theworkload is highly periodic and is idle for all but a few hours every day.69675 GB1 TBcombined workloads0.000.250.500.751.000 500 1000 1500Cache Size (GB)Miss Ratio SchedulebestworstFigure 4.8: Best and worst time-shifted MRCs for MSR workloads (excluding prxy).We omit cache sizes greater than 1.5 TB to preserve details in the plot.This behavior is characteristic of automated tasks like nightly backups and index-ing jobs, and it can be problematic because periodic workloads with long reusedistances tend to perform poorly in shared caches. The cost of this is twofold:first, the periodic workloads exhibit low hit rates because their long reuse distancesgive them low priority in LRU caches; and second, they can penalize other work-loads by repeatedly displacing ‘hotter’ data. This is exactly what happens to webin a cache shared with the rest of the MSR workloads: despite its modest workingset size and high locality, it achieves a hit rate of just 7.5% in a 250 GB cache and20% in a 500 GB cache.Scan-resistant replacement policies like ARC [79] and CAR [17] offer one defenseagainst this poor behavior by limiting the cache churn induced by periodic work-loads. But a better approach might be to the exploit the highly regular nature ofsuch workloads – assuming they can be identified – through intelligent prefetch-ing. Counter stacks are well-suited for this task because they make it easy to detectperiodic accesses to non-unique data. While this alone would not be sufficientto implement intelligent prefetching (because the counters do not indicate which70web0e+003e+066e+069e+060 50 100 150HourRequests TypetotaluniqueFigure 4.9: web total and unique requests per hourblocks should be prefetched), it could be used to alert the system of the recurringpattern and initiate the capture of a more detailed trace for subsequent analysis.4.8.5 Zipfian WorkloadsWe end with a brief discussion of synthetic workload generators like FIO [14] andIOMeter [105]. These tools are commonly used to test and validate storage sys-tems. They are capable of generating IO workloads based on parameters describing,among other things, read/write mix, queue depth, request size, and sequentiality.The simpler among them support various combinations of random and sequentialpatterns; FIO recently added support for pareto and zipfian distributions, with thegoal of better approximating real-world workloads.Moving from uniform to zipfian distributions is a step in the right direction. In-deed, many of the MSR workloads, including hm, mds, and prn, exhibit roughlyzipfian distributions. However, as is evident in Figure 4.4, the MRCs of these work-loads vary dramatically. Figure 4.10 plots the MRC of a perfectly zipfian workloadproduced by FIO alongside two permutations of the same workload; as expected,71zipf0.000.250.500.751.000.00 0.05 0.10 0.15 0.20Cache Size (GB)Miss Ratio OrderrandomseriessortedFigure 4.10: MRCs for three permutations of a single zipfian distribution: random,series (a concatenation of sorted series of unique requests), and sorted(truncated to preserve detail).request ordering has a significant impact on locality and cache behavior. Thesefigures show that synthetic zipfian workloads do not necessarily produce ‘realistic’MRCs, emphasizing the importance of using real-world workloads when evaluatingstorage performance.4.9 Related WorkMattson et al. [78] defined stack distances and presented a simple O(NM) time,O(M) space algorithm to calculate them. Bennett and Kruskal [20] used a tree-based implementation to bring the runtime to O(N log(N)). Almási et al. improvedthis to O(N log(M)), and Niu et al. [89] introduced a parallel algorithm.A different line of work explores techniques to efficiently approximate stack dis-tances. Eklov and Hagersten [42] proposed a method to estimate stack distancesbased on sampling. Ding and Zhong [38] use an approximation technique inspiredby the tree-based algorithms. Xiang et al. [125] define the footprint of a given tracewindow to be the number of distinct blocks occurring in the window. Using reuse72distances, they estimate the average footprint across a logarithmic scale of windowlengths. Xiang et al. [126] then develop a theory connecting the average footprintand the miss ratio, contingent on a regularity condition they call the reuse-windowhypothesis. In comparison, counter stacks use dramatically less memory whileproducing MRCs with comparable accuracy.A large body of work from the storage community explores methods for repre-senting workloads concisely. Chen et al. [31] use machine learning techniquesto extract workload features, Tarasov et al. [111] describe workloads with featurematrices, and Delimitrou et al. [35] model workloads with Markov Chains. Theserepresentations are largely incomparable to counter stacks – they capture many de-tails that are not preserved in counter stack streams, but they discard much of thetemporal information required to compute accurate MRCs.Many domain-specific compression techniques have been proposed to reduce thecost of storing and processing workload traces. These date back to Smith’s stackdeletion [106] and include Burtscher’s VPC compression algorithms [24]. Theygenerally preserve more information than counter stacks but achieve lower com-pression ratios. They do not offer new techniques for MRC computation.4.10 ConclusionSizing the tiers of a hierarchical memory system and managing data placmentacross them is a difficult, workload dependent problem. Techniques such as missratio curve estimation have existed for decades as a method of modeling workloadbehaviors offline, but their computational and memory overheads have preventedtheir incorporation as a means to make live decisions in real systems. Even as anoffline tool, practical issues such as the overheads associated with trace collectionand storage often prevent the sharing and analysis of memory access traces.Counter stacks provide a powerful software tool to address these issues. They area compact form of locality characterization that allow workloads to be studied innew interactive ways, for instance by searching for anomalies or shifting workloadsto identify pathological load possibilities. They can also be incorporated directly73into system design as a means of making more informed and workload-specificdecisions about resource allocation across multiple tenants.While the design and implementation of counter stacks described in this chapterhave been motivated through the design of an enterprise storage system, the tech-niques are relevant in other domains, such as processor architecture, where theanalysis of working set size over time and across workloads is critical to the designof efficient, high-performance systems.74Chapter 5Mirador: An Active ControlPlane for Datacenter StorageA version of this chapter was published at the 15th USENIX Conference on Fileand Storage Technologies in 2017 [120].5.1 IntroductionIn becoming an active resource within the datacenter, storage is now similar tothe compute and network resources to which it attaches. For those resources, re-cent years have seen a reorganization of software stacks to cleanly disentangle thenotions of control and data paths. This thrust toward “software defined” systemsaims for designs in which virtualized resources may be provisioned on demand andin which central control logic allows the programmatic management of resourceplacement in support of scale, efficiency, and performance.This chapter observes that modern storage systems both warrant and demand ex-actly this approach to design. The emergence of high-performance rack-scale hard-ware [13, 44, 92] is amplifying the importance of connectivity between applicationworkloads and their data as a critical aspect of efficient datacenter design. Fortu-75nately, the resource programmability introduced by software defined networks andthe low cost of data migration on non-volatile memory means that the dynamicreconfiguration of a storage system is achievable.How is dynamic placement useful in the context of storage? First, consider thatnetwork topology has become a very significant factor in distributed storage de-signs. Driven by the fact that intra-rack bandwidth continues to outpace east/westlinks and that storage device latencies are approaching that of Ethernet round-triptimes, efficient storage placement should ensure that data is placed in the same rackas the workloads that access it, and that network load is actively balanced acrossphysical links.A separate goal of distributing replicas across isolated failure domains requires asimilar understanding of physical and network topology, but may act in oppositionto the goal of performance and efficiency mentioned above. While placement goalssuch as these examples can be motivated and described in relatively simple terms,the resulting placement problem is multi-dimensional and continuously changing,and so very challenging to solve.Mirador is a dynamic storage placement service that addresses exactly this prob-lem. Built as a component within a scale-out enterprise storage product [34], Mi-rador’s role is to translate configuration intention as specified by a set of objectivefunctions into appropriate placement decisions that continuously optimize for per-formance, efficiency, and safety. The broader storage system that Mirador controlsis capable of dynamically migrating both the placement of individual chunks ofdata and the client network connections that are used to access them. Mirador bor-rows techniques from dynamic constraint satisfaction to allow multi-dimensionalgoals to be expressed and satisfied dynamically in response to evolutions in envi-ronment, scale, and workloads.This chapter describes our experience in designing and building Mirador, which isthe second full version of a placement service we have built. Our contributions arethreefold: We demonstrate that robust placement policies can be defined as sim-ple declarative objective functions and that general-purpose solvers can be used tofind solutions that apply these constraints to both network traffic and data place-76ment in a production storage system, advancing the application of optimizationtechniques to the storage configuration problem [1, 8, 10, 11, 110]. We show thatfor performance-dense storage clusters, placement decisions informed by the rel-ative capabilities of network and storage tiers can yield improvements over morestatic layouts originally developed for large collections of disks. And finally, weinvestigate techniques for exploiting longitudinal workload profiling to craft cus-tom placement policies that lead to additional improvements in performance andcost-efficiency.5.2 A Control Plane for Datacenter StorageMirador implements the control plane of a scale-out enterprise storage systemwhich presents network-attached block devices for use by virtual machines (VMs),much like Amazon’s Elastic Block Store [19]. A typical deployment consists ofone or more independent storage nodes populated with performance-dense NVMedevices, each capable of sustaining random-access throughputs of hundreds ofthousands of IOPS. In order to capitalize on the low latency of these devices, stor-age nodes are commonly embedded horizontally throughout the datacenter along-side the compute nodes they serve. In this environment, Mirador’s role is to pro-vide a centralized placement service that continuously monitors the storage systemand coordinates the migration of both data and network connections in response toworkload and environmental changes.A guiding design principle of Mirador is that placement decisions should be dy-namic and flexible.Dynamic placement decisions allow the system to adapt to environmental change.We regularly observe deployments of hundreds to thousands of VMs where onlya small number of workloads dominate resource consumption across the clusterat any given time. Moreover, the membership of this set often changes as VMsare created and deleted or they transition through different workload phases. Forthese reasons, the initial choices made when placing data in the cluster may notalways be the best ones; significant improvements can often be had by periodically77re-evaluating placement decisions over time in response to changes in workloadbehavior.Flexible placement decisions allow the system to articulate complex and multidi-mensional policy. Rather than trying to combine diverse and often conflicting goalsin a single monolithic description, Mirador approaches system configuration as asearch problem. Policies are composed of one or more objective functions, simplerules that express how resources should be allocated by computing numerical costsfor specific configurations. A planning engine employs established constraint sat-isfaction techniques to efficiently search the configuration space for a minimal-costsolution.In our experience, policies expressed as simple independent rules are substantiallymore perspicuous and robust than their monolithic alternatives. For example, afterupgrading the customized planning engine that shipped in an early version of theproduct to a generic constraint solver, we were able to replace a load balancingpolicy originally defined in 2,000 lines of imperative Python with a similar policycomposed of seven simple rules each expressed in less than thirty lines of code (see§ 5.3.2 for examples). Much of the complexity of the original policy came fromdescribing how it should be realized rather than what it intended to achieve. Bydisentangling these two questions and answering the former with a generic searchalgorithm, we arrived at a policy description that is equally efficient as the firstversion, yet much easier to reason about and maintain.Mirador implements the configuration changes recommended by the planning en-gine by coordinating a cluster-wide schedule of data and network migration tasks,taking care to minimize the performance impact on client workloads. It communi-cates directly with switches and storage nodes to effect these migrations, continu-ally monitoring system performance as it does so. In this way it actively respondsto environmental and workload changes and results in a more responsive, robustsystem.7879Observe (§3.1) Optimize (§3.2)Platform Support (§3.4)Actuate (§3.3)System Monitor Planning Engine SchedulerPhysical SystemIn-memory storage system model Objective Functionsdata placementnotification migrationprioritizationStorage Node Storage Node Storage Node Storage Node Storage Node Storage Nodenetwork connectionsphysical topologyload / livenessClient ClientClientSDN Switch SDN Switchobject object object object object objectobject object object objectobject object object objectSolver Plugin(e.g. greedy, branch+bound)Task SchedulerData placement actuationData migrated by triggeringpeer-wise background copies.Network flow actuationNetwork reconfigurations sentto SDN controller.Monitoring Daemonnotifications pollingactiveplan subtasks work queuesrule ReplicaPlacement:  solutionHints[]  dependentVariables[]  eval(proposedModel):    ...    return scoreplanning requestenviron-mental changesnewactive planactuationrequestsbackground object migrationoverloadedphysical linkFigure 5.1: The storage system architecture (below) and the Mirador rebalance pipeline (above). The figure shows two examples ofthe system performing actuations in response to observed state. First, the fourth storage node has become disproportionatelyfull relative to the other nodes. To balance capacity in the system, the rightmost object on that node is undergoing backgroundmigration to the third node. Second, the physical network link into the left side port of the second storage node has comeunder pressure from two high-volume flows from the first two clients. The system will observe this overload, and then choseone of the flows to migrate to a different physical link.5.3 MiradorMirador is a highly-available data placement service that is part of a commercialscale-out storage product. Figure 5.1 presents a typical cluster composed of mul-tiple storage nodes. Each node is a regular server populated with one or moredirectly-attached, non-volatile storage devices. Nodes implement an object inter-face on top of these devices and manage virtual to physical address translationsinternally. Objects present sparse 63-bit address spaces and are the primary unit ofplacement. A virtual block device interface is presented to clients. Virtual devicesmay be composed of one or more objects distributed across multiple nodes; bydefault, they are striped across 16 objects, resulting in typical object sizes on theorder of tens to hundreds of GiB.The storage cluster is fronted by a set of Software Defined Network (SDN) switchesthat export the cluster over a single virtual IP address. Clients connect to the virtualIP and are directed to storage nodes by a custom SDN controller. Nodes are con-nected in a mesh topology, and any node is capable of servicing requests from anyclient, allowing the mapping between clients and nodes to be modified arbitrarily.One or more nodes in the cluster participate as a Mirador service provider. Serviceproviders work together to monitor the state of the cluster and initiate rebalancejobs in response to topology and load changes. Rebalance jobs are structured asa control pipeline that generates and executes plans for dynamically reconfiguringthe placement of data and client connections in order to optimize for performance,efficiency, and safety. Job state is periodically checkpointed in a replicated statemachine [59], providing strong resliency against failures.The rebalance pipeline is composed of three stages:Observation A system monitor collects resource metrics like device and networkload along with detailed workload profiles to construct a model of the cluster.80Optimization A planning engine computes a numerical cost for the current con-figuration and searches for alternative configurations that would reduce or elimi-nate this cost. If a lower-cost arrangement is identified, a plan is constructed thatyields the desired results.Actuation A scheduler implements the plan by coordinating the migration of dataand client connections.5.3.1 ObservationThe system monitor maintains a storage system model that captures all relevantproperties of the physical system, including static features like cluster topology(e.g., the number of devices and nodes, the capacity of their network links, anduser-defined failure domains) and dynamic features like the current free space andIO load of devices and the utilization of network ports.The monitor also collects highly-compressed sketches of individual workload be-havior [122]. These summaries are collected by a dedicated workload analysis ser-vice, and they include features such as miss ratio curves and windowed footprints.Unlike hardware utilization levels, this data cannot be computed from instanta-neous measurements, but instead requires detailed profiling of workloads over ex-tended periods of time.The monitor synchronizes the model by polling the cluster; sampling frequenciesvary from every few seconds for metrics like link load to tens of minutes for work-load footprint measurements, while exceptional events such as device failures aresignalled via special alerts.5.3.2 OptimizationThe planning engine implements the logic responsible for generating rebalanceplans. Placement logic is encapsulated in one or more objective functions thatspecify rules for how data and flows should be distributed across the cluster. The81engine invokes a solver to search for new configurations that reduce placementcosts, as defined by the objective functions.The planning engine manipulates a copy of the storage model when consideringalternative configurations. For example, if a decision is made to move an objectfrom one device to another, the modelled free space and load of each device isadjusted to reflect the change.Modelling data migration within the cluster is a challenging problem. While anobject’s size serves as a rough approximation of the cost of migrating it, the actualtime required to move the data depends on many things, including the type andload of the source and destination devices, network contention along the migrationpath, and fragmentation of the data being migrated. This is important, however,because system resources like free space and bandwidth may be consumed at boththe source and destination devices during migration, and the solver may make poordecisions if this usage is modelled incorrectly. For this reason, migrations initiatedduring the optimization stage are modelled conservatively by reserving space onthe destination device at the beginning of operation and only releasing it from thesource device once the migration has completed.Objective FunctionsData placement is expressed as an optimization problem by representing objectsand flows as variables and devices and links as the values these variables can take,respectively. Within this framework, objective functions model the cost (or ben-efit) of assigning a value to a given variable (e.g., placing a replica on a specificdevice). 1Mirador objective functions can assign arbitrary numerical costs to a given configu-ration. Hard constraints, implemented by rules imposing an infinite cost, can neverbe violated – any configuration with an infinite cost is rejected outright. Negativecosts can also be used to express affinities for preferred assignments. An optimal1For clarity of exposition, we use the terms objective function and rule interchangably throughoutthe chapter.82configuration is one that minimizes the cumulative cost of all assignments; solversemploy various search strategies to find minimal-cost solutions. In the case that nofinite-cost configuration can be found (e.g., due to catastrophic hardware failure),Mirador raises an alert that manual intervention is required.Objective functions are expressed as simple Python functions operating on the stor-age system model described above. Listing 5.1 shows a rule designed to minimizeload imbalances by stipulating that the spread between the most- and least-loadeddevices falls within a given range. (Note that this formulation codifies a system-level notion of balance by assigning costs to all objects located on overloaded de-vices; moving just one such object to a different device may be enough to eliminatethe cost for all the remaining objects.) During the optimization stage, the plan-ning engine converts the storage model into an abstract representation of variables,values, and objectives, and computes the cost of each assignment by invoking itsassociated rules (see § 5.3.2).A special annotation specifies the scope of the rule, indicating which componentsit affects (e.g., objects, connections, devices, links). Solvers refer to these annota-tions when determining which rules need to be re-evaluated during configurationchanges. For example, the load_balanced rule affects devices, and must beinvoked whenever the contents of a device changes.Mutual objectives can be defined over multiple related objects. For instance, List-ing 5.2 gives the implementation of a rule stipulating that no two objects in a replicaset reside on the same device; it could easily be extended to include broader knowl-edge of rack and warehouse topology as well. Whenever a solver assigns a newvalue to a variable affected by a mutual objective, it must also re-evaluate all re-lated variables (e.g., all other replicas in the replica set), as their costs may havechanged as a consequence of the reassignment.Rules can provide hints to the solver to help prune the search space. Rule imple-mentations accept a domain argument, which gives a dictionary of the values thatcan be assigned to the variable under consideration, and is initially empty. Rulesare free to update this dictionary with the expected cost that would be incurred byassigning a particular value. For example, the rule in Listing 5.2 populates a given83replica’s domain with the pre-computed cost of moving it onto any device alreadyhosting one of its copies, thereby deprioritizing these devices during the search.The intuition behind this optimization is that most rules in the system only affect asmall subset of the possible values a variable can take, and consequently, a handfulof carefully chosen hints can efficiently prune a large portion of the solution space.A policy consists of one or more rules, which can be restricted to specific hardwarecomponents or object groups in support of multi-tenant deployments.@rule(model.Device)def load_balanced(fs, device, domain):cost, penalty = 0, DEVICE_BALANCED_COST# compute load of current device# for the current sample intervalload = device.load()# compute load of least-loaded deviceminload = fs.mindevice().load()if load − minload > LOAD_SPREAD:# if the difference is too large,# the current device is overloadedcost = penaltyreturn costListing 5.1: Load balancing rule@rule(model.ReplicaSet)def rplset_devices_unique(fs, replica, domain):cost, penalty = 0, INFINITYfor rpl in replica.rplset:if rpl is replica:# skip current replicacontinueif rpl.device is replica.device:# two replicas on the same device# violate redundancy constraintcost = penalty# provide a hint to the solver that the# devices already hosting this replica set# are poor candidates for this replica.domain[rpl.device] += penaltyreturn costListing 5.2: Hardware redundancy rule84SolversThe planning engine is written in a modular way, making it easy to implementmultiple solvers with different search strategies. Solvers accept three arguments: adictionary of assignments mapping variables to their current values, a dictionary ofdomains mapping variables to all possible values they can take, and a dictionary ofobjectives mapping variables to the rules they must satisfy. Newly-added variablesmay have no assignment to start with, indicating that they have not yet been placedin the system. Solvers generate a sequence of solutions, dictionaries mapping vari-ables to their new values. The planning engine iterates through this sequence ofsolutions until it finds one with an acceptable cost, or no more solutions can befound.Mirador provides a pluggable solver interface that abstracts all knowledge of thestorage model described abover. Solvers implement generic search algorithms andare free to employ standard optimization techniques like forward checking [54] andconstraint propagation [77] to improve performance and solution quality.We initially experimented with a branch and bound solver [97] because at firstglance it fits well with our typical use case of soft constraints in a dense solutionspace [48]. A key challenge to using backtracking algorithms for data placement,however, is that these algorithms frequently yield solutions that are very differentfrom their initial assignments. Because reassigning variables in this context mayimply migrating a large amount of data from one device to another, this propertycan be quite onerous in practice. One way to address this is to add a rule whosecost is proportional to the difference between the solution and its initial assign-ment (as measured, for example, by its Hamming distance) [55]. However, thistechnique precludes zero-cost reconfigurations (since every reassignment incurs acost) and thus requires careful tuning when determining whether a solution with anacceptable cost has been found.We eventually adopted a simpler greedy algorithm. While it is not guaranteed toidentify optimal solutions in every case, we find in practice that it yields qualitysolutions with fewer reassignments and a much more predictable run time. In fact,85the greedy algorithm has been shown to be a 2-approximate solution for the relatedmakespan problem [52], and it is a natural fit for load rebalancing as well [3].Listing 5.3 presents a simplified implementation of the greedy solver. It main-tains a priority queue of variables that are currently violating rules, ordered by thecost of the violations, and a priority-ordered domain for each variable specifyingthe possible values it can take. A pluggable module updates domain priorities inresponse to variable reassignments, making it possible to model capacity and loadchanges as the solver permutes the system searching for a solution. The current im-plementation prioritizes values according to various utilization metrics, includingfree space and load.As described in § 5.3.2, objective functions can provide hints to the solver aboutpotential assignments. The greedy algorithm uses these hints to augment the pri-ority order defined by the storage system model, so that values that would violaterules are deprioritized. The search is performed in a single pass over all variables,starting with the highest-cost variables. First the rules for the variable are invokedto determine whether any values in its domain violate the prescribed placementobjectives (or alternatively, satisfy placement affinities). If the rules identify a zeroor negative-cost assignment, this is chosen. Otherwise, the highest-priority uncon-strained value is selected from the variable’s domain. The search yields its solutiononce all violations have been resolved or all variables have been evaluated.Besides its predictable run time, the greedy algorithm generally yields low mi-gration overheads, since only variables that are violating rules are considered forreassignment. However, if the initial assignments are poor, the algorithm can gettrapped in local minima and fail to find a zero-cost solution. In this case, a sec-ond pass clears the assignment of a group of the costliest variables collectively,providing more freedom for the solver, but potentially incurring higher migrationcosts. We find that this second pass is rarely necessary given the typically under-constrained policies we use in production and is limited almost exclusively to unittests that intentionally stress the planning engine (see § 5.5 for more details).86def greedy(assignments, domains, objectives):# rank variables according to costqueue = PriorityQueue(domains)while queue.cost() > 0:# select the highest-cost variableval = Nonevar = queue.pop()cur = assignments.get(var)domain = domains[var]# retrieve the variable’s current cost and any domain hints provided# by the rulescost, hints = score(var, cur, objectives)if cost <= 0:continue # current assignment is goodif hints:# find the lowest-cost hint; typically, most values are# unconstrained, so this linear scan adds a small constant overheadtry:val = min(v for v in hints if v in domain and v ! = cur)except ValueError:passif val is None or hints[val] > 0:# if we have no hints, or the best hints are costly, choose the# lowest-cost unconstrained value in the domainval = next((v for v in domain if v not in hints and v ! = cur), val)if val is None:c = infinity # couldn’t find a valueelse:c, _ = score(var, val, objectives) # compute cost of new valueif c >= cost:continue # no benefit to re-assigningassignments[var] = val # we found a better assignment# recompute the cost of any mutually-constrained variables that# haven’t already been evaluatedfor v in rulemap(var, objectives):if v in queue:queue.reschedule(v)return assignments # we’ve arrived at a solutionListing 5.3: Greedy solver875.3.3 ActuationMirador can migrate both data and client connections. The scheduler models thecost of data migration conservatively, and attempts to minimize the impact of suchmigrations on client performance whenever possible. Connection migrations aregenerally cheaper to perform and as such occur much more frequently – on theorder of minutes rather than hours.Optimally scheduling data migration tasks is NP-hard [65–67]; Mirador imple-ments a simple global scheduler that parallelizes migrations as much as possiblewithout overloading individual devices or links.Data migrations are performed in two steps: first, a background task copies anobject to the destination device, and then, only after the object is fully replicatedat the destination, it is removed from the source. This ensures that the durabilityof the object is never compromised during migration. Client connections are mi-grated using standard SDN routing APIs augmented by custom protocol handlersthat facilitate session state handover.5.3.4 Platform SupportMirador executes rebalance jobs in batches by (1) selecting a group of objectsand/or client connections to inspect, (2) invoking the planning engine to searchfor alternative configurations for these entities, and (3) coordinating the migrationtasks required to achieve the new layout. Batches can overlap, allowing parallelismacross the three stages. Mirador attempts to prioritize the worst offenders in earlybatches in order to minimize actuation costs, but it guarantees that every object isprocessed at least once during every job.Mirador is able to perform its job efficiently thanks to three unique features pro-vided by the storage platform. First, the system monitor relies on a notificationfacility provided by the cluster metadata service to quickly identify objects thathave been recently created or modified. This allows nodes in the cluster to makequick, conservative placement decisions on the data path while making it easy for88Name Objective Cost Lines of Codedevice_has_space devices are not filled beyond capacity ∞ 4rplset_durable replica sets are adequately replicated onhealthy devices∞ 4load_balanced load is balanced across devices 70 13links_balanced load is balanced across links 20 13node_local client files are co-located on commonnodes60 30direct_connect client connections are routed directly totheir most-frequently accessed nodes10 14wss_best_fit active working set sizes do not exceedflash capacities40 4isolated cache-unfriendly workloads areco-located20 30co_scheduled competing periodic workloads areisolated20 35Table 5.1: Objective functions used in evaluation section; cost gives the penalty in-curred for violating the rule.Mirador to inspect and modify these decisions in a timely manner, providing astrong decoupling of data and control paths. Second, the planning engine makesuse of a prioritization interface implemented at each node that accepts a metricidentifier as an argument (e.g., network or disk throughput, storage IOPS or capac-ity) and returns a list of the busiest workloads currently being serviced by the node.Mirador can use this to inspect problematic offenders first when attempting to min-imize specific objective functions (such as load balancing and capacity constraints)rather than inspecting objects in arbitrary order. Finally, the actuation schedulerimplements plans with the help of a migration routine that performs optimizedbackground copies of objects across nodes and supports online reconfiguration ofobject metadata. This interface also provides hooks to the network controller tomigrate connections and session state across nodes.89Objects Devices Reconfigurations Time (seconds)1K 10 6.40±2.72 0.40±0.061K 100 145.50±33.23 0.83±0.081K 1000 220.00±12.53 10.11±0.4910K 10 0.00±0.00 1.61±0.0110K 100 55.70±5.46 5.54±0.3710K 1000 1475.00±69.70 16.71±0.88100K 10 0.00±0.00 17.10±0.37100K 100 9.30±4.62 22.37±5.38100K 1000 573.80±22.44 77.21±2.87Table 5.2: Greedy solver runtime for various deployment sizes with a basic load-balancing policy; reconfigurations gives the number of changes made to yield azero-cost solution.5.4 EvaluationIn this section we explore both the expressive power of Mirador policies and theimpact such policies can have on real storage workloads. Table 5.1 lists the rulesfeatured in this section; some have been used in production deployments for overa year, while others are presented to demonstrate the breadth and variety of place-ment strategies enabled by Mirador.§ 5.4.1 measures the performance and scalability of the planning engine, indepen-dent of storage hardware. § 5.4.2 shows how Mirador performs in representativeenterprise configurations; storage nodes in this section are equipped with 12 1 TBSSDs, two 10 gigabit Ethernet ports, 64 GiB of RAM, and 2 Xeon E5-2620 proces-sors at 2 GHz with 6 cores each and hyperthreading enabled. § 5.4.3 and § 5.4.4highlight the flexibility of rule-based policies, as measured on a smaller develop-ment cluster where 2 800 GB Intel 910 PCIe flash cards replace the 12 SSDs oneach node.Client workloads run in virtual machines hosted on four Dell PowerEdge r420boxes running VMware ESXi 6.0, each with two 10 gigabit Ethernet ports, 64 GiBof RAM, and 2 Xeon ES-2470 processors at 2.3 GHz with 8 cores and hyperthread-ing enabled. Clients connect to storage nodes using NFSv3 via a dedicated 48-portSDN-controlled Arista 7050Tx switch, and VM disk images are striped across six-90teen objects.5.4.1 OptimizationWe begin by benchmarking the greedy solver, which is used in all subsequent ex-periments. Given rules that run in constant time, this solver has a computationalcomplexity of O(N logN logM) for a system with N objects and M devices.We measure solver runtime when enforcing a simple load-balancing policy (basedon the device_has_space and load_balanced rules, with the latter enforcinga LOAD_SPREAD of 20%) in deployments of various sizes. In each experiment, asimulated cluster is modelled with fixed-capacity devices (no more than ten pernode) randomly populated with objects whose sizes and loads are drawn from aPareto distribution, scaled such that no single object exceeds the capacity of a de-vice and the cluster is roughly 65% full. For each configuration we present thetime required to find a zero-cost solution as well as the number of reconfigurationsrequired to achieve the solution, averaged over ten runs. Some experiments requireno reconfigurations because their high object-to-device ratios result in very smallobjects that yield well-balanced load distributions under the initial, uniformly ran-dom placement; the runtimes for these experiments measure only the time requiredto validate the initial configuration.As Table 5.2 shows, the flexibility provided by Python-based rules comes witha downside of relatively high execution times (more than a minute for a systemwith 100K objects and 1K devices). While we believe there is ample opportunityto improve our unoptimized implementation, we have not yet done so, primarilybecause rebalance jobs run in overlapping batches, allowing optimization and ac-tuation tasks to execute in parallel, and actuation times typically dominate.5.4.2 ActuationIn the following experiment we measure actuation performance by demonstrat-ing how Mirador restores redundancy in the face of hardware failures. We pro-91vision four nodes, each with 12 1 TB SSDs, for a total of 48 devices. We deploy1,500 client VMs, each running fio [14] with a configuration modelled after virtualdesktop workloads. VMs issue 4 KiB requests against 1 GiB disks. Requests aredrawn from an 80/20 Pareto distribution with an 80:20 read:write ratio; read andwrite throughputs are rate-limited to 192 KiB/sec and 48 KiB/sec, respectively,with a maximum queue depth of 4, generating an aggregate throughput of roughly100K IOPS.Five minutes into the experiment, we take a device offline and schedule a rebal-ance job. The rplset_durable rule assigns infinite cost to objects placed onfailed devices, forcing reconfigurations, while load-balancing and failure-domainrules prioritize the choice of replacement devices. The job defers actuation untila 15 minute stabilization interval expires so that transient errors do not trigger un-necessary migrations. During this time it inspects more than 118,000 objects, andit eventually rebuilds 3053 in just under 20 minutes, with negligible effect on clientworkloads, as seen in Figure Resource ObjectivesWe now shift our attention to the efficacy of specific placement rules, measuring thedegree to which they can affect client performance in live systems. We first focuson resource-centric placement rules that leverage knowledge of cluster topologyand client configurations to improve performance and simplify lifecycle operations.Topology-Aware PlacementIn this experiment we measure the value of topology-aware placement policies indistributed systems. We deploy four storage nodes and four clients, with eachclient hosting 8 VMs running a FIO workload issuing random 4 KiB reads againstdedicated 2 GiB virtual disks at queue depths ranging between 1 and 32.Figure 5.3a presents the application-perceived latency achieved under three dif-ferent placement policies when VMs issue requests at a queue depth of one. The920 10 20 30 40 50Time (minutes)020406080100120Object Count (thousands)device failureactuation initiatedrebuild completedFailure Recovery Timeline020406080100120Aggregate Client KIOPSObjects InspectedObjects ReconfiguredObjects RebuiltClient IOPSFigure 5.2: Rebuilding replicas after a device failurerandom policy distributes stripes across backend devices using a simple consistenthashing scheme and applies a random one-to-one mapping from clients to storagenodes. This results in a configuration where each node serves requests from ex-actly one client, and with four nodes, roughly 75% of reads access remotely-hostedstripes. This topology-agnostic strategy is simple to implement, and, assumingworkload uniformity, can be expected to achieve even utilization across the cluster,although it does require significant backend network communication. Indeed, asthe number of storage nodes in a cluster increases, the likelihood that any node isable to serve requests locally decreases; in the limit, all requests require a backendRTT. This behavior is captured by the remote policy, which places stripes such thatno node has a local copy of any of the data belonging to the clients it serves. Thelocal policy follows the opposite strategy, placing all stripes for a given VM on asingle node and ensuring that clients connect directly to the nodes hosting their93500 550 600 650 700 750 800 850Request Latency (µsecs)0.0000.0050.0100.0150.0200.0250.0300.035Estimated ProbabilityLatency Distributionslocalrandomremote(a) Latency distributions at queue depth of 11 4 8 16 32VM Queue Depth050K100K150K200K250K300K350KIOPSAggregate IOPSlocalrandomremote(b) Mean throughput at various queue depthsFigure 5.3: Performance under three different placement strategies. The local policyyields a median latency 18% and 22% lower than the random and remote poli-cies, respectively, resulting in an average throughput increase of 26%. (Errorbars in Figure 5.3b give 95% confidence intervals.)940 20 40 60 80 100 120Time (minutes)050K100K150K200K250K300K350KAggregate IOPSnode addednode addednode addedworkloadsdeactivatedloadrebalancedIOPS TimelineFigure 5.4: Mirador responds to changes in cluster topology and workload behavior.Data is immediately migrated to new storage nodes as they are introduced in20 minute increments, starting at time t20; the brief throughput drops are due tocompetition with background data copies. At time t85, two of the four client ma-chines are deactivated; the remaining client load is subsequently redistributed,at which point performance is limited by client resources.data. Notably, all three policies are implemented in less than twenty lines of code,demonstrating the expressiveness of Mirador’s optimization framework.By co-locating VM stripes and intelligently routing client connections, the localpolicy eliminates additional backend RTTs and yields appreciable performanceimprovements, with median latencies 18% and 22% lower than those of the ran-dom and remote policies, respectively. Similar reductions are obtained across allmeasured queue depths, leading to comparable increases in throughput, as shownin Figure 5.3b.Elastic Scale OutIn addition to improving application-perceived performance, minimizing cross-node communication enables linear scale out across nodes. While a random place-ment policy would incur proportionally more network RTTs as a cluster growsin size (potentially consuming oversubscribed cross-rack bandwidth), local place-ment strategies can make full use of new hardware with minimal communicationoverhead. This is illustrated in Figure 5.4, which presents a timeline of aggregateclient IOPS as storage nodes are added to a cluster. At time t0 the cluster is config-ured with a single storage node serving four clients, each hosting 16 VMs issuing95random 4 KiB reads at a queue depth of 32; performance is initially bottleneckedby the limited storage. At time t20, an additional node is introduced, and the place-ment service automatically rebalances the data and client connections to make useof it. It takes just over two minutes to move roughly half the data in the clus-ter onto the new node. This migration is performed as a low-priority backgroundtask to limit interference with client IO. Two additional nodes are added at twentyminute intervals, and in each case, after a brief dip in client performance caused bycompeting migration traffic, throughput increases linearly.The performance and scalability benefits of the local policy are appealing, but tobe practical, this approach requires a truly dynamic placement service. While bothlocal and random policies are susceptible to utilization imbalances caused by non-uniform workload patterns (e.g., workload ‘hot spots’), the problem is exacerbatedin the local case. For example, if all workloads placed on a specific node happento become idle at the same time, that node will be underutilized. Figure 5.4 showsexactly this scenario at time t85, where two clients are deactivated and the nodesserving them sit idle, halving overall throughput. After waiting for workload be-havior to stabilize, the placement service responds to this imbalance by migratingsome of the remaining VMs onto the idle storage, at which point the clients becomethe bottleneck.5.4.4 Workload ObjectivesPlacement policies informed by resource monitoring can provide significant im-provements in performance and efficiency, but they are somewhat reactive in thesense that they must constantly try to ‘catch up’ to changes in workload behavior.In this section we introduce and evaluate several techniques for improving dataplacement based on longitudinal observations of workload behavior.The following examples are motivated by an analysis of hundreds of thousands ofworkload profiles collected from production deployments over the course of morethan a year. The synthetic workloads evaluated here, while relatively simple, reflectsome of the broad patterns we observe in these real-world profiles.96For these experiments, we extend the storage configuration described in § 5.4.3with a disk-based capacity tier. The placement service controls how objects areassigned to flash devices as before; nodes manage the flash cards as LRU cachesand page objects to disk in 512 KiB blocks. We artificially reduce the capacity ofeach flash device to 4 GiB to stress the tiering subsystem. While our evaluation fo-cuses on conventional tiered storage, we note that the techniques presented here areapplicable to a wide variety of hierarchical and NUMA architectures in which ex-pensive, high-performance memories are combined with cheaper, more capaciousalternatives, possibly connected by throughput-limited networks.Footprint-Aware PlacementMany real-world workloads feature working sets (roughly defined as the set ofdata that is frequently accessed over a given period of time) that are much smallerthan their total data sets [36, 124]. Policies that make decisions based only onknowledge of the latter may lead to suboptimal configurations. We show howaugmenting traditional capacity rules with knowledge of working set sizes can leadto improved placement decisions.We begin by deploying eight VMs across two clients connected to a cluster of twonodes. Each VM disk image holds 32 GiB, but the VMs are configured to runrandom 4 KiB read workloads over a fixed subset of the disks, such that workingset sizes range from 500 MiB to 4 GiB. Given two nodes with 8 GiB of flash each,it is impossible to store all 256 GiB of VM data in flash; however, the total workloadfootprint as measured by the analysis service is roughly 17 GiB, and if carefullyarranged, it can fit almost entirely in flash without exceeding the capacity of anysingle device by more than 1 GiB.We measure the application-perceived latency for these VMs in two configurations.In the first, VMs are partitioned evenly among the two nodes using the local policydescribed in § 5.4.3 to avoid network RTTs. In the second, the same placementpolicy is used, but it is extended with one additional rule that discourages configu-rations where combined working set sizes exceed the capacity of a given flash card.970.00000.00050.00100.00150.00200.00250.00300.0035policy = local103 104Request Latency (µsecs)0.00000.00050.00100.00150.00200.00250.00300.0035policy = best fitEstimated ProbabilityLatency DistributionsFigure 5.5: Fitting working sets to flash capacities (‘best fit’) yields a median latencyof 997 µsecs, compared to 2088 µsecs for the ‘local’ policy that eliminatesbackend network RTTs but serves more requests from disk.The cost of violating this rule is higher than the cost of violating the node-localrule, codifying a preference for remote flash accesses over local disk accesses. Thegreedy solver is a good fit for this problem and arrives at a configuration in whichonly one flash device serves a combined working set size larger than its capacity.As Figure 5.5 shows, the best-fit policy results in significantly lower latencies,because the cost of additional network hops is dwarfed by the penalty incurred bycache misses. The purely local policy exhibits less predictable performance and along latency tail because of cumulative queuing effects at the disk tier. This is aclear example of how combining knowledge of the relative capabilities of networklinks and storage tiers with detailed workload profiling can improve placementdecisions.98Noisy Neighbor IsolationWe next introduce four cache-unfriendly workloads each with 4 GiB disks. Theworkloads perform linear scans that, given 4 GiB LRU caches, are always servedfrom disk and result in substantial cache pollution. These workloads make it im-possible to completely satisfy the working set size rule of the previous experiment.We measure the request latency of the original workloads as they compete withthese new cache-unfriendly workloads under two policies: a fair share policy thatdistributes the cache-unfriendly workloads evenly across the flash devices, and anisolation policy that attempts to limit overall cache pollution by introducing a newrule that encourages co-locating cache-unfriendly workloads on common nodes,regardless of whether or not they fit within flash together. As Figure 5.6 shows, thislatter policy exhibits a bimodal latency distribution, with nearly 48% of requestsenjoying latencies less than one millisecond while a handful of ‘victim’ workloadsexperience higher latencies due to contention with cache-unfriendly competitors.The fair share policy, on the other hand, features a more uniform distribution, withall workloads suffering equally, and a median latency more than three times higherthan that of the isolated policy.Workload Co-schedulingFinally, we introduce a technique for leveraging long-term temporal patterns inworkload behavior to improve data placement. We frequently see storage work-loads with pronounced diurnal patterns of high activity at key hours of the dayfollowed by longer periods of idleness. This behavior typically correlates withworkday habits and regularly scheduled maintenance tasks [43, 87, 101]. Similareffects can be seen at much smaller scales in CPU caches, where the strategy ofco-locating applications to avoid contention is called ‘co-scheduling’ [114].We present a simple algorithm for reducing cache contention of periodic work-loads. The workload analysis service maintains an extended time series of thefootprint of each workload, where footprint is defined as the number of uniqueblocks accessed over some time window; in this experiment we use a window of990.00000.00020.00040.00060.00080.00100.00120.00140.0016policy = fair share103 104Request Latency (µsecs)0.00000.00020.00040.00060.00080.00100.00120.00140.0016policy = isolatedEstimated ProbabilityLatency DistributionsFigure 5.6: Isolating cache-unfriendly workloads on a single device yields a medianlatency of 1036 µsecs, compared to 3220 µsecs for the ‘fair’ policy that dis-tributes these workloads uniformly across all devices.ten minutes. Given a set of workloads, we compute the degree to which they con-tend by measuring how much their bursts overlap. Specifically, we model the costof co-locating two workloads W1 and W2 with corresponding footprint functionsf1(t) and f2(t) as∫min( f1(t), f2(t)). We use this metric to estimate the cost ofplacing workloads together on a given device, and employ a linear first-fit algo-rithm [39] to search for an arrangement of workloads across available devices thatminimizes the aggregate cost. Finally, we introduce the co_scheduled rule whichencodes an affinity for assignments that match this arrangement.We evaluate this heuristic by deploying 8 VMs with 4 GiB disks across two storagenodes each with two 4 GiB flash devices. The VMs perform IO workloads featur-ing periodic hour-long bursts of random reads followed by idle intervals of roughly3 hours, with the periodic phases shifted in some VMs such that not all workloadsare active at the same time. The combined footprint of any two concurrent bursts100103 104 105 106Request Latency (µsecs) CDFsoptimalpessimalfirst-fitrandomFigure 5.7: Co-scheduling periodic workloadsexceeds the size of any single flash device, and if co-located, will incur significantpaging. We measure request latency under a number of different configurations:random, in which stripes are randomly distributed across devices, optimal and pes-simal, in which VMs are distributed two to a device so as to minimize and maximizecontention, respectively, and first-fit, as described above.Figure 5.7 plots latency CDFs for each of these configurations. The penalty ofconcurrent bursts is evident from the pronounced disparity between the optimaland pessimal cases; in the latter configuration, contention among co-located work-loads is high, drastically exceeding the available flash capacity. The first-fit ap-proximation closely tracks optimal in the first two quartiles but performs more likerandom in the last two, suggesting room for improvement either by developing amore sophisticated search algorithm or responding more aggressively to workloadchanges.101101 102 103 104 105 106 107Objects Inspected100101102103104105Optimization Time (seconds) Optimization Time vs Objects InspectedFigure 5.8: Optimization time versus objects inspected5.5 ExperienceTo see how Mirador performs in real-world environments, we sample logs detailingmore than 8,000 rebalance jobs in clusters installed across nearly 50 customer sitesand ranging in size from 8 to 96 devices. Figure 5.8 illustrates how time spentin the optimization stage scales in proportion to the number of objects inspected;these measurements include rate-limiting delays imposed to prevent Mirador fromimpacting client workloads when reading metadata. Figure 5.9 plots the number ofobserved violations against the number of objects inspected per job, and highlightsjobs that fail to find a zero-cost solution after a single optimization pass. Thisoccurs in only 2.5% of sampled jobs in which objective functions are violated, andin 71% of these cases, no zero-cost solutions are possible due to environmentalcircumstances (some log samples cover periods in which devices were intentionallytaken offline for testing or maintenance).We have found Mirador’s flexibility and extensibility to be two of its best attributes.Over the nearly 18 months in which it has been in production, we have adapted it102101 102 103 104 105 106Objects Inspected100101102103104105Violations ObservedViolations Observed vs Objects Inspectedzero costnon-zero costFigure 5.9: Violations observed versus objects inspected (jobs where no zero-costsolution was found after a single optimization round are marked with a red x)to new replication policies and storage architectures simply by modifying existingrules and adding new ones. It has also been straightforward to extend Miradorto support new functionality: in addition to providing capacity balancing acrossstorage devices and network links, it now plays a central role in cluster expansion,hardware retirement, failure recovery, health monitoring, and disk scrubbing fea-tures. For example, upon discovering an invalid data checksum, our disk scrubbingservice simply marks the affected object as corrupt and notifies the placement ser-vice, where a custom rule forces the migration of marked objects to new locations,effectively rebuilding them from valid replicas in the process.Our deployment strategy to date has been conservative: we ship a fixed set of rules(currently seven) and control how and when they are used. Assigning appropriatecosts to rules requires domain knowledge, since rules often articulate conflictingobjectives and poorly chosen costs can lead to unintended behavior. As an example,if solvers fail to identify a zero-cost solution, they yield the one with the lowestaggregate cost – if multiple rules conflict for a given assignment, the assignment103which minimizes the overall cost is chosen. It is thus important to know whichobjective functions a replica set may violate so that high priority rules are assignedcosts sufficiently large enough to avoid priority inversion in the face of violationsof multiple lower-priority rules.While objective functions neatly encapsulate individual placement goals and arerelatively easy to reason about, comprehensive policies are more complex and mustbe carefully vetted. We validate rules, both in isolation and combination, with hun-dreds of policy tests. Declarative test cases specify a cluster configuration andinitial data layout along with an expected optimization plan; the test harness gen-erates a storage system model from the specification, invokes the planning engine,and validates the output. We have also built a fuzz tester that can stress policiesin unanticipated ways. The test induces a sequence of random events (such asthe addition and removal of nodes, changes in load, etc.) and invokes the policyvalidation tool after each step. Any cluster configuration that generates a policyviolation is automatically converted into a test case to be added to the regressionsuite after the desired behavior is determined by manual inspection. Validating anynon-trivial placement policy can require a fair amount of experimentation, but inour experience, the cost-based framework provided by Mirador provides knobs thatgreatly simplify this task.In production, rebalance jobs run in two passes: the first enforces critical rulesrelated to redundancy and fault tolerance, while the second additionally enforcesrules related to load-balancing and performance. This is done because the planningengine must inspect objects in batches (batches are limited to roughly 10,000 ob-jects to keep memory overheads constant), and we want to avoid filling a device inan early batch in order to satisfy low-priority rules when that same device may benecessary to satisfy higher-priority rules in a later batch.Early testing revealed the importance of carefully tuning data migration rates. Ourmigration service originally provided two priorities, with the higher of these in-tended for failure scenarios in which replicas need to be rebuilt. In practice, how-ever, we found that such failures place additional stress on the system, often driv-ing latencies up. Introducing high-priority migration traffic in these situations can104lead to timeouts that only make things worse, especially under load. We havesince adopted a single migration priority based on an adaptive queuing algorithmthat aims to isolate migration traffic as much as possible while ensuring forwardprogress is made.5.6 Related WorkResearchers have proposed a wide variety of strategies for addressing the dataplacement problem, also known as the file assignment problem [40]. Determinis-tic approaches are common in large-scale systems [88, 94, 108, 115, 117] becausethey are decentralized and impose minimal metadata overheads, and they achieveprobabilistically uniform load distribution for large numbers of objects [96, 100].Consistent hashing [64] provides relatively stable placement even as storage tar-gets are added and removed [51, 130]. Related schemes offer refinements like theability to prioritize storage targets and modify replication factors [57, 58, 116], butthese approaches are intrinsically less flexible than dynamic policies.Non-deterministic strategies maintain explicit metadata in order to locate data.Some of these systems employ random or semi-random placement policies for thesake of simplicity and scalability [70, 90, 95], but others manage placement withhard-coded policies [49, 104]. Customized policies provide better control overproperties such as locality and fault tolerance, which can be particularly importantas clusters expand across racks [63].Explicit metadata also make it easier to perform fine-grain migrations in responseto topology and workload changes, allowing systems to redistribute load and ame-liorate hot spots [73, 87]. Hierarchical Storage Management and multi-tier sys-tems dynamically migrate data between heterogeneous devices, typically employ-ing policies based on simple heuristics intended to move infrequently accessed datato cheaper, more capacious storage or slower, more compact encodings [4, 119].Mirador has much in common with recent systems designed to optimize specificperformance and efficiency objectives. Guerra et al. [53] describe a tiering systemthat makes fine-grain placement decisions to reduce energy consumption in SANs105by distributing workloads among the most power-efficient devices capable of sat-isfying measured performance requirements. Janus [6] is a cloud-scale system thatuses an empirical cacheability metric to arrange data across heterogeneous media ina manner that maximizes reads from flash, using linear programming to computeoptimal layouts. Volley [2] models latency and locality using a weighted springanalogy and makes placement suggestions for geographically distributed cloud ser-vices. Tuba [12] is a replicated key-value store designed for wide area networksthat allows applications to specify latency and consistency requirements via servicelevel agreements (SLAs). It collects hit ratios and latency measurements and peri-odically reconfigures replication and placement settings to maximize system utility(as defined by SLAs) while honoring client-provided constraints on properties likedurability and cost. Mirador supports arbitrary cost-function optimizations using ageneric framework and supports policies that control network flows as well as dataplacement.Mirador also resembles resource planning systems [8, 11] like Hippodrome [10],which employ a similar observe/optimize/actuate pipeline to design cost-efficientstorage systems. Given a set of workload descriptions and an inventory of avail-able hardware, these tools search for low-cost array configurations and data layoutsthat satisfy performance and capacity requirements. Like Mirador, they simplifya computationally challenging multidimensional bin-packing problem by combin-ing established optimization techniques with domain-specific heuristics. However,while these systems employ customized search algorithms with built-in heuristics,Mirador codifies heuristics as rules with varying costs and relies on generic solversto search for low-cost solutions, making it easier to add new heuristics over time.Ursa Minor [1] is a clustered storage system that supports dynamically config-urable m-of-n erasure codes, extending the data placement problem along multiplenew dimensions. Strunk et al. [110] describe a provisioning tool for this systemthat searches for code parameters and data layouts that maximize user-defined util-ity for a given set of workloads, where utility quantifies metrics such as availability,reliability, and performance. Utility functions and objective functions both provideflexibility when evaluating potential configurations; however, Mirador’s greedy al-gorithm and support for domain-specific hints may be more appropriate for online106rebalancing than the randomized genetic algorithm proposed by Strunk et al.5.7 ConclusionMirador is a placement service designed for heterogeneous distributed storage sys-tems. It leverages the high throughput of non-volatile memories to actively migratedata in response to workload and environmental changes. It supports flexible, ro-bust policies composed of simple objective functions that specify strategies forboth data and network placement. Combining ideas from constraint satisfactionwith domain-specific language bindings and APIs, it searches a high-dimension so-lution space for configurations that yield performance and efficiency gains overmore static alternatives.107Chapter 6ConclusionAs a commercial product, one of the features that sets Strata apart from its manycompetitors is the architectural support it provides for dynamic cluster reconfigura-tion. This is valuable for a number of reasons. First, it abolishes the much-loathedfive year refresh cycle imposed by many incumbent vendors. Allowing adminis-trators to expand clusters in response to growing demand relieves them of the bur-den of estimating at purchase time what their storage requirements will be manyyears down the road. And supporting rolling upgrades and heterogeneous clusterseliminates the need for disruptive ‘forklift’ upgrades in which existing systems aremigrated to new hardware en masse. Second, deferring purchases until hardware isactually needed can dramatically reduce capital and operating expenses, both by al-lowing Moore’s Law to accrue longer before money is exchanged, and by reducingthe number of devices that sit idle in initially over-provisioned systems. Finally,the ability to provision performance and capacity independently gives storage ad-ministrators the flexibility they need to adapt to changing requirements within thedata center.These advantages are natural consequences of the design advocated in this thesis.The platform provided by Strata decouples logical resources from physical hard-ware and separates control- and data-path logic, enabling dynamic configurationchanges without degrading performance, and the robust policy engine provided108by Mirador arranges for hardware resources to be allocated where they are mostneeded. This paradigm of abstraction, analysis, and actuation helps systems to au-tomatically respond to changes in workload behavior and hardware configurations,a valuable capability in data center environments serving diverse workloads acrosslarge, heterogeneous clusters. It has been incredibly rewarding to see this approachsucceed in real customer deployments, but it has also been instructive to observesome of its limitations. Indeed, there is still ample opportunity – and need – tocontinue innovating storage software, especially as hardware continues to evolve.Below I enumerate what I see as some of the most interesting directions for futureimprovements, some of which we have already begun to explore.Volume Management Strata’s departure from traditional aggregated designs wasa response to the unprecedented performance of new PCIe flash devices like theIntel 910, which provides 800 GB of storage and serves 180,000 random read re-quests per second. Three years after we published the Strata paper, the Intel p3700,providing 2 TB of storage and serving 460,000 random read requests per second,hit the market at roughly the same price as the original 910. This rapid rate ofprogress reinforces many of the design choices we made, particularly regarding theneed to efficiently virtualize hardware in support of dynamic workload multiplex-ing. But these new devices place even more stringent constraints on the data path:access latencies have dropped from 65 microseconds in the 910 to 20 microsec-onds in the p3700, and NVDIMM modules currently operate at latencies of just 10nanoseconds. At these speeds, software overheads imposed by context switchesand thread synchronization become problematic. In response, we built Decibel,a device virtualization layer designed to completely eliminate cross-core commu-nication along the data path. Decibel’s disaggregated architecture is similar toStrata’s, but rather than presenting individual devices over the network, it presentsa volume abstraction that encapsulates storage, network, and compute resources.Decibel volumes bind chunks of storage to dedicated cores and NIC queues; flowsteering based on an explicit network addressing scheme ensures that client re-quests are automatically directed to the appropriate cores, eliminating the need forforwarding or synchronization in software. This, combined with a userspace net-109working stack that bypasses kernel scheduling and context switches, allows Deci-bel to serve remote workloads from p3700 devices at saturation with an overheadrelative to local access of just 20 microseconds.Hybrid Placement Decibel both refines and complements Strata’s separation ofcontrol- and data-path logic, and it naturally benefits from the optimization tech-niques in Mirador that correct load imbalances and mitigate hot spots. However,while a centralized placement engine simplifies the difficult task of optimizing re-source allocation, it also presents some challenges, particularly when scaling tovery large deployments with billions of objects. Individually optimizing the place-ment of so many objects can be prohibitively expensive. Fortunately, the tech-niques employed by Mirador can naturally be combined with less computation-ally expensive approaches like statistical multiplexing to good effect. Under thisregime, a deterministic policy such as consistent hashing can be used to decidethe default placement of the vast majority of objects, while dynamic optimizationtechniques can be applied only to objects that actively contribute to performanceand utilization problems. Strata’s clean separation of addressing and placement fa-cilities would naturally accommodate this hybrid approach, improving scalabilitywithout sacrificing flexibility.Demand Swap Optimizing the placement of data in heterogeneous clusters is par-ticularly challenging because of the huge performance variations across devices.The demand fault strategy conventionally used by cache replacement policies canlead to surprisingly poor performance when the combined size of active workingsets is even marginally larger than the available fast storage. This technique hasa tendency to penalize many workloads a small amount, which can become prob-lematic as data dependencies exacerbate the effects of even a few cache misses perworkload. The data we have collected from real-world deployments of productionvirtual machines, which comprises thousands of workload-years of detailed profil-ing, suggests that a better approach might be to swap entire workloads in and outof fast storage as they cycle between active and idle phases, which regularly last110hours at a time. Preliminary investigation confirms that phase changes are easilyidentified via counter stack analysis and that they can be predicted with fairly highconfidence for a large class of workloads. We have further found that online clas-sifiers generally identify active phases less than a minute after they begin. Giventhat we can reasonably expect to load entire working sets, which are typically onthe order of a few dozens of gigabytes, from disk in a matter of seconds (so longas transfers are carefully scheduled across large numbers of spindles), the idea ofswapping entire workloads in and out of fast storage, either reactively or specu-latively, is alluring. This would leverage the sequential throughput of disks muchmore effectively than the demand fault approach, and would additionally make iteasier to isolate ill-behaved or under-provisioned workloads.Programmable Storage Heterogeneous clusters expose a tension between costand performance. In many cases, purely economic constraints make this tensioninevitable. However, in our experience, providing predictable performance is of-ten more important than achieving device-rate speeds. Mirador uses a number ofheuristics to attempt to automatically infer the optimal allocation of resources atany given time, but these heuristics do not always align with the business needs ofindividual customers. In situations where resources are scarce, it may be preferableto delegate allocation decisions higher up the stack, either to application developersor storage administrators. This is in keeping with recent trends in software designthat have shifted traditional storage responsibilities like replication and consistencyto application-level services like key/value stores and databases. These services un-derstand the performance and placement requirements of their data better than theunderlying storage system, so providing them with an interface for safely influ-encing resource allocation decisions, while protecting against buggy and maliciousapplications, could present new opportunities to improve performance and elimi-nate unwelcome surprises. Mirador’s support for arbitrary soft and hard constraintsprovides a good starting point for this approach; exposing more of this functionalityto applications would extend many of the benefits introduced by software definednetworking to the storage domain.Strata provides a solid platform for exploring these and other techniques because111of the design abstractions it provides. Implementing these abstractions in an en-terprise storage product has been a labor-intensive task, but one that has yieldedmany benefits. In addition to producing a system that solves real problems for ourcustomers, it has provided an opportunity to explore novel techniques for optimiz-ing performance and efficiency within the data center, and its organizing principlesoffer a useful model for future system designers.112Bibliography[1] M. Abd-El-Malek, W. V. C. Courtright II, C. Cranor, G. R. Ganger,J. Hendricks, A. J. Klosterman, M. P. Mesnier, M. Prasad, B. Salmon, R. R.Sambasivan, S. Sinnamohideen, J. D. Strunk, E. Thereska, M. Wachs, andJ. J. Wylie. Ursa minor: Versatile cluster-based storage. In Proceedings ofthe 4th USENIX Conference on File and Storage Technologies, FAST ’05.USENIX, 2005. → pages 42, 77, 106[2] S. Agarwal, J. Dunagan, N. Jain, S. Saroiu, and A. Wolman. Volley:Automated data placement for geo-distributed cloud services. InProceedings of the 7th USENIX Conference on Networked Systems Designand Implementation, NSDI ’10, pages 17–32. USENIX Association, 2010.→ pages 106[3] G. Aggarwal, R. Motwani, and A. Zhu. The load rebalancing problem.Journal of Algorithms, 60(1):42–59, 2006. → pages 86[4] M. K. Aguilera, K. Keeton, A. Merchant, K.-K. Muniswamy-Reddy, andM. Uysal. Improving recoverability in multi-tier storage systems. InProceedings of the 37th Annual IEEE/IFIP International Conference onDependable Systems and Networks, pages 677–686. IEEE ComputerSociety, 2007. → pages 105[5] A. Akel, A. M. Caulfield, T. I. Mollov, R. K. Gupta, and S. Swanson.Onyx: a protoype phase change memory storage array. In Proceedings ofthe 3rd USENIX Conference on Hot Topics in Storage and File Systems,HotStorage ’11, pages 2–2, Berkeley, CA, USA, 2011. USENIXAssociation. → pages 41[6] C. Albrecht, A. Merchant, M. Stokely, M. Waliji, F. Labelle, N. Coehlo,X. Shi, and E. Schrock. Janus: Optimal flash provisioning for cloud storage113workloads. In USENIX Annual Technical Conference, ATC ’13, pages91–102. USENIX Association, 2013. → pages 106[7] G. S. Almási, C. Cas¸caval, and D. A. Padua. Calculating stack distancesefficiently. In Proceedings of the 2002 Workshop on Memory SystemPerformance, MSP ’02, pages 37–43, 2002. → pages 47[8] G. A. Alvarez, E. Borowsky, S. Go, T. H. Romer, R. A. Becker-Szendy,R. A. Golding, A. Merchant, M. Spasojevic, A. C. Veitch, and J. Wilkes.Minerva: An automated resource provisioning tool for large-scale storagesystems. ACM Transactions on Computer Systems, 19(4):483–518, 2001.→ pages 77, 106[9] D. G. Andersen, J. Franklin, M. Kaminsky, A. Phanishayee, L. Tan, andV. Vasudevan. Fawn: a fast array of wimpy nodes. In Proceedings of the22nd ACM SIGOPS Symposium on Operating Systems Principles, SOSP’09, pages 1–14, 2009. → pages 41[10] E. Anderson, M. Hobbs, K. Keeton, S. Spence, M. Uysal, and A. C. Veitch.Hippodrome: Running circles around storage administration. InProceedings of the 1st USENIX Conference on File and StorageTechnologies, FAST ’02, pages 175–188. USENIX, 2002. → pages 77, 106[11] E. Anderson, S. Spence, R. Swaminathan, M. Kallahalla, and Q. Wang.Quickly finding near-optimal storage designs. ACM Transactions onComputer Systems, 23(4):337–374, 2005. → pages 77, 106[12] M. S. Ardekani and D. B. Terry. A self-configurable geo-replicated cloudstorage system. In Proceedings of the 11th USENIX Conference onOperating Systems Design and Implementation, OSDI ’14, pages 367–381.USENIX Association, 2014. → pages 106[13] K. Asanovic. Firebox: A hardware building block for 2020warehouse-scale computers. Keynote presentation, 12th USENIXConference on File and Storage Technologies (FAST ’14), 2014. → pages75[14] J. Axboe. Fio–flexible I/O tester, 2011. https://github.com/axboe/fio.Visited July 2017. → pages 35, 71, 92[15] K. Bailey, L. Ceze, S. D. Gribble, and H. M. Levy. Operating systemimplications of fast, cheap, non-volatile memory. In Proceedings of the13th USENIX Conference on Hot Topics in Operating Systems, HotOS’13,pages 2–2, Berkeley, CA, USA, 2011. USENIX Association. → pages 41114[16] M. Balakrishnan, D. Malkhi, V. Prabhakaran, T. Wobber, M. Wei, and J. D.Davis. Corfu: a shared log design for flash clusters. In Proceedings of the9th USENIX Conference on Networked Systems Design andImplementation, NSDI’12, 2012. → pages 41[17] S. Bansal and D. S. Modha. CAR: Clock with adaptive replacement. InProceedings of the 4th USENIX Conference on File and StorageTechnologies, FAST ’04, pages 187–200, 2004. → pages 70[18] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho,R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization.In Proceedings of the 19th ACM symposium on Operating systemsprinciples, SOSP ’03, pages 164–177, 2003. ISBN 1-58113-757-5. →pages 40[19] J. Barr. Amazon EBS (elastic block store) - bring us your data, August2008. https://aws.amazon.com/blogs/aws/amazon-elastic. Visited July2017. → pages 77[20] B. T. Bennett and V. J. Kruskal. LRU stack processing. IBM Journal ofResearch and Development, 19(4):353–357, 1975. → pages 50, 72[21] M. Blaze. NFS tracing by passive network monitoring. In Proceedings ofthe USENIX Winter 1992 Technical Conference, pages 333–343, 1992. →pages 46[22] B. H. Bloom. Space/time trade-offs in hash coding with allowable errors.Communications of the ACM, 13(7):422–426, 1970. → pages 55[23] A. D. Brunelle. Block I/O Layer Tracing: blktrace. HP, Gelato-Cupertino,CA, USA, 2006. → pages 46[24] M. Burtscher, I. Ganusov, S. J. Jackson, J. Ke, P. Ratanaworabhan, andN. B. Sam. The vpc trace-compression algorithms. IEEE Transactions onComputers, 54(11):1329–1344, 2005. → pages 73[25] B. Calder, J. Wang, A. Ogus, N. Nilakantan, A. Skjolsvold, S. McKelvie,Y. Xu, S. Srivastav, J. Wu, H. Simitci, J. Haridas, C. Uddaraju, H. Khatri,A. Edwards, V. Bedekar, S. Mainali, R. Abbasi, A. Agarwal, M. F. u. Haq,M. I. u. Haq, D. Bhardwaj, S. Dayanand, A. Adusumilli, M. McNett,S. Sankaran, K. Manivannan, and L. Rigas. Windows azure storage: ahighly available cloud storage service with strong consistency. InProceedings of the 23rd ACM Symposium on Operating SystemsPrinciples, SOSP ’11, pages 143–157, 2011. → pages 42115[26] M. Casado, T. Garfinkel, A. Akella, M. J. Freedman, D. Boneh,N. McKeown, and S. Shenker. Sane: a protection architecture for enterprisenetworks. In Proceedings of the 15th USENIX Security Symposium, SS ’06,Berkeley, CA, USA, 2006. USENIX Association. → pages 26[27] M. Casado, M. J. Freedman, J. Pettit, J. Luo, N. Mckeown, and S. Shenker.Ethane: Taking control of the enterprise. In Proceedings of the 2007Conference on Applications, Technologies, Architectures, and Protocols forComputer Communications. ACM, 2007. → pages 26[28] A. M. Caulfield, A. De, J. Coburn, T. I. Mollow, R. K. Gupta, andS. Swanson. Moneta: A high-performance storage array architecture fornext-generation, non-volatile memories. In Proceedings of the 2010 43rdAnnual IEEE/ACM International Symposium on Microarchitecture,MICRO ’43, pages 385–395, 2010. → pages 41[29] A. M. Caulfield, T. I. Mollov, L. A. Eisner, A. De, J. Coburn, andS. Swanson. Providing safe, user space access to fast, solid state disks. InProceedings of the 17th International Conference on Architectural Supportfor Programming Languages and Operating Systems, ASPLOS XVII,pages 387–400, 2012. → pages 41[30] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows,T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storagesystem for structured data. ACM Transactions on Computer Systems, 26(2):4:1–4:26, June 2008. → pages 42[31] Y. Chen, K. Srinivasan, G. Goodson, and R. Katz. Design implications forenterprise storage systems via multi-dimensional trace analysis. InProceedings of the 23rd ACM Symposium on Operating SystemsPrinciples, SOSP ’11, pages 43–56. ACM, 2011. → pages 73[32] J. Coburn, A. M. Caulfield, A. Akel, L. M. Grupp, R. K. Gupta, R. Jhala,and S. Swanson. Nv-heaps: making persistent objects fast and safe withnext-generation, non-volatile memories. In Proceedings of the 16thInternational Conference on Architectural Support for ProgrammingLanguages and Operating Systems, ASPLOS XVI, pages 105–118, NewYork, NY, USA, 2011. ACM. → pages 42[33] J. Condit, E. B. Nightingale, C. Frost, E. Ipek, B. Lee, D. Burger, andD. Coetzee. Better i/o through byte-addressable, persistent memory. InProceedings of the ACM SIGOPS 22nd Symposium on Operating Systems116Principles, SOSP ’09, pages 133–146, New York, NY, USA, 2009. ACM.→ pages 41[34] B. Cully, J. Wires, D. T. Meyer, K. Jamieson, K. Fraser, T. Deegan,D. Stodden, G. Lefebvre, D. Ferstay, and A. Warfield. Strata: scalablehigh-performance storage on virtualized non-volatile memory. InProceedings of the 12th USENIX Conference on File and StorageTechnologies, FAST ’14, pages 17–31. USENIX, 2014. → pages iv, 9, 14,45, 76[35] C. Delimitrou, S. Sankar, K. Vaid, and C. Kozyrakis. Decouplingdatacenter studies from access to large-scale applications: A modelingapproach for storage workloads. In Proceedings of the 2011 IEEEInternational Symposium on Workload Characterization, IISWC ’11, pages51–60. IEEE, 2011. → pages 73[36] P. Denning. working set model of program behavior. Communications ofthe ACM, 1968. → pages 47, 97[37] C. Ding. Program locality analysis tool, 2014.https://github.com/dcompiler/loca. Visited July 2017. → pages 62[38] C. Ding and Y. Zhong. Predicting whole-program locality through reusedistance analysis. In Proceedings of the ACM SIGPLAN 2003 Conferenceon Programming Language Design and Implementation, PLDI ’03, pages245–257. ACM, 2003. → pages 47, 50, 72[39] G. Dósa. The tight bound of first fit decreasing bin-packing algorithm isFFD(i) <= 11/9OPT(i) + 6/9. In ESCAPE, volume 4614 of Lecture Notesin Computer Science, pages 1–11. Springer, 2007. → pages 100[40] L. W. Dowdy and D. V. Foster. Comparative models of the file assignmentproblem. ACM Computing Surveys, 14(2):287–313, 1982. → pages 105[41] Z. Drudi, N. J. A. Harvey, S. Ingram, A. Warfield, and J. Wires.Approximating hit rate curves using streaming algorithms. In Proceedingsof the 18th International Workshop on Approximation Algorithms forCombinatorial Optimization Problems, APPROX ’15, pages 225–241,2015. → pages 9, 55[42] D. Eklov and E. Hagersten. StatStack: Efficient modeling of LRU caches.In Proceedings of the 2010 IEEE International Symposium on PerformanceAnalysis of Systems & Software, ISPASS ’10, pages 55–65. IEEE, 2010. →pages 53, 72117[43] D. Ellard, J. Ledlie, P. Malkani, and M. I. Seltzer. Passive nfs tracing ofemail and research workloads. In Proceedings of the 2nd USENIXConference on File and Storage Technologies, FAST ’03. USENIX, 2003.→ pages 68, 99[44] EMC. DSSD D5, 2016.https://www.emc.com/en-us/storage/flash/dssd/dssd-d5/index.htm. VisitedJuly 2017. → pages 75[45] D. R. Engler, M. F. Kaashoek, and J. O’Toole, Jr. Exokernel: an operatingsystem architecture for application-level resource management. InProceedings of the 15th ACM Symposium on Operating Systems Principles,SOSP ’95, pages 251–266, 1995. → pages 40[46] B. Fitzpatrick. Distributed caching with memcached. Linux Journal, 2004(124):5–, Aug. 2004. → pages 42[47] P. Flajolet, É. Fusy, O. Gandouet, and F. Meunier. HyperLogLog: theanalysis of a near-optimal cardinality estimation algorithm. In Proceedingsof the 2007 Conference on Analysis of Algorithms, AofA ’07, pages127–146, 2007. → pages 45, 55[48] E. Freuder. A sufficient condition for backtrack-free search.Communications of the ACM, 29(1):24–32, 1982. → pages 85[49] S. Ghemawat, H. Gobioff, and S.-T. Leung. The google file system. InACM SIGOPS Operating Systems Review, volume 37, pages 29–43, 2003.→ pages 105[50] G. A. Gibson, K. Amiri, and D. F. Nagle. A case for network-attachedsecure disks. Technical Report CMU-CS-96-142, Carnegie-MellonUniversity.Computer science. Pittsburgh (PA US), Pittsburgh, 1996. →pages 17[51] A. Goel, C. Shahabi, S.-Y. D. Yao, and R. Zimmermann. SCADDAR: Anefficient randomized technique to reorganize continuous media blocks. InProceedings of the 18th International Conference on Data Engineering,ICDE ’02, pages 473–482. IEEE, 2002. → pages 105[52] R. L. Graham. Bounds on multiprocessing anomalies and related packingalgorithms. In Proceedings of the May 16-18, 1972, Spring Joint ComputerConference, AFIPS ’72 (Spring), pages 205–217, New York, NY, USA,1972. ACM. → pages 86118[53] J. Guerra, H. Pucha, J. S. Glider, W. Belluomini, and R. Rangaswami. Costeffective storage using extent based dynamic tiering. In Proceedings of the9th USENIX Conference on File and Storage Technologies, FAST ’11,pages 273–286. USENIX, 2011. → pages 105[54] R. Haralick and G. Elliot. Increasing tree search efficiency for constraintsatisfaction problems. Artificial Intelligence, 14(3):263–313, 1980. →pages 85[55] E. Hebrard, B. Hnich, B. O’Sullivan, and T. Walsh. Finding diverse andsimilar solutions in constraint programming. In Proceedings of the 20thNational Conference on Artificial Intelligence, AAAI ’05, pages 372–377.MIT Press, 2005. → pages 85[56] D. Hildebrand and P. Honeyman. Exporting storage systems in a scalablemanner with pnfs. In Proceedings of the 22nd IEEE/13th NASA GoddardConference on Mass Storage Systems and Technologies, MSST ’05, 2005.→ pages 42[57] R. J. Honicky and E. L. Miller. A fast algorithm for online placement andreorganization of replicated data. In Proceedings of the 17th InternationalSymposium on Parallel and Distributed Processing, IPDPS ’03, page 57.IEEE Computer Society, 2003. → pages 105[58] R. J. Honicky and E. L. Miller. Replication under scalable hashing: Afamily of algorithms for scalable decentralized data distribution. InProceedings of the 18th International Symposium on Parallel andDistributed Processing, IPDPS ’04. IEEE Computer Society, 2004. →pages 105[59] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. Zookeeper: Wait-freecoordination for internet-scale systems. In Proceedings of the 2010USENIX Annual Technical Conference, ATC ’10. USENIX Association,2010. → pages 80[60] N. C. Hutchinson and L. L. Peterson. The x-kernel: An architecture forimplementing network protocols. IEEE Transactions on SoftwareEngineering, 17(1):64–76, Jan. 1991. → pages 21[61] B. Jacob, P. Larson, B. Leitao, and S. da Silva. SystemTap: instrumentingthe Linux kernel for analyzing performance and functional problems. IBMRedbook, 2008. → pages 46119[62] S. T. Jones, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Geiger:Monitoring the buffer cache in a virtual machine environment. InProceedings of the 12th International Conference on Architectural Supportfor Programming Languages and Operating Systems, ASPLOS XII, pages14–24. ACM, 2006. → pages 50[63] S. Kandula, S. Sengupta, A. Greenberg, P. Patel, and R. Chaiken. Thenature of data center traffic: measurements & analysis. In Proceedings ofthe 9th ACM SIGCOMM Conference on Internet Measurement, IMC ’09,pages 202–208, New York, NY, USA, 2009. ACM. → pages 105[64] D. Karger, E. Lehman, T. Leighton, R. Panigrahy, M. Levine, andD. Lewin. Consistent hashing and random trees: distributed cachingprotocols for relieving hot spots on the world wide web. In Proceedings ofthe 29th Annual ACM Symposium on Theory of Computing, STOC ’97,pages 654–663, New York, NY, USA, 1997. ACM. → pages 24, 105[65] S. R. Kashyap, S. Khuller, Y.-C. J. Wan, and L. Golubchik. Fastreconfiguration of data placement in parallel disks. In Proceedings of theMeeting on Algorithm Engineering and Experiments, ALENEX ’06, pages95–107. SIAM, 2006. → pages 88[66] S. Khuller, Y. A. Kim, and Y.-C. J. Wan. Algorithms for data migrationwith cloning. In Proceedings of the 22nd ACM SIGMOD-SIGACT-SIGARTSymposium on Principles of Database Systems, PODS ’03, pages 27 – 36,2003. → pages[67] S. Khuller, Y.-A. Kim, and A. Malekian. Improved approximationalgorithms for data migration. Algorithmica, 63(1-2):347–362, 2012. →pages 88[68] S. Kim, D. Chandra, and Y. Solihin. Fair cache sharing and partitioning ina chip multiprocessor architecture. In Proceedings of the 13th InternationalConference on Parallel Architectures and Compilation Techniques, pages111–122. IEEE Computer Society, 2004. → pages 47[69] E. Kohler, R. Morris, B. Chen, J. Jannotti, and M. F. Kaashoek. The clickmodular router. ACM Transactions on Computer Systems, 18(3):263–297,Aug. 2000. → pages 21[70] J. Kubiatowicz, D. Bindel, Y. Chen, S. Czerwinski, P. Eaton, D. Geels,R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C. Wells, and120B. Zhao. OceanStore: an architecture for global-scale persistent storage.SIGPLAN Notices, 35(11):190–201, 2000. → pages 105[71] E. K. Lee and C. A. Thekkath. Petal: distributed virtual disks. InProceedings of the 7th International Conference on Architectural Supportfor Programming Languages and Operating Systems, ASPLOS VII, pages84–92, 1996. → pages 41[72] A. W. Leung, S. Pasupathy, G. R. Goodson, and E. L. Miller. Measurementand analysis of large-scale network file system workloads. In USENIXAnnual Technical Conference, ATC ’08, pages 213–226, 2008. → pages 68[73] L. Lin, Y. Zhu, J. Yue, Z. Cai, and B. Segee. Hot random off-loading: Ahybrid storage system with dynamic data migration. In Proceedings of the2011 IEEE 19th Annual International Symposium on Modelling, Analysis,and Simulation of Computer Telecommunication Systems, MASCOTS ’11,pages 318–325. IEEE Computer Society, 2011. → pages 105[74] Linux Device Mapper Resource Page. http://sourceware.org/dm/. VisitedJuly 2017. → pages 21[75] Linux Logical Volume Manager (LVM2) Resource Page.http://sourceware.org/lvm2/. Visited July 2017. → pages 21[76] T. Luo, S. Ma, R. Lee, X. Zhang, D. Liu, and L. Zhou. S-cave: Effectivessd caching to improve virtual machine storage performance. In ParallelArchitectures and Compilation Techniques, PACT ’13, pages 103–112,2013. → pages 41[77] A. K. Mackworth. Consistency in networks of relations. ArtificialIntelligence, 8(1):99–118, 1977. → pages 85[78] R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger. Evaluationtechniques for storage hierarchies. IBM Systems journal, 9(2):78–117,1970. → pages 46, 47, 50, 62, 72[79] N. Megiddo and D. S. Modha. ARC: A self-tuning, low overheadreplacement cache. In Proceedings of the 2nd USENIX Conference on Fileand Storage Technologies, FAST ’03, pages 115–130, 2003. → pages 70[80] D. T. Meyer, B. Cully, J. Wires, N. C. Hutchinson, and A. Warfield. Blockmason. In Proceedings of the First Conference on I/O Virtualization,WIOV ’08, 2008. → pages 21121[81] D. T. Meyer, B. Cully, J. Wires, N. C. Hutchinson, and A. Warfield. Blockmason. In First Workshop on I/O Virtualization, WIOV ’08. USENIXAssociation, 2008. → pages 9[82] D. T. Meyer, M. Shamma, J. Wires, Q. Zhang, N. C. Hutchinson, andA. Warfield. Fast and cautious evolution of cloud storage. In Proceedingsof the 2nd USENIX Workshop on Hot Topics in Storage and File Systems,HotStorage ’10. USENIX Association, 2010. → pages 9[83] P. Mochel. The sysfs Filesystem. In Linux Symposium, page 313, 2005. →pages 46[84] D. Mosberger and L. L. Peterson. Making paths explicit in the scoutoperating system. In Proceedings of the 2nd USENIX Symposium onOperating Systems Design and Implementation, OSDI ’96, pages 153–167,1996. → pages 21[85] M. Nanavati, J. Wires, and A. Warfield. Decibel: Isolation and sharing indisaggregated rack-scale storage. In Proceedings of the 14th USENIXSymposium on Networked Systems Design and Implementation, NSDI ’17,pages 17–33. USENIX Association, 2017. → pages 9[86] D. Narayanan, A. Donnelly, and A. Rowstron. Write off-loading: Practicalpower management for enterprise storage. ACM Transactions on Storage(TOS), 4(3):10, 2008. → pages 62[87] D. Narayanan, A. Donnelly, E. Thereska, S. Elnikety, and A. I. T.Rowstron. Everest: Scaling down peak loads through i/o off-loading. InProceedings of the 8th USENIX Conference on Operating Systems Designand Implementation, OSDI ’08, pages 15–28. USENIX Association, 2008.→ pages 99, 105[88] E. B. Nightingale, J. Elson, J. Fan, O. Hofmann, J. Howell, and Y. Suzue.Flat datacenter storage. In Proceedings of the 10th USENIX Conference onOperating Systems Design and Implementation, OSDI ’12, pages 1–15,Berkeley, CA, USA, 2012. USENIX Association. → pages 42, 105[89] Q. Niu, J. Dinan, Q. Lu, and P. Sadayappan. Parda: A fast parallel reusedistance analysis algorithm. In Proceedings of the 2012 IEEE 26thInternational Parallel & Distributed Processing Symposium, IPDPS ’12,pages 1284–1294. IEEE, 2012. → pages 46, 50, 62, 72122[90] D. Ongaro, S. M. Rumble, R. Stutsman, J. K. Ousterhout, andM. Rosenblum. Fast crash recovery in RAMCloud. In Proceedings of the23rd ACM Symposium on Operating Stystems Principles, SOSP ’11, pages29–41. ACM, 2011. → pages 105[91] J. Ousterhout, P. Agrawal, D. Erickson, C. Kozyrakis, J. Leverich,D. Mazières, S. Mitra, A. Narayanan, D. Ongaro, G. Parulkar,M. Rosenblum, S. M. Rumble, E. Stratmann, and R. Stutsman. The casefor ramcloud. Communications of the ACM, 54(7):121–130, July 2011. →pages 42[92] C. Petersen. Introducing Lightning: A flexible NVMe JBOF, March 2016.https://code.facebook.com/posts/989638804458007/introducing-lightning-a-flexible-nvme-jbof/. Visited July 2017. → pages 75[93] M. K. Qureshi and Y. N. Patt. Utility-based cache partitioning: Alow-overhead, high-performance, runtime mechanism to partition sharedcaches. In Proceedings of the 39th Annual IEEE/ACM InternationalSymposium on Microarchitecture, pages 423–432. IEEE Computer Society,2006. → pages 47[94] A. Rowstron and P. Druschel. Pastry: Scalable, decentralized objectlocation, and routing for large-scale peer-to-peer systems. In Proceedingsof the IFIP/ACM International Conference on Distributed SystemsPlatforms, Middleware 2001, pages 329–350. Springer, 2001. → pages 105[95] Y. Saito, S. Frølund, A. Veitch, A. Merchant, and S. Spence. Fab: buildingdistributed enterprise disk arrays from commodity components. InProceedings of the 11th International Conference on Architectural Supportfor Programming Languages and Operating Systems, ASPLOS XI, pages48–58, New York, NY, USA, 2004. ACM. → pages 41, 105[96] J. R. Santos, R. R. Muntz, and B. A. Ribeiro-Neto. Comparing randomdata allocation and data striping in multimedia servers. In Proceedings ofthe 2000 ACM SIGMETRICS International Conference on Measurementand Modeling of Computer Systems, SIGMETRICS ’00, pages 44–55.ACM, 2000. → pages 105[97] T. Schiex, H. Fargier, and G. Verfaillie. Valued constraint satisfactionproblems: Hard and easy problems. In Proceedings of the 14thInternational Joint Conference on Artificial Intelligence, IJCAI ’95, pages631–639. Morgan Kaufmann, 1995. → pages 85123[98] SCSI Object-Based Storage Device Commands - 2, 2011.http://www.t10.org/members/w_osd-.htm. Visited July 2017. → pages 25[99] Seagate Kinetic Open Storage Documentation. http://www.seagate.com/ca/en/tech-insights/kinetic-vision-how-seagate-new-developer-tools-meets-the-needs-of-cloud-storage-platforms-master-ti/.Visited July 2017. → pages 17, 25[100] B. Seo and R. Zimmermann. Efficient disk replacement and data migrationalgorithms for large disk subsystems. ACM Transactions on Storage, 1(3):316–345, 2005. → pages 105[101] M. Shamma, D. T. Meyer, J. Wires, M. Ivanova, N. C. Hutchinson, andA. Warfield. Capo: Recapitulating storage for virtual desktops. FAST ’11,pages 31–45. USENIX, 2011. → pages 9, 68, 99[102] X. Shen, Y. Zhong, and C. Ding. Locality phase prediction. In Proceedingsof the 11th International Conference on Architectural Support forProgramming Languages and Operating Systems, ASPLOS XI, pages165–176. ACM, 2004. → pages 50[103] X. Shen, J. Shaw, B. Meeker, and C. Ding. Locality approximation usingtime. In Proceedings of the 34th Annual ACM SIGPLAN-SIGACTSymposium on Principles of Programming Languages, POPL ’07, pages55–61. ACM, 2007. → pages 50[104] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The hadoop distributedfile system. In Proceedings of the 2010 IEEE 26th Symposium on MassStorage Systems and Technologies, MSST ’10, pages 1–10, Washington,DC, USA, 2010. IEEE Computer Society. → pages 105[105] J. Sievert. Iometer: The I/O performance analysis tool for servers, 2004.http://www.iometer.org. Visited July 2017. → pages 71[106] A. J. Smith. Two methods for the efficient analysis of memory addresstrace data. IEEE Transactions on Software Engineering, 3(1):94–101,1977. → pages 50, 73[107] G. Soundararajan, D. Lupei, S. Ghanbari, A. D. Popescu, J. Chen, andC. Amza. Dynamic resource allocation for database servers running onvirtual storage. In Proceedings of the 7th USENIX Conference on File andStorage Technologies, FAST ’09. USENIX, 2009. → pages 50124[108] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan.Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications.In Proceedings of the 2001 Conference on Applications, Technologies,Architectures, and Protocols for Computer Communications, SIGCOMM’01, Aug. 2001. → pages 105[109] H. S. Stone, J. Turek, and J. L. Wolf. Optimal partitioning of cachememory. IEEE Transactions on Computers, 41(9):1054–1068, 1992. →pages 47, 50, 64[110] J. D. Strunk, E. Thereska, C. Faloutsos, and G. R. Ganger. Using utility toprovision storage systems. In Proceedings of the 6th USENIX Conferenceon File and Storage Technologies, FAST ’08, pages 313–328. USENIX,2008. → pages 77, 106[111] V. Tarasov, S. Kumar, J. Ma, D. Hildebrand, A. Povzner, G. Kuenning, andE. Zadok. Extracting flexible, replayable models from large block traces.In Proceedings of the 10th USENIX Conference on File and StorageTechnologies, FAST ’12, 2012. → pages 73[112] C. A. Thekkath, T. Mann, and E. K. Lee. Frangipani: a scalable distributedfile system. In Proceedings of the 16th ACM Symposium on OperatingSystems Principles, SOSP ’97, pages 224–237, 1997. → pages 41[113] V. Vasudevan, M. Kaminsky, and D. G. Andersen. Using vector interfacesto deliver millions of iops from a networked key-value storage server. InProceedings of the 3rd ACM Symposium on Cloud Computing, SoCC ’12,pages 8:1–8:13, New York, NY, USA, 2012. ACM. → pages 41[114] X. Wang, Y. Li, Y. Luo, X. Hu, J. Brock, C. Ding, and Z. Wang. Optimalfootprint symbiosis in shared cache. In 2015 15th IEEE/ACM InternationalSymposium on Cluster, Cloud, and Grid Computing, CCGRID ’15, pages412–422. IEEE Computer Society, 2015. → pages 99[115] S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, and C. Maltzahn.Ceph: A scalable, high-performance distributed file system. In Proceedingsof the 7th Symposium on Operating Systems Design and Implementation,OSDI ’06, pages 307–320. USENIX Association, 2006. → pages 42, 105[116] S. A. Weil, S. A. Brandt, E. L. Miller, and C. Maltzahn. CRUSH:Controlled, scalable, decentralized placement of replicated data. InProceedings of the 2006 ACM/IEEE Conference on Supercomputing, SC’06, New York, NY, USA, 2006. ACM. → pages 105125[117] S. A. Weil, A. W. Leung, S. A. Brandt, and C. Maltzahn. RADOS: ascalable, reliable storage service for petabyte-scale storage clusters. InProceedings of the 2nd International Workshop on Petascale Data Storage,PDSW ’07, pages 35–44. ACM Press, 2007. → pages 42, 105[118] A. Whitaker, M. Shaw, and S. D. Gribble. Denali: A scalable isolationkernel. In Proceedings of the 10th ACM SIGOPS European Workshop,2002. → pages 40[119] J. Wilkes, R. A. Golding, C. Staelin, and T. Sullivan. The HP AutoRAIDhierarchical storage system. ACM Transactions on Computer Systems, 14(1):108–136, 1996. → pages 105[120] J. Wires and A. Warfield. Mirador: An active control plane for datacenterstorage. In Proceedings of the 15th USENIX Conference on File andStorage Technologies, FAST ’17, pages 213–228. USENIX Association,2017. → pages iv, 9, 75[121] J. Wires, M. Spear, and A. Warfield. Exposing file system mappings withmapfs. In Proceedings of the 3rd USENIX Workshop on Hot Topics inStorage and File Systems, HotStorage ’11. USENIX Association, 2011. →pages 9[122] J. Wires, S. Ingram, Z. Drudi, N. J. A. Harvey, and A. Warfield.Characterizing storage workloads with counter stacks. In Proceedings ofthe 11th USENIX Conference on Operating Systems Design andImplementation, OSDI ’14, pages 335–349. USENIX Association, 2014.→ pages iv, 9, 44, 81[123] J. Wires, P. Ganesan, and A. Warfield. Sketches of space: Ownershipaccounting for shared storage. In Proceedings of the 8th ACM Symposiumon Cloud Computing, SoCC ’17. ACM, 2017. → pages 9[124] T. M. Wong and J. Wilkes. My cache or yours? making storage moreexclusive. In USENIX Annual Technical Conference, ATC ’02, pages161–175. USENIX, 2002. → pages 97[125] X. Xiang, B. Bao, C. Ding, and Y. Gao. Linear-time modeling of programworking set in shared cache. In Proceedings of the 2011 InternationalConference on Parallel Architectures and Compilation Techniques, PACT’11, pages 350–360. IEEE, 2011. → pages 62, 72126[126] X. Xiang, C. Ding, H. Luo, and B. Bao. HOTL: a higher order theory oflocality. In Proceedings of the 18th International Conference onArchitectural Support for Programming Languages and Operating Systems,ASPLOS ’13, pages 343–356. ACM, 2013. → pages 47, 73[127] J. Yang, D. B. Minturn, and F. Hady. When poll is better than interrupt. InProceedings of the 10th USENIX Conference on File and StorageTechnologies, FAST ’12, pages 3–10, Berkeley, CA, USA, 2012. USENIXAssociation. → pages 41[128] T. Yang, E. D. Berger, S. F. Kaplan, and J. E. B. Moss. CRAMM: Virtualmemory support for garbage-collected applications. In Proceedings of the7th USENIX Symposium on Operating Systems Design andImplementation, OSDI ’06, pages 103–116. ACM, 2006. → pages 50[129] W. Zhao, X. Jin, Z. Wang, X. Wang, Y. Luo, and X. Li. Low cost workingset size tracking. In Annual Technical Conference, ATC ’11, pages223–228. USENIX, 2011. → pages 50[130] W. Zheng and G. Zhang. Fastscale: Accelerate RAID scaling byminimizing data migration. In Proceedings of the 9th USENIX Conferenceon File and Storage Technologies, FAST ’11, pages 149–161. USENIX,2011. → pages 105[131] Y. Zhong, M. Orlovich, X. Shen, and C. Ding. Array regrouping andstructure splitting using whole-program reference affinity. In Proceedingsof the ACM SIGPLAN 2004 Conference on Programming Language Designand Implementation, PLDI ’04, pages 255–266. ACM, 2004. → pages 50[132] P. Zhou, V. Pandey, J. Sundaresan, A. Raghuraman, Y. Zhou, and S. Kumar.Dynamic tracking of page miss ratio curve for memory management. InProceedings of the 11th International Conference on Architectural Supportfor Programming Languages and Operating Systems, ASPLOS XI, pages177–188. ACM, 2004. → pages 47, 50[133] Y. Zhou, J. Philbin, and K. Li. The multi-queue replacement algorithm forsecond level buffer caches. In USENIX Annual Technical Conference, ATC’01, pages 91–104, 2001. → pages 48127


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items