Extending Systems with Virtual Hardware Aggregation. Cully, Brendan. 2017.

Extending Systems with Virtual Hardware Aggregation

by

Brendan Cully

M.Sc., Computer Science, The University of British Columbia, 2007
B.A., Computer Science, New York University, 2001

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Doctor of Philosophy

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Computer Science)

The University of British Columbia
(Vancouver)

February 2017

© Brendan Cully, 2017

Abstract

Hardware has physical limitations. It has fixed performance limits, and may fail. Applications suffer from the limitations of the physical hardware on which they run. Making applications able to take advantage of multiple hardware instances to avoid these limitations is complex. Since this effort must be expended for every application, it is impractical for most of them. In this thesis, we show that we can aggregate multiple physical machines at the virtual machine interface, allowing them to transcend the limitations of single machines without changing the applications themselves.

Preface

Chapters 2, 3, and 4 of this dissertation are based on publications at peer-reviewed academic conferences. They have been edited for formatting and expanded to clarify some points that could not be made to fit within the page count of the conference submissions. Some new material has been added to cover technology changes that have taken place since these papers were first published.

Chapter 2

A version of Chapter 2 has been published in [26]. I was the lead investigator, responsible for all aspects of the research. I presented this work at NSDI in 2008. My coauthors assisted in evaluation and contributed to system design and manuscript composition.

Chapter 3

A version of Chapter 3 has been published in [73]. I was not the principal author but was heavily involved in both the design of the project and the writing of the paper.

Chapter 4

A version of Chapter 4 has been published in [28]. This was a joint work with several authors.
I was the lead author and presented this work at FAST in 2014.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication

1 Introduction
  1.1 Remus
  1.2 SecondSite
  1.3 Strata

2 Remus: High Availability via Asynchronous Virtual Machine Replication
  2.1 Introduction
    2.1.1 Goals
    2.1.2 Approach
  2.2 Design and Implementation
    2.2.1 Failure Model
    2.2.2 Pipelined Checkpoints
    2.2.3 Checkpoint Consistency
    2.2.4 Memory and CPU
    2.2.5 Network Buffering
    2.2.6 Disk Buffering
    2.2.7 Detecting Failure
  2.3 Evaluation
    2.3.1 Test Environment
    2.3.2 Correctness Verification
    2.3.3 Benchmarks
    2.3.4 Potential Optimizations
  2.4 Related Work
  2.5 Future Work
  2.6 Conclusion

3 SecondSite: Disaster Tolerance as a Service
  3.1 Introduction
  3.2 Challenges of Wide-Area Disaster Tolerance
    3.2.1 Challenges of Wide-Area Networks
    3.2.2 Bandwidth-Efficient Replication
    3.2.3 Failure Detection
    3.2.4 Failure Recovery
  3.3 SecondSite Design
    3.3.1 Reducing Replication Overhead
    3.3.2 Failure Detection in SecondSite
    3.3.3 Failure Recovery
    3.3.4 Seamless Failback
  3.4 Evaluation on a WAN Testbed
    3.4.1 Regression Tests
    3.4.2 Protecting a Cluster of VMs
    3.4.3 DRBD Resynchronization Delays
    3.4.4 Resource Consumption & Throughput
    3.4.5 Throughput versus Replication Latency
  3.5 Related Work
    3.5.1 Replication Overhead
    3.5.2 Minimizing Downtime
    3.5.3 IP Migration across WAN
  3.6 Conclusion

4 Strata: Scalable High-Performance Storage on Virtualized Non-volatile Memory
  4.1 Introduction
  4.2 Architecture
  4.3 Data Paths
    4.3.1 The Virtual Address Map
    4.3.2 Dispatch
    4.3.3 Coherence
  4.4 Network Attached Disks
  4.5 Online Reconfiguration
    4.5.1 Object Reconfiguration
    4.5.2 System Reconfiguration
  4.6 Storage Protocols
    4.6.1 Scalable NFS
    4.6.2 SDN Protocol Scaling
  4.7 Evaluation
    4.7.1 Test Environment
    4.7.2 Baseline Performance
    4.7.3 Scalability
    4.7.4 Node Failure
    4.7.5 Protocol Overhead
    4.7.6 Effect of CPU on Performance
  4.8 Related Work
  4.9 Conclusion

5 Conclusion
  5.1 Future Work

Bibliography
List of Tables

Table 3.1 Differences Between Local and Wide-Area Networks
Table 3.2 Failure Scenarios and Downtimes using Quorum Servers
Table 3.3 The Regression Test Suite
Table 3.4 Resource Consumption in a Multi-VM Environment
Table 4.1 Random IO Performance on Strata versus KNFS
Table 4.2 Achieved IOPS on an 80/20 Random 4K Workload across 2 MicroArrays

List of Figures

Figure 2.1 Speculative Execution and Asynchronous Replication in Remus
Figure 2.2 Remus: High-Level Architecture
Figure 2.3 Network Buffering in Remus
Figure 2.4 Disk Write Buffering in Remus
Figure 2.5 Checkpoint Time Relative to Pages Dirtied
Figure 2.6 Kernel Build Time by Checkpoint Frequency
Figure 2.7 SPECweb Scores by Checkpoint Frequency (Native Score: 305)
Figure 2.8 The Effect of Network Delay on SPECweb Performance
Figure 2.9 The Effect of Disk Replication on Postmark Performance
Figure 2.10 Comparison of Bandwidth Requirements Using Various Compression Schemes
Figure 3.1 State Transitions of a Quorum Node
Figure 3.2 SecondSite Setup over WAN
Figure 3.3 SecondSite Failover and Failback of a Set of Heterogeneous Workloads
Figure 3.4 OLTP Disk Resynchronization Costs
Figure 3.5 Replication Bandwidth Consumption before Failover (4 VMs x 100 Clients/VM)
Figure 3.6 Cost of HA as a Function of Application Load (OLTP Workload with 100 Concurrent Users)
Figure 3.7 Impact of Replication Link Latency on Application Throughput (SPECweb with 100 Clients)
Figure 4.1 Strata Network Storage Architecture
Figure 4.2 Hardware View of a Strata Deployment
Figure 4.3 Virtual Object to Physical Object Range Mapping
Figure 4.4 IOPS over Time, Read-Only Workload
Figure 4.5 IOPS over Time, 80/20 Read/Write Workload
Figure 4.6 IOPS over Time, Read-Only Workload with Random Placement
Figure 4.7 Aggregate Bandwidth for 80/20 Clients during Failover and Recovery

Acknowledgements

I am deeply indebted to my supervisor, Andrew Warfield, for his wise counsel, his humour, his boundless enthusiasm, and for being so generous with his time and energy. He has been an inspiration and a friend. My thanks also to my committee members, Norm Hutchinson and Bill Aiello, for all of their wisdom, patience, and guidance over the years. I would also like to thank the other members of my examining committee, Hakim Weatherspoon, Sasha Fedorova, and Sathish Gopalakrishnan, for their time and insight.

I am lucky to have had wonderful lab mates over the years, especially Geoffrey Lefebvre and Dutch Meyer. They are great hackers, critical thinkers, and excellent companions in the lab, at conferences, and during the late nights before paper deadlines. I look forward to crossing paths with them again in the future.

Dedication

To Lisa, Eloise, and Harriet, for all your love and support for all these years. I could never have done it without you.

Chapter 1

Introduction

As individual computers increased in power, it became more difficult for single applications to saturate them. Virtualization has been effective at addressing this problem by multiplexing as many virtual hosts as needed onto a single physical host to get good utilization.
But this approach also brings new problems. For example, if the physical host suffers a hardware failure, that will affect many virtual machines. On top of that, application load is often very dynamic, but physical resources are static, making it difficult to achieve high utilization without also risking resource starvation.

Fortunately, virtualization also contains the seeds of a solution to the problems that it creates. First, virtualization makes it straightforward to capture the entire state of a running virtual machine. Second, because the virtual hardware interface abstracts physical hardware, it makes it possible to move that state from one physical host to another. In short, the virtual machine is decoupled from the host on which it runs. This means that multiple hosts can be assembled into clusters, and the virtualization platform can arrange to run virtual machines on different hosts as load shifts or in order to perform maintenance on individual servers. It is even possible to move virtual machines around while they are running, with the machines themselves oblivious to the physical hosts on which they run. The mechanism for doing this is called live migration.

Using live migration, a single virtual machine can use the resources of more than one physical machine. This allows the virtual machine to escape from some of the limitations of the physical hardware on which it runs. But this is only a partial escape. For instance, while live migration can be used to perform hardware maintenance without interrupting service to a virtual machine, unplanned hardware failures will still result in the failure of the virtual machines that depend on them. This thesis extends and generalizes the idea, taken from live migration, that virtualization can allow virtual hardware to exist on more than one physical device.
It asserts that by aggregating multiple computers behind the virtual machine interface, virtual machines can transcend the physical limitations of execution on a single computer.

It takes the form of three case studies. In the first, I use virtual machine snapshots and state replication to maintain a single virtual machine on two hosts, allowing it to tolerate the physical failure of either. This is fundamentally different from live migration because unplanned failures of physical hardware have no effect on the virtual machine — the hardware itself has been made redundant at the virtualization layer. In the second, I extend that mechanism to insulate applications from the failure of an entire data center, incorporating a wide-area distributed system into the virtual hardware in the process. And in the final study, I transfer this approach from host virtualization to storage devices to make more efficient use of newly powerful storage technology, providing fault tolerance and live data migration as intrinsic services to virtual storage devices.

The following three chapters are edited versions of the conference papers we published for the three projects I have just outlined. Each chapter describes an entire system, not just the parts of the project that are relevant to this thesis. With that in mind, I have provided below an overview of the way each project fits into the approach of this thesis, in more detail than the brief project descriptions above.

1.1 Remus

NSDI 2008 (Best Paper)
Brendan Cully, Geoffrey Lefebvre, Dutch Meyer, Mike Feeley, Norm Hutchinson, and Andrew Warfield

Hardware fails, and when it does, it takes whatever was running on it out of service. Depending on the application, the costs can be catastrophically high. Credit card processing outages have costs measured in millions per minute [2010, 6.2 million/minute for Visa]. The costs of failures to other systems, like air traffic control or anesthesia monitoring, are measured in lives. For this reason, these systems are
For this reason, these systems aremade highly available. They use specialized, rigorously tested code to ensure thatwhen hardware fails, execution can resume on a redundant system with as littleinterruption as possible. Often, this requires custom hardware to replicate state,detect faults, and reroute through backup components.Custom hardware is inherently expensive, and so is the code required to providehigh availability in software. For applications like those above, the costs of imple-menting high availability are lower than the costs of failure. But because of theenormous fixed costs of high availability, it is hard to justify when the effects offailure are not quite as widespread or severe. Often, the owners of a system simplycannot afford it.Over time, techniques have been developed to somewhat reduce the fixed costsof high availability. In 1995, Bressoud and Schneider observed that virtualizationallowed the mechanisms employed for hardware fault tolerance to be applied insoftware. In particular, they used virtualization to simulate instruction-level replayon a second processor, carefully capturing all external events and injecting themwith enough precision to ensure that execution was deterministic. This propertycan be difficult to provide depending on the hardware architecture, but is crucialto the success of the system. For example, consider the Intel RDTSC instruction,which returns the number of cycles since the processor was last reset. This hasunpredictable output that must be explicitly captured and replayed to ensure thatthe state of the replayed machine does not diverge from the original, and it is an3unprivileged instruction that would normally not cause a trap to the hypervisor.Instructions like these mean that simply virtualizing the processor is not enough toprovide replay — the instruction set architecture must be thoroughly audited forsources of non-determinism, and custom code installed for each occurrence. 
If any case is mishandled, the result is silently fatal divergence on the replicated machine. So while this approach is cheaper and more general than custom hardware, it is essentially the same mechanism and has similar drawbacks. And crucially, this approach is very expensive for SMP (symmetric multiprocessing) systems, where synchronizing memory accesses between cores has so far proven to be too slow to be practical. While SMP support was less of a problem in 1995, it is now a requirement for most systems.

Stepping back a bit, there are two essential requirements for a fault tolerant system: replicated state (the contents of memory, processor registers, and disk), and synchronization between systems so that the replica can resume execution from the point that failure is detected in the primary system. The approach described above, synchronized instruction replay, is one mechanism to provide these two properties. It relies on the processor to recreate state given only system inputs, and consequently must take care to work around any non-deterministic instructions. The advantage of this approach is that only the inputs need to be captured, but the disadvantage is that it is difficult to provide completely deterministic execution, particularly on SMP systems.

In Remus, we consider what might happen if we attempt to copy the whole system state directly, instead of recreating it through deterministic execution of recorded input. At first glance this seems highly impractical — since memory is much faster than the network, synchronous replication would slow the system to a crawl. Remus solves this problem with speculative execution.
It allows the primary to run at full speed most of the time, but takes periodic consistent snapshots of local state. It replicates these snapshots asynchronously while the primary resumes execution. Execution is synchronized from the point of view of external users by buffering the output of the primary until the state on which it depends has been completely replicated on the backup. This approach allows Remus to completely sidestep the problem of deterministic execution, which in turn greatly simplifies its implementation, makes it much more portable, and allows it to easily support SMP systems.

The approach used in Remus is new compared to hardware fault tolerance, and even to previous virtualization approaches. Instead of treating the virtual machine as simply a more malleable version of the underlying physical hardware, it uses virtualization to abstract away the underlying hardware as much as it can. Abstracting the hardware makes it possible to manipulate its state in powerful ways, and the following projects explore this further.

1.2 SecondSite

VEE 2012
Shriram Rajagopalan, Brendan Cully, Ryan O'Connor, and Andrew Warfield

Remus was built from a profoundly simple abstraction of machine execution: consistent snapshots and asynchronous replication. That abstraction demonstrated itself to be powerful enough to provide high availability, but I wondered whether we could do more with it. A natural extension of the problem domain was to attempt to provide not just high availability but disaster tolerance: that is, to insulate a system against not just the failure of a single server, but the loss of the entire data center that hosts it.

While the problems seem similar, disaster tolerance presents several new challenges. For a start, it has to contend with much greater latency and lower bandwidth between primary and secondary systems.
Latency alone has made it extremely difficult to supply the level of protection that Remus provides at the server level. Disaster recovery systems typically measure themselves in terms of RPO (recovery point objective) and RTO (recovery time objective), where RPO defines how much data may be lost in the event of failure, and RTO defines how long a system may be out of service. Some amount of data loss and service outage is generally a given.

In Remus, replication is asynchronous, so latency on the replication link is much less of an issue. But while this opens the door to disaster tolerance, there are several other challenges that must be dealt with. Limited bandwidth is an obvious problem. Whole-system snapshots are large, and while the frequency at which they are taken can be adjusted to amortize the cost of transmitting them, the checkpoint frequency directly affects client latency. Failure detection is more subtle, since the replication link is more likely to be an unreliable internet route. Finally, even if we can replicate state fast enough and detect failures accurately, actually rerouting traffic to the backup is much more complicated over the wide area.

SecondSite extended Remus to address each of these problems. To deal with limited bandwidth, we added checkpoint compression. This was straightforward in principle, if not in engineering time. For failure detection, we could no longer rely on the replication link to signal failure, so we required an external tiebreaker. There are several different ways to get this; the approach we chose, for reasons discussed in detail in the conference paper, was to create a quorum of witness nodes in the cloud. The most difficult problem we had to deal with was how to reroute traffic on failure. This was a trivial problem in Remus, where a gratuitous ARP (address resolution protocol) advertisement sufficed to redirect traffic from the upstream switch.
In the wide area, routing is a distributed system coordinated through BGP (Border Gateway Protocol). This makes it difficult to redirect traffic quickly enough to hide failures from clients. Only after much experimentation (and more than one call from our upstream ISP after we broke their network) did we find a mechanism that was reliable and fast enough for our needs.

SecondSite takes the idea of this thesis, that virtual hardware can span physical devices, much further than Remus. In the most literal sense, it allows a virtual machine to span hundreds of kilometers across sites. But to do that, it also extends the virtual machine in ways that are impossible for real hardware, coordinating not just state replication but also internet-based quorum and internet routing changes within the space between the virtual and real machines.

1.3 Strata

FAST 2014
Brendan Cully, Jake Wires, Dutch Meyer, Kevin Jamieson, Keir Fraser, Tim Deegan, Daniel Stodden, Geoffrey Lefebvre, Daniel Ferstay, and Andrew Warfield

After decades of very slow improvement, storage technology has recently become dramatically faster. Individual non-volatile memory devices based on flash or phase change memory can provide one hundred times the throughput of magnetic hard drives, at one thousandth the latency per request. As a result, storage devices have found themselves in a position similar to the one that previously motivated the rise of CPU (central processing unit) virtualization: they have become too powerful to be saturated by a single application, and too expensive to leave idle.

Observing this, Strata approaches high-performance storage as a virtualization problem. As with CPU virtualization, the first priority is to make full use of the devices by multiplexing many workloads onto them.
And again, this implies that the workloads should be decoupled from their physical devices, both to respond to dynamic load changes and to protect them from device failures.

But before we can do any of that, we need a virtualization layer for storage devices. Because storage has only recently become powerful enough to take advantage of virtualization, there is no preexisting mature virtualization layer on which to build. Therefore Strata implements its own software virtualization layer.

Building our own virtualization layer gave us the opportunity to design its interface, rather than conforming to an existing standard. The basic interface is dictated by the standard principle of virtualization that it should resemble an idealized, simplified version of a hardware interface. In Strata, the fundamental interface is virtual devices of effectively infinite size, each with its own sparse virtual address space. But we also considered location independence to be a first-class property of the Strata virtualization layer, and so the IO (input/output) interface can address any object on any device regardless of where the request is initiated.

As was the case in the prior two projects, decoupling made it possible to span virtual devices across physical devices in several ways. Although device aggregation techniques like RAID (redundant array of inexpensive disks) have been features of storage systems for a long time, they are inflexible. They can tolerate device failures and provide improved throughput, but, for example, they do not adapt to dynamic load changes. In contrast, Strata provides not only fault tolerance but also virtual device migration, as a natural consequence of its architecture. It makes aggressive use of this to adapt dynamically to both changing workloads and changing hardware — it can incorporate new devices to increase total system throughput in the same way that it can tolerate the loss of old ones, without interrupting service to the virtual devices.
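The kind of sparse, location-independent address map described above can be illustrated with a small sketch. This is a toy Python model under invented names (the classes, device names, and offsets are all hypothetical), not Strata's actual data structures; it shows only the idea that a virtual extent resolves to a (device, physical offset) pair, and that migration is just a remap.

```python
import bisect

class VirtualAddressMap:
    """Toy sparse map from a virtual object's address space to
    (device, physical offset) extents."""
    def __init__(self):
        self.starts = []    # sorted virtual start offsets
        self.extents = {}   # start -> (length, device, phys_offset)

    def map_range(self, vstart, length, device, phys):
        bisect.insort(self.starts, vstart)
        self.extents[vstart] = (length, device, phys)

    def resolve(self, voff):
        # Find the extent containing voff, if any (the space is sparse).
        i = bisect.bisect_right(self.starts, voff) - 1
        if i >= 0:
            start = self.starts[i]
            length, device, phys = self.extents[start]
            if voff < start + length:
                return device, phys + (voff - start)
        raise KeyError(f"unmapped virtual offset {voff:#x}")

vmap = VirtualAddressMap()
vmap.map_range(0x0000, 0x1000, "ssd-a", 0x8000)
vmap.map_range(0x100000, 0x1000, "ssd-b", 0x0)   # sparse: large hole in between

assert vmap.resolve(0x0010) == ("ssd-a", 0x8010)
assert vmap.resolve(0x100200) == ("ssd-b", 0x200)

# Migration is just remapping: point the extent at a new device, and
# subsequent IO is dispatched there with no change visible to the client.
vmap.extents[0x0000] = (0x1000, "ssd-c", 0x4000)
assert vmap.resolve(0x0010) == ("ssd-c", 0x4010)
```

Because clients address the virtual space rather than a device, the final remap models live data migration: requests issued after the remap simply land on the new device.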
Storage device aggregation through virtualization, as implemented in Strata, suggests that the properties we explored in the prior two projects are general properties of virtualization, not just artifacts of the virtual machine environment in which those projects were developed.

Chapter 2

Remus: High Availability via Asynchronous Virtual Machine Replication

2.1 Introduction

Highly available systems are the purview of the very rich and the very scared. However, the desire for reliability is pervasive, even among system designers with modest resources.

Unfortunately, high availability is hard — it requires that systems be constructed with redundant components that are capable of maintaining and switching to backups in the face of failure. Commercial high availability systems that aim to protect modern servers generally use specialized hardware, customized software, or both (e.g., [45]). In each case, the ability to transparently survive failure is complex and expensive enough to prohibit deployment on common servers.

This chapter describes Remus, a software system that provides OS (operating system)- and application-agnostic high availability on commodity hardware. Our approach capitalizes on the ability of virtualization to migrate running VMs (virtual machines) between physical hosts [22], and extends the technique to replicate snapshots of an entire running OS instance at very high frequencies — as often as every 25ms — between a pair of physical machines. Using this technique, our system discretizes the execution of a VM into a series of replicated snapshots. External output, specifically transmitted network packets, is not released until the system state that produced it has been replicated.

Virtualization makes it possible to create a copy of a running machine, but it does not guarantee that the process will be efficient.
Propagating state synchronously at every change is impractical: it effectively reduces the throughput of memory to that of the network device performing replication. Rather than running two hosts in lock-step [12], we allow a single host to execute speculatively and then checkpoint and replicate its state asynchronously. System state is not made externally visible until the checkpoint is committed — we achieve high-speed replicated performance by effectively running the system tens of milliseconds in the past.

The contribution of this chapter is a practical one. Whole-system replication is a well-known approach to providing high availability. However, it has usually been considered significantly more expensive than application-specific checkpointing techniques that replicate only relevant data [56]. Our approach may be used to bring HA (high availability) "to the masses" as a platform service for virtual machines. In spite of the hardware and software constraints under which it operates, this system provides protection equal to or better than expensive commercial offerings. Many existing systems only actively mirror persistent storage, requiring applications to perform recovery from crash-consistent persistent state. In contrast, Remus ensures that regardless of the moment at which the primary fails, no externally visible state is ever lost.

2.1.1 Goals

Remus aims to make mission-critical availability accessible to mid- and low-end systems. By simplifying provisioning and allowing multiple servers to be consolidated on a smaller number of physical hosts, virtualization has made these systems more popular than ever. However, the benefits of consolidation come with a hidden cost in the form of increased exposure to hardware failure.
Remus addresses this by commodifying high availability as a service offered by the virtualization platform itself, providing administrators of individual VMs with a tool to mitigate the risks associated with virtualization.

Remus's design is based on the following high-level goals:

Generality. It can be prohibitively expensive to customize a single application to support high availability, let alone the diverse range of software upon which an organization may rely. To address this issue, high availability should be provided as a low-level service, with common mechanisms that apply regardless of the application being protected or the hardware on which it runs.

Transparency. The reality in many environments is that OS and application source may not even be available to modify. To support the broadest possible range of applications with the smallest possible barrier to entry, high availability should not require that OS or application code be modified to support facilities such as failure detection or state recovery.

Seamless failure recovery. No externally visible state should ever be lost in the case of single-host failure. Furthermore, failure recovery should proceed rapidly enough that it appears as nothing more than temporary packet loss from the perspective of external users. Established TCP (transmission control protocol) connections should not be lost or reset.

These are lofty goals, entailing a degree of protection well beyond that provided by common HA systems, which are based on asynchronous storage mirroring followed by application-specific recovery code. Moreover, the desire to implement this level of availability without modifying the code within a VM necessitates a very coarse-grained approach to the problem.
A final and pervasive goal is that the system realize these aims while providing deployable levels of performance, even on the SMP hardware that is common in today’s servers.

Figure 2.1: Speculative Execution and Asynchronous Replication in Remus

2.1.2 Approach

Remus runs paired servers in an active-passive configuration. We employ three major techniques in order to overcome the difficulties traditionally associated with this approach. First, we base our system on a virtualized infrastructure to facilitate whole-system replication. Second, we increase system performance through speculative execution, which decouples external output from synchronization points. The third technique is asynchronous replication, which allows the primary server to remain productive while the state of the machine is being transmitted to the backup. The basic stages of operation in Remus are given in Figure 2.1.

VM-based whole-system replication. Hypervisors have been used to build HA systems before [12]. In that work, virtualization is used to run a pair of systems in lock-step, and additional support has been added to ensure that VMs on a pair of physical hosts follow a deterministic path of execution: external events are carefully injected into both the primary and fallback VMs so that they result in identical states. Enforcing such deterministic execution suffers from two fundamental problems. First, it is highly architecture-specific, requiring that the system have a comprehensive understanding of the instruction set being executed and the sources of external events. Second, it results in an unacceptable overhead when applied in multi-processor systems, where shared-memory communication between processors must be accurately tracked and propagated [30].

Speculative execution.
Replication may be achieved either by copying the state of a system or by replaying input deterministically. We believe the latter to be impractical for real-time operation, especially in a multi-processor environment. Therefore, Remus does not attempt to make computation deterministic — there is a very real possibility that the output produced by a system after a given checkpoint will be different if the system is rolled back to that checkpoint and its input is replayed. However, the state of the replica needs to be synchronized with the primary only when the output of the primary has become externally visible. Instead of letting the normal output stream dictate when synchronization must occur, we can buffer output until a more convenient time, performing computation speculatively ahead of synchronization points. This allows a favorable trade-off to be made between output latency and runtime overhead, the degree of which may be controlled by the administrator.

Asynchronous replication. Buffering output at the primary server allows replication to be performed asynchronously. The primary host can resume execution at the moment its machine state has been captured, without waiting for acknowledgment from the remote end. Overlapping normal execution with the replication process yields substantial performance benefits. This enables efficient operation even when checkpointing at intervals on the order of tens of milliseconds.

2.2 Design and Implementation

Figure 2.2 shows a high-level view of our system. We begin by encapsulating the machine to be protected within a VM. Our implementation is based on the Xen virtual machine monitor [10], and extends Xen’s support for live migration to provide fine-grained checkpoints. An initial subset of our checkpointing support has been accepted into the upstream Xen source. Remus achieves high availability by propagating frequent checkpoints of an active VM to a backup physical host.
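The latency/overhead trade-off can be illustrated with a back-of-the-envelope model (our own simplification, not a formula from this chapter): a packet generated at a uniformly random point within an epoch waits, on average, for half the checkpoint interval plus the time needed to replicate and acknowledge state before it may be released.

```python
def added_output_latency_ms(epoch_ms: float, replicate_ms: float) -> float:
    """Average extra delay seen by outbound packets under output buffering.

    A packet generated at a uniformly random point in an epoch waits for
    the remainder of the epoch (epoch_ms / 2 on average), then for the
    checkpoint to be replicated and acknowledged (replicate_ms).
    """
    return epoch_ms / 2.0 + replicate_ms

# At a 25 ms checkpoint interval with 10 ms of replication/acknowledgment
# time, outbound traffic sees roughly 22.5 ms of added delay on average.
latency = added_output_latency_ms(25.0, 10.0)
```

Shortening the epoch reduces output latency but increases checkpointing overhead, which is the administrator-controlled trade-off described above.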
Figure 2.2: Remus: High-Level Architecture

On the backup, the VM image is resident in memory and may begin execution immediately if failure of the active system is detected. Because the backup is only periodically consistent with the primary, all network output must be buffered until state is synchronized on the backup. (Remus buffers network and disk output. Other devices, such as the console or the serial port, are presumed to be used for local administration and therefore would not require buffering; however, nothing prevents these devices from being buffered as well.) When a complete, consistent image of the host has been received, this buffer is released to external clients. The checkpoint, buffer, and release cycle happens very frequently – we include benchmark results at frequencies up to forty times per second, representing a whole-machine checkpoint, including network and on-disk state, every 25 milliseconds.

Unlike transmitted network traffic, disk state is not externally visible. It must, however, be propagated to the remote host as part of a complete and consistent snapshot. To maintain disk replication, all writes to the primary disk are transmitted asynchronously to the backup, where they are buffered in RAM (random access memory) until the corresponding memory checkpoint has arrived. At that point, the complete checkpoint is acknowledged to the primary, which then releases outbound network traffic, and the buffered disk writes are applied to the backup disk.

It is worth emphasizing that the virtual machine does not actually execute on the backup host until a failure occurs. It simply acts as a receptacle for checkpoints of the active VM.
This consumes a relatively small amount of the backup host’s resources, allowing it to concurrently protect VMs running on multiple physical hosts in an N-to-1-style configuration. Such a configuration gives administrators a high degree of freedom to balance the degree of redundancy against resource costs.

2.2.1 Failure Model

System failure can manifest itself in a variety of ways:

1. Fail-stop. A fail-stop failure is one in which the system is no longer running after the failure occurs, and this is detectable by other systems.

2. Crash. In this case, other systems cannot tell whether a system has failed or simply become unreachable. Activating a backup in this scenario may result in a split-brain condition in which both systems are running, unaware of the other.

3. Byzantine failure. A Byzantine failure has no constraints on its behaviour. It may respond with incorrect output or even silently corrupt state.

We do not aim to recover from Byzantine failures or non-fail-stop conditions. As observed in [19], this class of approach provides complete system state capture and replication, and as such will propagate application errors to the backup. This is a necessary consequence of providing both transparency and generality. We also do not attempt to address crash failures or network partitions in this work. These are very difficult to handle in a two-node system without compromising either consistency or availability. Instead, we rely on redundant network cards, links, and switching if the primary and backup are not directly connected. Liveness is signalled over this link by the Remus control plane, which must be assumed to be working in order to provide system protection. In environments in which we cannot assume simple hardware redundancy is sufficient to distinguish between crashes and network partitions, we would need to introduce some form of arbitration to choose a live node and fence off a failed one.

Remus provides the following properties:

1.
The fail-stop failure of any single host is tolerable.

2. Should both the primary and backup hosts fail concurrently, the protected system’s data will be left in a crash-consistent state.

3. No output will be made externally visible until the associated system state has been committed to the replica.

Our goal is to provide completely transparent recovery from fail-stop failures of a single physical host. The compelling aspect of this system is that high availability may be easily retrofitted onto existing software running on commodity hardware. It uses a pair of commodity host machines, connected over redundant gigabit Ethernet connections, and survives the failure of any one of these components. By incorporating block devices into its state replication protocol, it avoids requiring expensive, shared network-attached storage for disk images.

Our failure model is identical to that of commercial HA products, which provide protection for virtual machines today [85, 90]. However, the degree of protection offered by these products is substantially less than that provided by Remus: existing commercial products respond to the failure of a physical host by simply rebooting the VM on another host from its crash-consistent disk state. Our approach survives failure on time frames similar to those of live migration, and leaves the VM running and network connections intact. Exposed state is not lost and disks are not corrupted.

2.2.2 Pipelined Checkpoints

Checkpointing a running virtual machine many times per second places extreme demands on the host system. Remus addresses this by aggressively pipelining the checkpoint operation. We use an epoch-based system in which execution of the active VM is bounded by brief pauses in execution in which changed state is atomically captured, and external output is released when that state has been propagated to the backup.
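The epoch cycle just described can be sketched in outline as follows. This is an illustrative sketch with hypothetical stub interfaces, not Xen’s or Remus’s actual API:

```python
class StubVM:
    """Hypothetical stand-in for the protected virtual machine."""
    def __init__(self):
        self.paused = False
    def pause(self):
        self.paused = True
    def copy_dirty_state(self):
        return {"pages": [1, 2, 3]}   # memory/CPU state changed this epoch
    def resume(self):
        self.paused = False

class StubBackup:
    """Hypothetical stand-in for the backup replication server."""
    def __init__(self):
        self.checkpoints = []
    def receive(self, state):
        self.checkpoints.append(state)
    def ack(self):
        return True   # checkpoint fully received and stored in memory

def remus_epoch(vm, backup, pending_output, released_output):
    """One checkpoint epoch: pause/copy, resume, replicate, ack, release."""
    vm.pause()                                # (1) capture changed state
    dirty = vm.copy_dirty_state()
    epoch_output = list(pending_output)       #     fence this epoch's output
    pending_output.clear()
    vm.resume()                               #     speculative execution resumes
    backup.receive(dirty)                     # (2) replicate asynchronously
    if backup.ack():                          # (3) backup acknowledges
        released_output.extend(epoch_output)  # (4) release buffered output
```

The important property is that output fenced in step (1) becomes externally visible only after step (3) has acknowledged the state that produced it.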
Referring back to Figure 2.1, this procedure can be divided into four stages: (1) Once per epoch, pause the running VM and copy any changed state into a buffer. This process is effectively the stop-and-copy stage of live migration [22], but as described later in this section it has been dramatically optimized for high-frequency checkpoints. With state changes preserved in a buffer, the VM is unpaused and speculative execution resumes. (2) Buffered state is transmitted and stored in memory on the backup host. (3) Once the complete set of state has been received, the checkpoint is acknowledged to the primary. Finally, (4) buffered network output is released.

The result of this approach is that execution is effectively discretized at checkpoint boundaries; the acknowledgment of a completed checkpoint by the backup triggers the release of network traffic that has been buffered and represents an atomic transition into the new epoch.

2.2.3 Checkpoint Consistency

Virtual machine checkpoints must be coordinated across a few different subsystems, each of which processes events asynchronously. Given this, it is worth understanding what kind of consistency Remus provides, and how it provides it. The property that we require is that the state of the machine, as captured in the checkpoint, is the same when the machine resumes execution, no matter which physical host it resumes execution on. This ensures that execution has not diverged prior to the point where failover may occur. This property can be provided by ensuring that the state in the checkpoint is not modified while the checkpoint is being taken. In Remus, we achieve this through a checkpoint hypercall in the virtual machine monitor. The hypercall transfers execution atomically from the virtual machine to the monitor, and at the start of the checkpoint hypercall, external events are paused, preventing network or disk events from modifying the virtual machine state.
But, as described in the subsequent section, checkpointing is paravirtualized in Remus and requires some cooperation with the guest. So, while external events are paused, the virtual machine monitor delivers a request to the guest to suspend itself. When the guest signals that it has suspended itself, no internal or external events can modify the virtual machine state until the hypervisor resumes guest execution. As long as the state of the machine is read during this period, the checkpoint will be consistent.

2.2.4 Memory and CPU

Checkpointing is implemented above Xen’s existing machinery for performing live migration [22]. Live migration is a technique by which a virtual machine is relocated to another physical host with only slight interruption in service. To accomplish this, memory is copied to the new location while the VM continues to run at the old location. During migration, writes to memory are intercepted, and dirtied pages are copied to the new location in rounds. After a specified number of intervals, or when no forward progress is being made because the virtual machine is writing to memory at least as fast as the migration process can copy it out, the guest is suspended and the remaining dirty memory is copied out along with the current CPU state. At this point the image on the new location is activated. Total downtime depends on the amount of memory remaining to be copied when the guest is suspended, but is typically under 100ms. Total migration time is a function of the amount of memory in use by the guest, and its writable working set [22], which is the set of pages changed repeatedly during guest execution.

Xen provides the ability to track guest writes to memory using a mechanism called shadow page tables. When this mode of operation is enabled, the VMM (virtual machine monitor) maintains a private (“shadow”) version of the guest’s page tables and exposes these to the hardware MMU.
Page protection is used to trap guest access to its internal version of page tables, allowing the hypervisor to track updates, which are propagated to the shadow versions as appropriate.

For live migration, this technique is extended to transparently (to the guest) mark all VM memory as read-only. The hypervisor is then able to trap all writes that a VM makes to memory and maintain a map of pages that have been dirtied since the previous round. Each round, the migration process atomically reads and resets this map, and the iterative migration process involves chasing dirty pages until progress can no longer be made. As mentioned above, the live migration process eventually suspends execution of the VM and enters a final “stop-and-copy” round, where any remaining pages are transmitted and execution resumes on the destination host.

Remus implements checkpointing as repeated executions of the final stage of live migration: each epoch, the guest is paused while changed memory and CPU state is copied to a buffer. The guest then resumes execution on the current host, rather than on the destination. Several modifications to the migration process are required in order to provide sufficient performance and to ensure that a consistent image is always available at the remote location. These are described below.

Migration enhancements. In live migration, guest memory is iteratively copied over a number of rounds and may consume minutes of execution time; the brief service interruption caused by the singular stop-and-copy phase is not a significant overhead. This is not the case when capturing frequent VM checkpoints: every checkpoint is just the final stop-and-copy phase of migration, and so this represents a critical point of optimization in reducing checkpoint overheads.
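The read-and-reset dirty-map scan described above can be sketched as follows. This is a simplification in which memory is a list of page buffers and the dirty map a set of page numbers filled in by write traps; the real implementation operates on a bitmap shared with the hypervisor:

```python
def copy_dirty_pages(memory, dirty_map):
    """Capture only pages written since the last round, then reset the map.

    memory    -- list of bytearrays, one per guest page frame
    dirty_map -- set of page frame numbers dirtied by trapped writes
    """
    snapshot = {}
    for pfn in sorted(dirty_map):
        snapshot[pfn] = bytes(memory[pfn])  # copy-out while the guest is paused
    dirty_map.clear()                       # read-and-reset (atomic in the VMM)
    return snapshot
```

At high checkpoint frequencies most pages are clean between rounds, so scanning and copying only the dirty set is what keeps each pause brief.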
An examination of Xen’s checkpoint code revealed that the majority of the time spent while the guest is in the suspended state is lost to scheduling, largely due to inefficiencies in the implementation of the xenstore daemon that provides administrative communication between guest virtual machines and domain 0.

Remus optimizes checkpoint signaling in two ways: First, it reduces the number of inter-process requests required to suspend and resume the guest domain. Second, it entirely removes xenstore from the suspend/resume process. In the original code, when the migration process desired to suspend a VM it sent a message to xend, the VM management daemon. Xend in turn wrote a message to xenstore, which alerted the guest by an event channel (a virtual interrupt) that it should suspend execution. The guest’s final act before suspending was to make a hypercall which descheduled the domain and caused Xen to send a notification to xenstore, which then sent an interrupt to xend, which finally returned control to the migration process. This convoluted process could take a nearly arbitrary amount of time — typical measured latency was in the range of 30 to 40ms, but we saw delays as long as 500ms in some cases.

(Paravirtual Xen guests contain code specifically for suspend requests that is responsible for cleaning up Xen-related state such as shared memory mappings used by virtual devices. In the case of hardware virtualized (e.g., Windows) VMs, this state is completely encapsulated by Xen’s device model, and these in-guest changes are unnecessary.)

Remus’s optimized suspend code streamlines this process by creating an event channel in the guest specifically for receiving suspend requests, which the migration process can invoke directly. Additionally, a new hypercall is provided to allow processes to register an event channel for callbacks notifying them of the completion of VM suspension.
In concert, these two notification mechanisms reduce the time required to suspend a VM to about one hundred microseconds – an improvement of two orders of magnitude over the previous implementation.

In addition to these signaling changes, we have increased the efficiency of the memory copying process. First, we quickly filter out clean pages from the memory scan, because at high checkpoint frequencies most memory is unchanged between rounds. Second, we map the guest domain’s entire physical memory into the replication process when it begins, rather than mapping and unmapping dirty pages at every epoch — we found that mapping foreign pages took approximately the same time as copying them.

Checkpoint support. Providing checkpoint support in Xen required two primary changes to the existing suspend-to-disk and live migration code. First, support was added for resuming execution of a domain after it had been suspended; Xen previously did not allow “live checkpoints” and instead destroyed the VM after writing its state out. Second, the suspend program was converted from a one-shot procedure into a daemon process. This allows checkpoint rounds after the first to copy only newly-dirty memory.

Supporting resumption requires two basic changes. The first is a new hypercall to mark the domain as schedulable again (Xen removes suspended domains from scheduling consideration, because previously they were always destroyed after their state had been replicated). A similar operation is necessary in order to re-arm watches in xenstore.

Asynchronous transmission. To allow the guest to resume operation as quickly as possible, the migration process was modified to copy touched pages to a staging buffer rather than delivering them directly to the network while the domain is paused. This results in a significant throughput increase: the time required for the kernel build benchmark discussed in Section 2.3.3 was reduced by approximately 10% at 20 checkpoints per second.

Guest modifications.
As discussed above, paravirtual guests in Xen contain a suspend handler that cleans up device state upon receipt of a suspend request. In addition to the notification optimizations described earlier in this section, the suspend request handler has also been modified to reduce the amount of work done prior to suspension. In the original code, suspension entailed disconnecting all devices and unplugging all but one CPU. This work was deferred until the domain was restored on the other host. These modifications are available in Xen as of version 3.1.0. These changes are not strictly required for correctness, but they improve the performance of the checkpoint considerably, and involve very local modifications to the guest kernel. Total changes were under 100 lines of code in the paravirtual suspend handler. As mentioned earlier, these modifications are not necessary in the case of non-paravirtualized VMs.

2.2.5 Network Buffering

Most networks cannot be counted on for reliable data delivery. Therefore, networked applications must either accept packet loss, duplication and reordering, or use a high-level protocol such as TCP which provides stronger service guarantees. This fact simplifies the network buffering problem considerably: transmitted packets do not require replication, since their loss will appear as a transient network failure and will not affect the correctness of the protected state. However, it is crucial that packets queued for transmission be held until the checkpointed state of the epoch in which they were generated is committed to the backup; if the primary fails, these generated packets reflect speculative state that has been lost.

Figure 2.3 depicts the mechanism by which we prevent the release of speculative network state.
Inbound traffic is delivered to the protected host immediately, but outbound packets generated since the previous checkpoint are queued until the current state has been checkpointed and that checkpoint has been acknowledged by the backup site. We have implemented this buffer as a linux queuing discipline applied to the guest domain’s network device in domain 0, which responds to two RTnetlink messages. Before the guest is allowed to resume execution after a checkpoint, the network buffer receives a CHECKPOINT message, which causes it to insert a barrier into the outbound queue, preventing any subsequent packets from being released until a corresponding release message is received. When a guest checkpoint has been acknowledged by the backup, the buffer receives a RELEASE message, at which point it begins dequeueing traffic up to the barrier.

There are two minor wrinkles in this implementation. The first is that in linux, queueing disciplines only operate on outgoing traffic. Under Xen, guest network interfaces consist of a frontend device in the guest, and a corresponding backend device in domain 0. Outbound traffic from the guest appears as inbound traffic on the backend device in domain 0. Therefore, in order to queue the traffic, we convert the inbound traffic to outbound by routing it through a special device called an intermediate queueing device [57]. This module is designed to work at the IP layer via iptables [76], but it was not difficult to extend it to work at the bridging layer we use to provide VM network access in our implementation.

The second wrinkle is due to the implementation of the Xen virtual network device. For performance, the memory used by outbound networking traffic is not copied between guest domains and domain 0, but shared. However, only a small number of pages may be shared at any one time. If messages are in transit between a guest and domain 0 for only a brief time, this limitation is not noticeable.
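The CHECKPOINT/RELEASE behaviour of the queuing discipline can be sketched as a FIFO with barrier markers. This is an illustrative model of the semantics only, not the actual netlink-driven kernel module:

```python
from collections import deque

class BarrierQueue:
    """Outbound packet queue with checkpoint barriers.

    checkpoint() inserts a barrier fencing subsequently queued packets;
    release() dequeues traffic up to the oldest barrier for transmission.
    """
    _BARRIER = object()

    def __init__(self):
        self._q = deque()

    def enqueue(self, pkt):
        self._q.append(pkt)

    def checkpoint(self):
        # Called before the guest resumes execution after a checkpoint.
        self._q.append(self._BARRIER)

    def release(self):
        # Called once the backup acknowledges the checkpoint.
        out = []
        while self._q:
            item = self._q.popleft()
            if item is self._BARRIER:
                break
            out.append(item)
        return out
```

Packets queued after the barrier (generated in the new, speculative epoch) remain fenced until the next acknowledgment, so no speculative output ever escapes.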
Unfortunately, the network output buffer can result in messages being in flight for a significant amount of time, which results in the guest network device blocking after a very small amount of traffic has been sent. Therefore, when queueing messages, we first copy them into local memory and then release the local mappings to shared data.

Figure 2.3: Network Buffering in Remus

2.2.6 Disk Buffering

Disks present a rather different challenge than network interfaces, largely because they are expected to provide much stronger reliability guarantees. In particular, when a write has been acknowledged by a disk, an application (or file system) expects to be able to recover that data even in the event of a power failure immediately following the acknowledgment. While Remus is designed to recover from a single host failure, it must preserve crash consistency even if both hosts fail. Moreover, the goal of providing a general-purpose system precludes the use of expensive mirrored storage hardware designed for HA applications. Therefore Remus maintains a complete mirror of the active VM’s disks on the backup host. Prior to engaging the protection system, the current state of the disk on the primary is mirrored to the backup host. Once protection has been engaged, writes to persistent storage are tracked and checkpointed similarly to updates to memory. Figure 2.4 gives a high-level overview of the disk replication mechanism.

As with the memory replication subsystem described in Section 2.2.4, writes to disk from the active VM are treated as write-through: they are immediately applied to the primary disk image, and asynchronously mirrored to an in-memory buffer on the backup.
This approach provides two direct benefits. First, it ensures that the active disk image remains crash consistent at all times; in the case of both hosts failing, the active disk will reflect the crashed state of the externally visible VM at the time of failure (the externally visible VM resides on the primary host if the primary host has not failed or if the backup also fails before it has been activated; otherwise it resides on the backup). Second, writing directly to disk accurately accounts for the latency and throughput characteristics of the physical device. This obvious-seeming property is of considerable value: accurately characterizing disk responsiveness is a subtle problem, as we ourselves experienced in an earlier version of the disk buffer, which held write requests in memory on the primary VM until checkpoint commit. Such an approach either buffers writes, under-representing the time required to commit data to disk and allowing the speculating VM to race ahead in execution, or conservatively over-estimates write latencies, resulting in a loss of performance. Modeling disk access time is notoriously challenging [79], but our implementation avoids the problem by preserving direct feedback from the disk to its client VM.

At the time that the backup acknowledges that a checkpoint has been received, disk updates reside completely in memory. No on-disk state may be changed until the entire checkpoint has been received, as this would prevent the backup from rolling back to the most recent complete checkpoint. Once the checkpoint is acknowledged, the disk request buffer may be applied to disk. In the event of a failure, Remus will wait until all buffered writes have been applied before resuming execution.
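The backup side of this protocol—buffering mirrored writes in memory and applying them only once the corresponding checkpoint is acknowledged—can be sketched as follows. The interfaces here are hypothetical, not those of the actual block tap module:

```python
class BackupDiskBuffer:
    """In-memory buffer for disk writes mirrored from the primary."""

    def __init__(self):
        self.disk = {}       # block number -> data: the backup disk image
        self.pending = []    # mirrored writes awaiting checkpoint commit

    def mirror_write(self, block, data):
        # Writes arrive asynchronously from the primary; they are held in
        # RAM so the on-disk image never reflects a partial epoch.
        self.pending.append((block, data))

    def checkpoint_acknowledged(self):
        # The memory checkpoint for this epoch has fully arrived, so the
        # buffered writes may now be flushed to the backup disk.
        for block, data in self.pending:
            self.disk[block] = data
        self.pending.clear()

    def failover(self):
        # Primary failed mid-epoch: discard writes belonging to the
        # incomplete checkpoint, leaving the last committed disk state.
        self.pending.clear()
```

Discarding pending writes on failover is what allows the backup to roll back cleanly to the most recent complete checkpoint.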
Although the backup could begin execution immediately using the request buffer as an overlay on the physical disk, this would violate the disk semantics presented to the protected VM: if the backup fails after activation but before data is completely flushed to disk, its on-disk state might not be crash consistent.

Only one of the two disk mirrors managed by Remus is actually valid at any given time. This point is critical in recovering from multi-host crashes. This property is achieved by the use of an activation record on the backup disk, which is written after the most recent disk buffer has been completely flushed to disk and before the backup VM begins execution. In recovering from multiple host failures, this record may be used to identify the valid, crash-consistent version of the disk.

The disk buffer is implemented as a Xen block tap module [91]. The block tap is a device which allows a process in the privileged domain to efficiently interpose itself between the frontend disk device presented to a guest VM and the backend device which actually services requests. The buffer module logs disk write requests from the protected VM and mirrors them to a corresponding module on the backup, which executes the checkpoint protocol described above and then removes itself from the disk request path before the backup begins execution in the case of failure at the primary.

Figure 2.4: Disk Write Buffering in Remus. (1) Disk writes are issued directly to the local disk and (2) simultaneously sent to the backup buffer; (3) writes are released to disk after the checkpoint.

2.2.7 Detecting Failure

Remus’s focus is on demonstrating that it is possible to provide advanced high availability in a general and transparent way, using commodity hardware and without modifying the protected applications.
We currently use a simple failure detector that is directly integrated in the checkpointing stream: a timeout of the backup responding to commit requests will result in the primary assuming that the backup has crashed and disabling protection. Similarly, a timeout of new checkpoints being transmitted from the primary will result in the backup assuming that the primary has crashed and resuming execution from the most recent checkpoint.

The system is configured to use a pair of bonded network interfaces, and the two physical hosts are connected using a pair of Ethernet crossover cables (or independent switches) on the protection NICs (network interface controllers). Should both of these network paths fail, Remus does not currently provide a mechanism to fence execution. Traditional techniques for resolving partitioning (i.e., quorum protocols) are notoriously difficult to apply in two-host configurations. We feel that in this case, we have designed Remus to the edge of what is possible with commodity hardware.

2.3 Evaluation

Remus has been designed with the primary objective of making high availability sufficiently generic and transparent that it may be deployed on today’s commodity hardware. In this section, we characterize the overheads resulting from our approach for a variety of different workloads, in order to answer two questions: (1) Is this system practically deployable? (2) What kinds of workloads are most amenable to our approach?

Before measuring the performance impact, we must establish that the system functions correctly. We accomplish this by injecting network failures at each phase of the replication protocol, while putting substantial disk, network and CPU load on the protected system.
We find that the backup takes over for the lost primary within approximately one second in every case, preserving all externally visible state, including active network connections.

We then evaluate the overhead of the system on application performance across very different workloads. We find that a general-purpose task such as kernel compilation incurs approximately a 50% performance penalty when checkpointed 20 times per second, while network-dependent workloads as represented by SPECweb perform at somewhat more than one quarter of native speed. The additional overhead in this case is largely due to output-commit delay on the network interface.

Based on this analysis, we conclude that although Remus is efficient at state replication, it does introduce significant network delay, particularly for applications that exhibit poor locality in memory writes. Thus, applications that are very sensitive to network latency may not be well suited to this type of high availability service (although there are a number of optimizations which have the potential to noticeably reduce network delay, some of which we discuss in more detail following the benchmark results). We feel that we have been conservative in our evaluation, using benchmark-driven workloads which are significantly more intensive than would be expected in a typical virtualized system; the consolidation opportunities such an environment presents are particularly attractive because system load is variable.

2.3.1 Test Environment

Unless otherwise stated, all tests were run on IBM eServer x306 servers, consisting of one 3.2 GHz (gigahertz) Pentium 4 processor with hyperthreading enabled, 1 GB (gigabyte) of RAM, 3 Intel e1000 GbE (gigabit ethernet) network interfaces, and an 80 GB SATA (serial AT attachment) hard drive. The hypervisor was Xen 3.1.2, modified as described in Section 2.2.4, and the operating system for all virtual machines was linux 2.6.18 as distributed in Xen 3.1.2, with the modifications described in Section 2.2.4.
The protected VM was allocated 512 MB (megabytes) of total RAM. To minimize scheduling effects from the VMM, domain 0’s VCPU (virtual central processing unit) was pinned to the first hyperthread. One physical network interface was bridged to the guest virtual interface and used for application traffic, one was used for administrative access, and the last was used for replication (we did not bond interfaces for replication, but this is immaterial to the tests we performed). Virtual disks were provided by disk images on the SATA drive, exported to the guest using the tapdisk AIO driver.

2.3.2 Correctness Verification

As discussed in Section 2.2.2, Remus’s replication protocol operates in four distinct phases: (1) checkpoint changed state and increment the epoch of network and disk request streams, (2) replicate system state, (3) when the complete memory checkpoint and corresponding set of disk requests has been received, send a checkpoint acknowledgement from the backup, and (4) upon receipt of the acknowledgement, release outbound network packets queued during the previous epoch. To verify that our system functions as intended, we tested deliberately induced network failure at each stage. For each test, the protected system executed a kernel compilation process in order to generate disk, memory and CPU load. To verify the network buffer, we simultaneously executed a graphics-intensive X11 client (glxgears) attached to an external X11 server. Remus was configured to take checkpoints every 25 milliseconds throughout. Each individual test was repeated twice.

Figure 2.5: Checkpoint Time Relative to Pages Dirtied

At every failure point, the backup successfully took over execution of the protected system, with only minor network delay (about one second) noticeable while the backup detected the failure and activated the replicated system.
The glxgears client continued to run after a brief pause, and the kernel compilation task continued to successful completion. We then gracefully shut down the VM and executed a forced file system check on the backup disk image, which reported no inconsistencies.

2.3.3 Benchmarks

In the following section, we evaluate the performance of our system using a variety of macrobenchmarks which are meant to be representative of a range of real-world workload mixtures. The primary workloads we run are a kernel compilation test, the SPECweb2005 benchmark, and the Postmark disk benchmark. Kernel compilation is a balanced workload which stresses the virtual memory system, the disk and the CPU, SPECweb primarily exercises networking performance and memory throughput, and Postmark focuses on disk performance.

To better understand the following measurements, we performed a microbenchmark measuring the time spent copying guest state (while the guest was suspended) and the time spent sending that data to the backup relative to the number of pages changed since the previous checkpoint. We wrote an application to repeatedly change the first byte of a set number of pages and measured times over 1000 iterations. Figure 2.5 presents the average, minimum and maximum recorded times spent in the checkpoint and replication stages, within a 95% confidence interval. It shows that the bottleneck for checkpoint frequency is replication time.

Kernel compilation. The kernel compile test measures the wall-clock time required to build linux kernel version 2.6.18 using the default configuration and the bzImage target. Compilation uses GCC version 4.1.2, and make version 3.81. This is a balanced workload that tests CPU, memory and disk performance.

Figure 2.6 shows protection overhead when configured to checkpoint at rates of 10, 20, 30 and 40 times per second, compared to a baseline compilation in an unprotected virtual machine.
Total measured overhead at each of these frequencies was 31%, 52%, 80% and 103%, respectively. Overhead scales linearly with checkpoint frequency within the rates we tested. We believe that the overhead measured in this set of tests is reasonable for a general-purpose system, even at a checkpoint frequency of 40 times per second.

[Figure 2.6: Kernel Build Time by Checkpoint Frequency]

SPECweb2005. The SPECweb benchmark is composed of at least three separate systems: a web server, an application server, and one or more web client simulators. We configure these as three VMs on distinct physical machines. The application server and the client are configured with 640 MB out of 1024 MB total available RAM. The web server and backup are provisioned with 2048 MB of RAM, of which 1024 is allocated to the web server VM, which is the system under test. The SPECweb scores we mention in this section are the highest results we achieved with the SPECweb "e-commerce" test maintaining 95% "good" and 99% "tolerable" times.

Figure 2.7 shows SPECweb performance at various checkpoint frequencies relative to an unprotected server. These scores are primarily a function of the delay imposed by the network buffer between the server and the client. Although they are configured for a range of frequencies, SPECweb touches memory rapidly enough that the time required to propagate the memory dirtied between checkpoints sometimes exceeds 100ms, regardless of checkpoint frequency. Because the network buffer cannot be released until checkpointed state has been acknowledged, the effective network delay can be higher than the configured checkpoint interval.
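This relationship can be made concrete with a small model (our illustration, not part of Remus; the function name is invented): the effective epoch length, and hence the delay experienced by buffered packets, is bounded below by both the configured interval and the time needed to replicate the dirtied state.

```python
def effective_epoch_ms(configured_ms: float, replication_ms: float) -> float:
    """Packets queued during an epoch cannot be released until that
    epoch's checkpoint has been replicated and acknowledged, so the
    effective epoch length is the longer of the configured interval
    and the state-transfer time."""
    return max(configured_ms, replication_ms)

# A 25 ms epoch whose dirtied memory takes 100 ms to transfer behaves,
# from the network's point of view, like a 100 ms epoch.
delay = effective_epoch_ms(25, 100)
```

Only when replication fits within the configured interval does the interval itself bound the delay.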
Remus does ensure that the VM is suspended at the start of every epoch, but it cannot currently ensure that the total amount of state to be replicated per epoch does not exceed the bandwidth available during the configured epoch length. Because the effective checkpoint frequency is lower than the configured rate, and network latency dominates the SPECweb score, performance is relatively flat across the range of configured frequencies. At configured rates of 10, 20, 30 and 40 checkpoints per second, the average checkpoint rates achieved were 9.98, 16.38, 20.25 and 23.34 respectively, or average latencies of 100ms, 61ms, 49ms and 43ms respectively.

[Figure 2.7: SPECweb Scores by Checkpoint Frequency (Native Score: 305)]

SPECweb is a RAM-hungry workload which is also very sensitive to network latency. This makes it a poor fit for our current implementation, which trades network delay for memory throughput. Figure 2.8 demonstrates the dramatic effect delay between the client VM and the web server has on SPECweb. We used the Linux netem [64] queueing discipline to add varying degrees of delay to the outbound link from the web server (virtualized but not running under Remus). For comparison, Figure 2.7 also shows protection overhead when network buffering is disabled, to better isolate network latency from other forms of checkpoint overhead (again, the flat profile is due to the effective checkpoint rate falling short of the configured rate). Deadline scheduling and page compression, discussed in Section 2.3.4, are two possible techniques for reducing checkpoint latency and transmission time. Either or both would reduce checkpoint latency, and therefore be likely to increase SPECweb performance considerably.

Postmark. The previous sections characterize network and memory performance under protection, but the benchmarks used put only moderate load on the disk subsystem.
In order to better understand the effects of the disk buffering mechanism, we ran the Postmark disk benchmark (version 1.51). This benchmark is sensitive to both throughput and disk response time. To isolate the cost of disk replication, we did not engage memory or network protection during these tests. Configuration was identical to an unprotected system, with the exception that the virtual disk was provided by the tapdisk replication module.

[Figure 2.8: The Effect of Network Delay on SPECweb Performance]

Figure 2.9 shows the total time required to perform 10000 postmark transactions with no disk replication, and with a replicated disk committing at frequencies of 10, 20, 30 and 40 times per second. Each run achieved about 3.5 MBps in reads and 7.5 MBps in writes. The results indicate that replication has no significant impact on disk performance.

[Figure 2.9: The Effect of Disk Replication on Postmark Performance]

2.3.4 Potential Optimizations

Although we believe the performance overheads shown earlier in this section are reasonable for what they provide, we are eager to reduce them further, particularly for latency-sensitive workloads. In addition to more careful tuning of the existing code, we believe the following techniques have the potential to greatly increase performance.

Deadline scheduling. The amount of time required to perform a checkpoint is currently variable, depending on the amount of memory to be copied. Although Remus ensures that the protected VM is suspended at the start of each epoch, it currently makes no attempt to control the amount of state which may change between epochs.
To provide stricter scheduling guarantees, the rate at which the guest operates could be deliberately slowed [40] between checkpoints, depending on the number of pages dirtied. Applications which prioritize latency over throughput, such as those modeled by the SPECweb benchmark discussed in Section 2.3.3, may enable this throttling for improved performance. To perform such an operation, the shadow page table handler could be extended to invoke a callback when the number of dirty pages exceeds some high water mark, or it may be configured to pause the virtual machine directly.

Page compression. It has been observed that disk writes typically only alter 5–20% of a data block [99]. If a similar property holds for RAM, we may exploit it in order to reduce the amount of state requiring replication, by sending only the delta from a previous transmission of the same page.

To evaluate the potential benefits of compressing the replication stream, we have prototyped a basic compression engine. Before transmitting a page, this system checks for its presence in an address-indexed LRU (least recently used) cache of previously transmitted pages. On a cache hit, the page is XORed (exclusive ored) with the previous version and the differences are run-length encoded. This provides significant compression when page writes do not change the majority of the page. Although this is true for much of the data stream, there remains a significant fraction of pages that have been modified to the point where XOR compression is not effective.

[Figure 2.10: Comparison of Bandwidth Requirements Using Various Compression Schemes]
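The XOR-and-run-length step can be sketched as follows. This is our simplified reconstruction for illustration, not the actual Remus engine; the tag format and function names are invented for this example.

```python
PAGE_SIZE = 4096

def xor_rle_encode(page: bytes, cached: bytes) -> bytes:
    """XOR a page against the previously transmitted copy, then
    run-length encode the delta: unchanged regions become zero runs
    (tag 0x00 + 16-bit length), changed regions are sent literally
    (tag 0x01 + 16-bit length + literal bytes)."""
    delta = bytes(a ^ b for a, b in zip(page, cached))
    out, i = bytearray(), 0
    while i < len(delta):
        j = i
        if delta[i] == 0:
            while j < len(delta) and delta[j] == 0:
                j += 1
            out += b"\x00" + (j - i).to_bytes(2, "big")
        else:
            while j < len(delta) and delta[j] != 0:
                j += 1
            out += b"\x01" + (j - i).to_bytes(2, "big") + delta[i:j]
        i = j
    return bytes(out)

def xor_rle_decode(encoded: bytes, cached: bytes) -> bytes:
    """Reverse the encoding: rebuild the delta, then XOR it against
    the cached copy to recover the original page."""
    delta, i = bytearray(), 0
    while i < len(encoded):
        tag, n = encoded[i], int.from_bytes(encoded[i + 1:i + 3], "big")
        i += 3
        if tag == 0:
            delta += bytes(n)
        else:
            delta += encoded[i:i + n]
            i += n
    return bytes(a ^ b for a, b in zip(delta, cached))

# A page with a single small modification collapses to a few bytes.
cached = bytes(PAGE_SIZE)
page = bytearray(cached)
page[0:8] = b"modified"
encoded = xor_rle_encode(bytes(page), cached)
```

When most of a page is unchanged, the zero runs dominate and the encoded form is far smaller than the page itself; a heavily rewritten page, by contrast, degenerates to near its original size.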
In these cases, a general-purpose algorithm such as that used by gzip may achieve a higher degree of compression.

We found that by using a hybrid approach, in which each page is preferentially XOR-compressed, but falls back to gzip compression if the XOR compression ratio falls below 5:1 or the previous page is not present in the cache, we could observe a typical compression ratio of 10:1 on the replication stream. Figure 2.10 shows the bandwidth consumed in MBps (megabytes per second) for a 60-second period of the kernel compilation benchmark described in Section 2.3.3. The cache size was 8192 pages and the average cache hit rate was 99%.

Compressing the replication stream can consume additional memory and CPU resources on the replicating host, but lightweight schemes such as the XOR compression technique should pay for themselves through the reduction in bandwidth required for replication and consequent reduction in network buffering delay.

Copy-on-write checkpoints. The current implementation pauses the domain at each checkpoint for an amount of time linear in the number of pages which have been dirtied since the last checkpoint. This overhead could be mitigated by marking dirty pages as copy-on-write and resuming the domain immediately. This would reduce the time during which the domain must be paused to a fixed small cost proportional to the total amount of RAM available to the guest. We intend to implement COW (copy on write) by supplying the Xen shadow paging system with a userspace-mapped buffer into which it could copy touched pages before restoring read-write access. The replication process could then extract any pages marked as copied from the COW buffer instead of reading them directly from the guest. When it had finished replicating pages, their space in the buffer could be marked for reuse by the Xen COW module.
If the buffer were to become full, the guest could simply be paused, resulting in a graceful degradation of service from COW to stop-and-copy operation.

The march of time. It is also likely that technological trends will reduce the cost of replication naturally. When this project was first developed, we used 1 gigabit NICs for replication, but now 100 gigabit NICs are available. Over that same period, CPU speed and RAM bandwidth have increased only moderately. As the ratio between network and RAM bandwidth increases, there is less need to coalesce overwrites and checkpoints can be taken more frequently. If hardware virtualization were also to grow support for checkpointing, it would likely make it much cheaper to capture local checkpoints than what has been possible so far in software. Coupling hardware checkpoints with fast kernel or hardware-assisted network transmission through mechanisms such as RDMA (remote direct memory access) would likely substantially reduce the overhead of the Remus approach to state replication, enabling it to be used for more performance-critical applications.

2.4 Related Work

State replication may be performed at several levels, each of which balances efficiency and generality differently. At the lowest level, hardware-based replication is potentially the most robust solution. Hardware, however, is much more expensive to develop than software and thus hardware replication is at a significant economic disadvantage. Replication at the virtualization layer has many of the advantages of the hardware approach, but comes at lower cost because it is implemented in software. Like hardware, however, the virtualization layer has no semantic understanding of the operating-system and application state it replicates. As a result it can be less flexible than process checkpointing in the operating system, in application libraries or in applications themselves, because it must replicate the entire system instead of individual processes.
It can also be less efficient, because it may replicate unnecessary state. The challenge for these higher-level approaches, however, is that interdependencies among state elements that comprise a checkpoint are insidiously difficult to identify and untangle from the rest of the system and thus these checkpointing mechanisms are significantly more complex than checkpointing in the virtualization layer.

Virtual machine migration. As described earlier, Remus is built on top of the Xen support for live migration [22], extended significantly to support frequent, remote checkpointing. Bradford et al. extended Xen's live migration support in another direction: migrating persistent state along with the migrating guest so that it can be restarted on a remote node that does not share network storage with the originating system [11].

Like Remus, other projects have used virtual machines to provide high availability. The closest to our work is Bressoud and Schneider's [12]. They use the virtual machine monitor to forward the input events seen by a primary system to a backup system where they are deterministically replayed to replicate the primary's state. Deterministic replay requires much stricter constraints on the target architecture than simple virtualization and it requires an architecture-specific implementation in the VMM.

Another significant drawback of deterministic replay as exemplified by Bressoud and Schneider's work is that it does not easily extend to multi-core CPUs. The problem is that it is necessary, but difficult, to determine the order in which cores access shared memory. There have been some attempts to address this problem. For example, Flight Data Recorder [97] is a hardware module that sniffs cache coherency traffic in order to record the order in which multiple processors access shared memory. Similarly, Dunlap introduces a software approach in which the CREW protocol (concurrent read, exclusive write) is imposed on shared memory via page protection [30].
While these approaches do make SMP deterministic replay possible, it is not clear if they make it feasible due to their high overhead, which increases at least linearly with the degree of concurrency. Our work sidesteps this problem entirely because it does not require deterministic replay.

Virtual machine logging and replay. Virtual machine logging has been used for purposes other than high availability. For example, in ReVirt [31], virtualization is used to provide a secure layer for logging state changes in the target system in order to provide better forensic evidence for intrusion detection systems. The replayed system is a read-only copy of the original system, which is not meant to be run except in order to recreate the events involved in a system compromise. Logging has also been used to build a time-travelling debugger [49] that, like ReVirt, replays the system for forensics only.

Operating system replication. There are many operating systems, such as Accent [74], Amoeba [63], MOSIX [9] and Sprite [70], which support process migration, mainly for load balancing. The main challenge with using process migration for failure recovery is that migrated processes typically leave residual dependencies to the system from which they were migrated. Eliminating these dependencies is necessary to tolerate the failure of the primary host, but the solution is elusive due to the complexity of the system and the structure of these dependencies.

Some attempts have been made to replicate applications at the operating system level. Zap [68] attempts to introduce a virtualization layer within the linux kernel. This approach must be rebuilt for every operating system, and carefully maintained across versions.

Library approaches. Some application libraries provide support for process migration and checkpointing. This support is commonly for parallel application frameworks such as CoCheck [82].
Typically process migration is used for load balancing and checkpointing is used to recover an entire distributed application in the event of failure.

Replicated storage. There has also been a large amount of work on checkpointable storage for disaster recovery as well as forensics. The Linux Logical Volume Manager [55] provides a limited form of copy-on-write snapshots of a block store. Parallax [92] significantly improves on this design by providing limitless lightweight copy-on-write snapshots at the block level. The Andrew File System [44] allows one snapshot at a time to exist for a given volume. Other approaches include RSnapshot, which runs on top of a file system to create snapshots via a series of hardlinks, and a wide variety of backup software. DRBD [75] is a software abstraction over a block device which transparently replicates it to another server.

Speculative execution. Using speculative execution to isolate I/O processing from computation has been explored by other systems. In particular, SpecNFS [65] and Rethink the Sync [67] use speculation in a manner similar to us in order to make I/O processing asynchronous. Remus is different from these systems in that the semantics of block I/O from the guest remain entirely unchanged: they are applied immediately to the local physical disk. Instead, our system buffers generated network traffic to isolate the externally visible effects of speculative execution until the associated state has been completely replicated.

2.5 Future Work

This section briefly discusses a number of directions that we intend to explore in order to improve and extend Remus. As we have demonstrated in the previous section, the overhead imposed by our high availability service is not unreasonable. However, the implementation described in this thesis is quite young. Several potential areas of optimization remain to be explored.
Upon completion of the targeted optimizations discussed in Section 2.3.4, we intend to investigate more general extensions such as those described below.

Introspection optimizations. Remus currently propagates more state than is strictly necessary. For example, buffer cache pages do not need to be replicated, since they can simply be read in from persistent storage on the backup. To leverage this, the virtual disk device could log the addresses of buffers provided to it for disk reads, along with the associated disk addresses. The memory-copying process could then skip over these pages if they had not been modified after the completion of the disk read. The remote end would be responsible for reissuing the reads from its copy of the disk in order to fill in the missing pages. For disk-heavy workloads, this should result in a substantial reduction in state propagation time.

Hardware virtualization support. Due to the lack of equipment supporting hardware virtualization in our laboratory at the time of development, we have only implemented support for paravirtualized guest virtual machines. But we have examined the code required to support fully virtualized environments, and the outlook is quite promising. In fact, it may be somewhat simpler than the paravirtual implementation due to the better encapsulation provided by virtualization-aware hardware.

Cluster replication. It would be useful to extend the system to protect multiple interconnected hosts. While each host can be protected independently, coordinated protection would make it possible for internal network communication to proceed without buffering. This has the potential to dramatically improve the throughput of distributed applications, including the three-tiered web application configuration prevalent in managed hosting environments.
Support for cluster replication could be provided by a distributed checkpointing protocol such as that which is described in our colleague Gang Peng's master's thesis [72], which used an early version of the checkpointing infrastructure provided by Remus.

Disaster recovery. Remus is a product of the SecondSite [27] project, whose aim was to provide geographically diverse mirrors of running systems in order to survive physical disaster. We are in the process of planning a multi-site deployment of Remus in order to experiment with this sort of configuration. In a long distance deployment, network delay will be an even larger concern. Additionally, network reconfigurations will be required to redirect Internet traffic accordingly.

Log-structured datacenters. We are extending Remus to capture and preserve the complete execution history of protected VMs, rather than just the most recent checkpoint. By mapping guest memory into Parallax [58], our virtual block store designed for high-frequency snapshots, we hope to be able to efficiently store large amounts of both persistent and transient state at very fine granularity. This data should be very useful in building advanced debugging and forensics tools. It may also provide a convenient mechanism for recovering from state corruption whether introduced by operator error or by malicious agents (viruses and so forth).

2.6 Conclusion

Remus is a novel system for retrofitting high availability onto existing software running on commodity hardware. The system uses virtualization to encapsulate a protected VM, and performs frequent whole-system checkpoints to asynchronously replicate the state of a single speculatively executing virtual machine.

Providing high availability is a challenging task and one that has traditionally required considerable cost and engineering effort. Remus commodifies high availability by presenting it as a service at the virtualization platform layer: HA may simply be "switched on" for specific virtual machines.
As with any HA system, protection does not come without a cost: the network buffering required to ensure consistent replication imposes a performance overhead on applications that require very low latency. Administrators must also deploy additional hardware, which may be used in N-to-1 configurations with a single backup protecting a number of active hosts. In exchange for this overhead, Remus completely eliminates the task of modifying individual applications in order to provide HA facilities, and it does so without requiring special-purpose hardware.

Remus represents a previously unexplored point in the design space of HA for modern servers. The system allows protection to be simply and dynamically provided to running VMs at the push of a button. We feel that this model is particularly attractive for hosting providers, who desire to offer differentiated services to customers.

Chapter 3

SecondSite: Disaster Tolerance as a Service

3.1 Introduction

Failures in the cloud are spectacular. In January 2010, Heroku, a Ruby-on-Rails application hosting company, experienced the complete failure of their 22 virtual machines hosted on Amazon EC2 (elastic compute cloud). This outage, which was reportedly the result of a router failure, affected the 44,000 applications hosted on the site [13]. In May 2010, Amazon experienced four EC2 outages in a single week, the last of which resulted from a car crashing into a utility pole, cutting power to a portion of the system [60]. Outages such as these are hardly limited to Amazon, who through the introduction of "availability zones" has taken steps to expose failure domains explicitly to customers so that applications may be designed to survive outages.
However, while techniques to expose fate sharing and exposure to risk, such as Amazon's availability zones, help their customers engineer systems to survive failures, they are no panacea.

In April 2011, a "networking event" in the Virginia facility triggered a cascading failure in Amazon's Elastic Block Store (EBS), and resulted in a serious outage across multiple availability zones. An undisclosed number of virtual machines, including the hosts of a number of large, popular web-based applications, became unavailable for the initial twelve hours of the outage. Complete service was not restored for five days. Amazon published a detailed postmortem of the event, explaining that the outage stemmed from a series of bugs and unanticipated behaviors in the failure recovery mechanisms for EBS [104]. One of the concluding observations made in the document is that for the highest degree of availability, applications should be protected across multiple regions: geographically separate data center facilities that have a much lower degree of fate sharing than availability zones within the same data center.

The EBS failure is illustrative of the fact that while cloud computing environments are well maintained and professionally administered systems, they are not immune to outages. The postmortem, released less than a week after the outage itself, represents a surprisingly thorough and forthcoming explanation of the chain of events leading to failure. It also identifies a set of changes, to both software and administrative procedures, that will be taken to avoid repeating this sort of outage in the future. However, the failure severely impacted customers, even those who were making efforts to engineer their systems to be resistant to outages within the hosting environment, and left them without service for as many as five days. As such, the Virginia outage also demonstrates that while rare, outages are happening, and they are happening at scale.
Even the skilled developers of mature Internet-based services face challenges in designing systems that handle these failures gracefully.

This chapter describes SecondSite, a high-availability and disaster tolerance service for virtual machines running in cloud environments. Based on the premise that implementing application- or OS-level fault tolerance is both difficult and expensive, we argue that facilities to enhance the availability of hosted software should be commodified, and offered as part of the hosting infrastructure. Just as spot markets [100] allow customers to receive a less reliable level of service at a cut rate, we argue that important workloads (and less sophisticated developers) would prefer a service that transparently provides a higher degree of availability. Using SecondSite, the owner of a hosted virtual machine may elect to have that VM's entire state continuously replicated to a backup image at an alternate geographic location.

                     Local Area (Remus)                Wide Area (SecondSite)
Replication Link     Bonded ethernet interfaces        Routed IP connectivity over WAN
Replication Latency  Hundreds of microseconds          Tens of milliseconds
Failure Detection    Complete link failure of          Quorum among external
                     replication channel               vantage points
Network Relocation   Ethernet MAC (e.g. via ARP)       IP address (e.g. via BGP)

Table 3.1: Differences Between Local and Wide-Area Networks

As with the Remus [26] system on which it is based, in the event of failure SecondSite allows the backup VM to take over in a completely transparent manner, requires no changes to OS or application code and exposes no failure to connected clients.

Failure Model

High-availability systems are typically complex: they require careful planning, complex integration with program logic, and ongoing maintenance as application code evolves over time. SecondSite makes two simplifying design decisions regarding its failure model that are intended to make it more practical for general purpose use.

Provide protection from fail-stop failures.
Like Remus before it, SecondSite provides protection from fail-stop failures, the typical case in large-scale outages and disasters. The system provides transparent, continuous replication of a running VM to a passive backup host at another location. While fail-stop clearly includes destructive events such as fires and power outages, we also promote network failures to be fail-stop through the use of a watchdog that automatically removes the active host from service in the case of network disconnection. This approach does not attempt to survive "wounded" applications experiencing partial failures as a result of overload, environmental problems, or bugs. However, it can help to maintain availability when a site is experiencing wide-area connectivity problems by transparently migrating to a better-connected backup during the network outage.

Protect the complete state of the running application. The failures for which SecondSite provides protection address concerns of both availability and durability. The established approach for disaster recovery in many environments today involves the asynchronous replication of storage. In the event of failure, storage-based disaster recovery systems (a) lose some amount of data and (b) require that the hosts reboot successfully from a crash-consistent image. In SecondSite, the disk image and complete running machine state are replicated, meaning that no data will be lost in the event of failure, and that reboots, with associated file system checks, are not required. This approach mitigates the risk associated with failures that may occur during restart on the backup host, such as problems checking the local (crash-consistent) file system.

In the next section, we characterize three challenges that we faced in evolving Remus to provide disaster tolerance as a service. The first of these concerns involves making replication work effectively over expensive and higher-latency wide-area links.
The remaining two concerns address the network-related challenges of detecting failure and restoring service in an Internet, as opposed to local-area, environment. Many of the individual techniques described here have been used in other systems: the contribution of this work is to describe the development of an active disaster tolerant testbed that is currently deployed between two university data centers, approximately 260 kilometers apart. In addition to the architecture of the system, we describe several of the more painful corner cases that we experienced during development and deployment, and provide details of a continuous regression test suite that we use to validate and demonstrate confidence in the system to others.

3.2 Challenges of Wide-Area Disaster Tolerance

Remus [26] provides transparent high availability for unmodified OS and application software running within virtual machines on the Xen [10] hypervisor. Remus achieves this by employing checkpoint-based fault tolerance: during normal operation, incremental checkpoints of the entire VM state on the primary host are replicated to a backup host. These checkpoints are performed at a very high frequency (20-40 checkpoints/second) to enable fine grained recovery. On failure of the primary host, the VM at the backup host resumes execution from the latest completed checkpoint. The failover process is totally transparent to clients: the backup VM has the same IP address as the primary VM and, just as with live migration [22], the backup host issues a gratuitous ARP to ensure network packets going to the failed VM are automatically redirected to the backup VM as soon as it becomes available.

Remus checkpoints capture the entire state of the primary VM, which includes disk, memory, CPU and network device state. Checkpoint replication and acknowledgement is carefully pipelined and overlapped with execution to maintain low overhead, while still preserving consistency.
The execution of the primary VM during each checkpoint interval is speculative until the interval has been checkpointed, since it will be lost if a failure occurs. Therefore, Remus uses output commit [83] to ensure that the external world sees a consistent view of the server's execution, despite failovers. More specifically, Remus queues and holds any outgoing network packets generated by the primary server until the completion of the next checkpoint, ensuring that unprotected speculative state is never exposed.

Output commit is also applied to isolate disk writes generated during a checkpoint interval. Writes received from the active server during an epoch are buffered and released to disk only after a checkpoint is complete. In the case of failure, Remus ensures that prior to resuming execution on the backup, all outstanding checkpointed writes have been written to disk and that any writes associated with speculative execution are discarded. This resembles external synchrony [67] in that it provides the same consistency as synchronous I/O to external viewers without blocking the system internally, but the implementation is much simpler: Remus does not track dependencies on storage I/O; it simply assumes that all output following an I/O is dependent.

3.2.1 Challenges of Wide-Area Networks

Remus is designed for a LAN (local area network) environment, and is intended to respond to the increased exposure to failure that results from consolidating many physical servers into virtual machines on a single host. In this environment, the failure of a single physical host carries elevated consequences, and Remus provided the ability to transparently recover from this class of failure by replicating state to a second physical server.

The environmental differences that face an availability system in moving from the local to the wide area are summarized in Table 3.1.
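The output-commit discipline described above can be sketched in executable form. This is a simplified model, not Remus code: the real buffering happens in the hypervisor's network and disk paths, and all names here are illustrative.

```python
# Minimal sketch of checkpoint-based output commit: externally visible
# output is held until the checkpoint that produced it is acknowledged.
class OutputCommitBuffer:
    def __init__(self):
        self.pending = []   # output produced in the current (speculative) epoch
        self.released = []  # output safe to expose to the external world

    def send(self, packet):
        # Output generated during an epoch is speculative: queue it.
        self.pending.append(packet)

    def checkpoint_committed(self):
        # The backup acknowledged the checkpoint; the state that produced
        # the queued output is now recoverable, so it may be released.
        self.released.extend(self.pending)
        self.pending.clear()

    def failover(self):
        # On failure, speculative output is discarded; the backup resumes
        # from the last committed checkpoint, which never emitted it.
        self.pending.clear()
```

The same structure applies to disk writes on the backup: buffered during the epoch, flushed on commit, discarded on failure.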
First, the high-bandwidth, low-latency network of the local area is replaced with a typically lower-bandwidth, and certainly higher-latency, IP (internet protocol) connection between sites. As a consequence, the system must be more frugal with replication traffic, as networking resources are constrained and often also quite expensive.

A second difference is that Remus aims to survive any single, fail-stop physical failure. In Remus, replication traffic travelled over a dedicated point-to-point Ethernet link from the primary to the backup. This meant that Remus could survive the failure of any single physical component, even the loss of a replication NIC. In the wide area, the link between the active and backup hosts is often an IP route and may be lost even though the rest of the network remains available, requiring more careful consideration of failure notification. Finally, address relocation is considerably more challenging as a result of the need to alter upstream routing. The remaining subsections discuss each of these three issues in more detail.

3.2.2 Bandwidth-Efficient Replication

Continuously replicating virtual machine checkpoints can require a lot of bandwidth. Depending on the frequency of checkpoint replication and the amount of state changed since the previous checkpoint, replication can easily consume up to 600 Mbps for memory-intensive workloads. While a 1 Gbps link supports this load, the cost of wide-area networking makes site-wide replication impractical at this rate. Furthermore, having multiple high-bandwidth streams share a single link can increase congestion and reduce the effective bandwidth of the link.

In previous work on optimizing Remus for databases [61], we faced similar challenges: database workloads can generate very large checkpoints, to the point that even fast LAN links may become performance bottlenecks. We developed three main techniques to reduce checkpoint size and latency: Commit Protection relaxes output buffering of some network messages.
Read Tracking avoids replicating changes to the contents of memory where that state has previously been read from disk. Finally, Checkpoint Compression saves bandwidth by performing online compression of the replication stream.

3.2.3 Failure Detection

To ensure correct execution, it is very important that only one of the replicas is active at once. When the replicas lose contact with each other, we must ensure that the active replica has failed (or become globally unreachable). Remus deals with the problem of distinguishing between system failure and network partition by bonding together two physical NICs for the replication link, so that it will continue to operate even if a single NIC fails. This does not suffice across a WAN, where the connection between the primary and the backup is not necessarily under the control of the high-availability service provider.

Differentiating between network partition and host failure using only two hosts is not easy, and we do not attempt to do it. Instead, we rely on an arbitration service residing outside of the protected network. If the primary and backup hosts lose contact with each other, they attempt to contact the arbitrator to decide which of them should continue running. They will terminate if instructed to do so by the arbitrator, or if they cannot reach the arbitrator.

3.2.4 Failure Recovery

When a virtual machine is relocated, network traffic between it and its clients must take a different path. The transition between locations must have two properties to be effective. First, it must be comprehensive: as the VM is being recovered in a geographically distant location, traffic from all clients must be appropriately redirected to the new VM instance. Second, in order to maintain availability, it must also complete in a timely manner.
In the LAN environment, these properties are easy to achieve: as both primary and backup hosts are on a common Ethernet network, relocation is not exposed to the IP layer beyond the use of ARP.

Unlike the LAN environment, wide-area migrations involve BGP updates to redirect IP traffic from one site to another. BGP convergence times are challenging to reason about, and slow BGP convergence may result in considerably longer periods of unreachability for the backup host. Rather than attempting to tackle the general issue of convergence delays related to IP address migration in BGP, our approach uses a specific, but realistic, BGP configuration to provide failover: we configure the protected network on the active and backup sites in the same manner that dual-homed IP networks are configured to survive link or ISP (internet service provider) failures. Our network configuration can be thought of as a dual-homed setup in which not only the backup link is redundant, but also the servers that it connects to.

3.3 SecondSite Design

SecondSite builds upon Remus and provides the same transparency and consistency guarantees. In a typical deployment, SecondSite runs on each physical host on the primary site, replicating all virtual machines (VMs) running on that host across a WAN to a host on the backup site. Each VM is checkpointed independently.

When the primary site fails, the backup site performs the required network reconfiguration (e.g., BGP updates) and resumes the execution of all the protected virtual machines from their last consistent checkpoint. The failure and subsequent migration to a remote location is completely transparent to the VMs.

3.3.1 Reducing Replication Overhead

RemusDB [61] introduced a set of optimizations to address the large overheads of checkpoint-based HA for database applications. These optimizations addressed both latency and replication bandwidth, and thus can be very beneficial for disaster recovery.
The following were the main optimizations introduced by RemusDB:

• Checkpoint compression. Using Page Delta compression [84, 96] and an LRU cache of previously dirtied pages, this technique reduces the size of a checkpoint dramatically. The observation here was that updates to memory were sparse but spread over a large working set.

• Disk read tracking. Since the disk is kept synchronized by Remus during replication, a page read from disk by the guest VM need not be included in the checkpoint as long as it is unmodified during the checkpoint. At the backup host, these pages are periodically read from disk into the backup VM's memory.

• Commit protection. Applications can dynamically override output commit through a setsockopt() option. The ability to dynamically switch a connection between buffered and unbuffered states allows transactional workloads like OLTP to leverage the consistency guarantees of Remus without incurring the latency overhead introduced by output buffering [83].

We have only used one of these techniques (checkpoint compression) in this work. The read tracking optimization was developed against the Remus disk replication module, which does not support resynchronization after failure. Preliminary investigation indicated that it would not produce large improvements for the workloads we examined (discussed in Section 3.4), and so we have not yet ported support to our new disk synchronization layer.

Commit protection can reduce client-perceived latency dramatically, but it requires understanding of the protected application and cannot simply be switched on. As a platform-level service, SecondSite aims for transparency for both the protected VMs and the service provider, and so we have only evaluated techniques that conform to that goal.

3.3.2 Failure Detection in SecondSite

In SecondSite, we only consider fail-stop failures.
A node may suspect failure if the replication stream times out, but the failure of the replication link does not prove that the remote node itself has failed, or even that it has become unreachable to its clients. Incorrectly classifying link failure as node failure can lead to a split-brain scenario, where both primary and backup become active at the same time. Conversely, if a node is fenced when it becomes unreachable over the replication link, then the failure of that component will be promoted to complete availability loss. Failure detection in distributed systems is a richly researched topic, with many possible approaches but no single perfect solution. Possible approaches include unreliable failure detectors [20], minimal synchrony [29] and partial synchrony [33]. Several variants of unreliable failure detectors have been proposed, such as heartbeats [2, 3], adaptive timeouts [35], gossip-based detection [88], and so on. Such approaches assume partition tolerance, which SecondSite does not require.

SecondSite's failure detector requires the following properties:

• Quick detection. The failure detection logic contributes to the service downtime, as network packets from VMs are not released until a decision is made.

• No false positives. Incorrectly classifying a live site as dead leads to a split-brain situation and loss of data consistency.

• Conservative failover. Network congestion can produce transient communication delays. Aggressively classifying such delays as failures and triggering recovery is expensive and could lead to reduced availability.

Figure 3.1: State Transitions of a Quorum Node

Quorum-based failure detection techniques can be used to achieve these requirements. There are several variants of the quorum technique. One simple technique is to use a shared quorum disk, where both nodes try to atomically reserve a disk partition (e.g. SCSI Reserve & Release). Such a solution is infeasible over a wide-area network.
Another commonly used solution in WAN environments [105] is the ping-based approach. When the replication channel is broken, the backup site attempts to ping the primary site over a different network address. If the ping is successful, failover is avoided. This technique requires that there be at least two different routes between the primary and backup site, one to carry replication traffic and another for the ping test. This approach avoids requiring a third node by using distinct network interfaces to create a quorum. A third approach is to use quorum protocols [39] with one or more quorum nodes outside the WAN cluster acting as arbitrators.

Procedure 1 Quorum Logic at Primary Site
Require: ReplicationTimeout == True
  suspend VMs and stop replication
  result = SendHeartbeat(quorumNode)
  if result == PrimaryAlive then        // Link/Backup failure
    resume VMs
  else                                  // Failover happened
    shutdown
  end if

Procedure 2 Quorum Logic at Backup Site
Require: ReplicationTimeout == True
  stop replication
  loop
    result = SendHeartbeat(quorumNode)
    if result == PrimaryAlive then      // Link failure
      shutdown
    else if result == TimeWaitPrimary then
      wait T secs for Primary to respond
    else if result == BackupAlive then  // Link/Primary failure
      issue BGP update to re-route network traffic
      resume VMs from latest checkpoint
    end if
  end loop

In SecondSite, we chose to build an arbitration service in which a majority of arbitration nodes distributed across the internet must be reachable by a site for it to be considered live. We deployed a quorum web service in Google App Engine [102]. Figure 3.1 shows the state transitions of a quorum server used by SecondSite. The corresponding logic used by the primary and backup sites is illustrated in Procedures 1 and 2 respectively. Using a quorum web service is just one example of how one could deploy a quorum service on the cloud cheaply. In fact, the quorum service we deployed consumed so few resources that it was well within the "daily free quota" offered by Google.
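The logic of Procedures 1 and 2 can be sketched in executable form. This is a simplified model: the heartbeat transport, VM control, and BGP update are abstracted into plain callables, the verdict strings mirror the procedure names, and (unlike Procedure 2) the backup returns once it has acted.

```python
# Simplified model of the quorum decision logic at each site.

def primary_on_timeout(send_heartbeat, resume_vms, shutdown):
    # Procedure 1: the primary suspends its VMs, asks the quorum node for
    # a verdict, and either resumes or shuts down.
    verdict = send_heartbeat()
    if verdict == "PrimaryAlive":    # link or backup failure
        resume_vms()
    else:                            # failover has happened
        shutdown()

def backup_on_timeout(send_heartbeat, shutdown, failover):
    # Procedure 2: the backup polls the quorum node. It may be told to
    # wait ("TimeWaitPrimary"), giving the primary a chance to report in.
    while True:
        verdict = send_heartbeat()
        if verdict == "PrimaryAlive":    # link failure; primary survives
            shutdown()
            return
        elif verdict == "BackupAlive":   # link or primary failure
            failover()                   # BGP update + resume from checkpoint
            return
        # "TimeWaitPrimary": poll again after a delay.
```

Note how the arbitrator, not the nodes, breaks the symmetry: a node that cannot obtain a favorable verdict never resumes on its own.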
When a failure is suspected by one node, a quorum check is initiated. If the backup node initiates the quorum check first, it waits for a configurable time period for the primary node to make contact. The wait phase avoids unnecessary failovers caused by transient connectivity loss between the two nodes. While one quorum server is sufficient, to provide better redundancy one could deploy several quorum servers and use a simple majority quorum algorithm to decide which of the two nodes should be victimized in the event of a network partition.

There are three failure scenarios to consider in our setup:

• One node failure. Recovery succeeds with the help of quorum servers. If the surviving node is the backup node, failure recovery procedures are initiated.

• Replication link failure. If a node obtains a quorum, it lives (or recovers). If not, it shuts down. It is thus possible that both nodes would shut down if they fail to reach any quorum servers, or if the quorum nodes in any partition are insufficient to satisfy a majority. For services that cannot tolerate partitioned operation, this behavior ensures data consistency.

• Quorum node(s) failure. Replication proceeds as usual.

Table 3.2 enumerates different failure scenarios, the quorum outcome and the time to achieve the quorum (i.e. the service downtime).

3.3.3 Failure Recovery

The dominant new challenge in recovering from failures in a wide-area environment is that of redirecting traffic: after failure detection determines that the backup site should step in, the IP addresses must be relocated as quickly as possible between the two locations on the Internet. As Remus continues to perform output commit and the failure detection protocol ensures that at most one VM instance is ever active at a time, there is no possibility that open TCP sessions will ever move into an irrecoverable state due to mismatched sequence numbers.
If IP advertisements change sufficiently quickly, failover will be almost completely transparent to clients; TCP sessions will remain open, and the only exposed aspect of failure will be the possibility of a small number of dropped packets during failover. To achieve this, SecondSite requires network support to quickly move IP addresses between the two sites. The system achieves this by interacting with the Border Gateway Protocol (BGP) to influence Internet routing decisions.

Failing Component   Quorum Outcome              Resolution Time (Service Downtime)
Replication Link    Backup Shutdown             RTT to Quorum Service
Primary [+ Link]    Failover to Backup          2 RTTs to Quorum Service + Quorum Timeout + Network Re-configuration
Backup [+ Link]     Primary survives            RTT to Quorum Service
Quorum Server       Replication continues       0
Primary + Quorum    Backup Shutdown             Service Failure (Manual Intervention)
Backup + Quorum     Primary Shutdown            Service Failure (Manual Intervention)
Link + Quorum       Primary & Backup Shutdown   Service Failure (Manual Intervention)

Table 3.2: Failure Scenarios and Downtimes using Quorum Servers

BGP is a path vector protocol in which every router selects the best route to destinations (IP address blocks) based on the routes advertised by neighboring routers. A BGP route advertisement for a given destination IP prefix contains the autonomous system numbers of all ASes it has traversed, since originating from the AS responsible for that prefix. Route advertisements also have one or more attributes that play a role in choosing the best possible route to a destination from a set of routes. Transitive attributes are passed on from one router to another, while non-transitive attributes are used to manipulate routing decisions between any two adjacent routers. When there are multiple routes to the same destination AS, the choice of the best route depends on the routing policies, AS path lengths and other attributes assigned to the route.

Figure 3.2 shows our routing architecture.
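The route-selection behaviour that SecondSite relies on can be illustrated with a toy model. This is a deliberate simplification of the BGP decision process, keeping only the two attributes the system manipulates, local preference (higher wins) and AS-path length (shorter wins); the AS numbers are private-range examples, not our deployment's.

```python
# Toy BGP best-route selection over local preference, then AS-path length.
# Real BGP implementations apply many more tie-breaking steps.

def best_route(routes):
    # routes: list of dicts with "site", "local_pref" and "as_path" keys.
    return max(routes, key=lambda r: (r["local_pref"], -len(r["as_path"])))

# Normal operation: the backup advertises a prepended (longer) path.
primary = {"site": "primary", "local_pref": 100, "as_path": [64500]}
backup  = {"site": "backup",  "local_pref": 100, "as_path": [64501, 64501, 64501]}
assert best_route([primary, backup])["site"] == "primary"

# On failover the backup raises its local preference and withdraws the
# prepends, making its route preferred once the update converges.
backup_failover = {"site": "backup", "local_pref": 200, "as_path": [64501]}
assert best_route([primary, backup_failover])["site"] == "backup"
```

The attribute manipulations described in the following subsections (MED, AS-path prepend, communities mapped to local preference) are different ways of influencing exactly this comparison at upstream routers.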
We leverage BGP multi-homing to achieve failure recovery. At a high level, the system works as follows: the set of hosts being protected by SecondSite can be reached through two different Internet routes, one leading to the primary site and another to the backup site. Only the primary site's route is kept active (i.e. the best route) during normal operation. When a failover occurs, the backup site issues a new route advertisement that has preferable attributes to those of the existing preferred route. BGP route attribute manipulation enables us to influence the routing decisions made by upstream ASes. An extensive measurement study done by Labovitz et al. [51] revealed that route updates from shorter to longer path lengths take a longer time to converge than the converse case. As we are concerned with ensuring that reachability to the backup host be established as quickly as possible on failover, we have opted for a configuration in which a single upstream AS provides both links. We believe that this is realistic for many deployments, but also observe that the intuition from Labovitz et al. is that it is desirable to limit the number of ASes that advertisements need to propagate through to the greatest degree possible in order to reduce convergence times.

In practice, BGP multi-homing can be achieved in a number of ways. As examples, the following are some key attributes used to indicate route preferences:

• Multi-exit discriminator (MED). When an AS has multiple entry points, the optional MED attribute can be used to provide a hint to neighbouring ASes about the preferred link for inbound traffic. As such, the MED attribute can be used to switch inbound traffic from a primary to a backup link. While MEDs are honored by current BGP implementations, they are non-transitive, i.e. a MED attribute received by an AS does not leave the AS. Thus, MED-based route manipulation only works in HA configurations that have a single upstream AS.

• AS-path prepend.
AS-path prepending is a common way to influence the BGP routing decisions of upstream routers. If there is no conflict with local routing policies, BGP routers generally select the route with the shortest path to a given destination prefix. Routers can be configured to indicate a reduced preference to the neighbouring ASes by advertising a longer AS-path for that prefix. AS-paths can be lengthened by prepending them one or more times with the router's AS number. AS-path prepending can be used as an HA mechanism for IP failover: the backup site advertises a longer path than the primary site for the IP prefix. On failover, it removes prepended entries to advertise a shorter path than the primary, resulting in a preferential route and so re-routing traffic to the backup site. This technique can be used even when there are multiple AS peerings at each site; however, the further upstream path adjustments have to travel in order to successfully redirect traffic, the more exposed the system will be to long convergence times and potential unreachability.

• Communities and local preference. A number of extensions have been added to BGP, largely with the intention of allowing ISPs to delegate a greater degree of control over routing decisions to their customers. BGP community and local preference attributes are two such extensions, and may be used in conjunction with AS-path prepending if the routing preferences should be visible beyond the ISP. The local preference attribute is used to influence routing decisions within an AS; routes with a higher local preference are favored over other routes. Communities allow collections of prefixes to be grouped and scoped together, and so act as a useful point of indirection to apply local preference updates.

To influence routing decisions within the ISP, the client and the ISP agree to map certain BGP communities advertised by the client to local preferences within the ISP.
This approach allows the client to use a private AS number, but will not propagate route changes beyond the ISP's AS without an analogous mapping of client BGP communities to ISP AS-path prepends.

SecondSite uses BGP communities mapped to local preferences and AS-path prepends for failure recovery. The local preferences are used to influence routing decisions within our upstream AS. The community mappings to AS-path prepends influence routing decisions beyond our upstream AS. Both the primary and backup sites advertise a BGP route for a /24 CIDR block. All the protected VMs are assigned IP addresses from this block. The backup site's AS-path to the IP prefix is longer than the primary's, as indicated in Figure 3.2, allowing the primary site to receive all network traffic for the VMs. When a failover is initiated, the backup site advertises a route with a shorter AS-path and higher local preference than the primary site. This causes traffic to flow to the backup site once the routing update converges.

Figure 3.2: SecondSite Setup over WAN

The approach that we have taken is hardly the only way to achieve route redirection with BGP. Other techniques, such as the use of more specific advertisements to redirect traffic, are also possible. An earlier version of our system used this approach successfully, but we were dissatisfied with the associated loss of IP address space that the technique required.

Further, BGP-based IP multi-homing is not the only solution to transparently fail over network connections. For example, in VPN-based setups, routing reconfigurations could be done on the VPN concentrator [96], in order to maintain client connectivity during VM migration.
Our concern with overlay network-based solutions such as these is that they often either introduce a single point of failure in the VPN concentrator, or lead to situations in which traffic has to be routed through both active and passive sites during some classes of failure.

3.3.4 Seamless Failback

After a crashed primary site comes back online, in order to restart SecondSite, storage has to be resynchronized from the backup to the primary site, without causing an outage to the VMs running on the backup site. Remus' storage replication driver, based on Xen's Blktap2 driver [106], does not provide a means for online resynchronization of storage. In order to understand the requirements of resynchronization, we turn the reader's attention to the disk replication component of Remus.

Remus replicates disk writes asynchronously over the network to the backup site during a checkpoint interval. The backup site buffers in memory all writes received during a checkpoint. At the end of a checkpoint, the primary site flushes any pending data in the socket and sends a commit message to the backup site. On reception of this message, the backup site sends an acknowledgement and then asynchronously flushes the buffered writes to disk. When the primary site fails during a checkpoint period, the backup site discards all buffered disk writes accumulated in that unfinished checkpoint.

A resynchronization module needs the following two functionalities:

• Track blocks written by VMs at the backup site after failover. When the primary site is back online, these blocks have to be replicated from the backup to the primary site, while the VMs continue performing I/O. This requires an online algorithm for dirty block tracking and resynchronization.

• Discard writes made by the primary site during the last unfinished checkpoint.
While it is easy for the backup site to compute these blocks from its checkpoint buffer and include them as part of dirty block resynchronization, this would be incomplete, since replication is asynchronous: only the primary site has complete knowledge of blocks written before failure. This necessitates that the primary site log the location of disk writes before doing the actual write. During resynchronization, the primary site uses the log to identify writes in the unfinished checkpoint prior to failure and overwrites them with data from the backup site.

DRBD [75] provides storage replication in asynchronous or synchronous mode with efficient online resynchronization functionality. Its quick-sync bitmap feature tracks writes done by a backup node after primary failure. Meanwhile, the primary node uses activity logging to keep track of active extents (4MB in size) during normal operation. It offers a variety of resynchronization options that suit SecondSite's needs. However, the current replication modes in DRBD do not support the checkpoint-based asynchronous replication required by SecondSite. We modified DRBD to add a new Remus-style replication protocol. DRBD is configured to perform a one-way resynchronization from the backup to the primary site of all blocks indicated by the quick-sync bitmap at the backup and the activity log at the primary.

During the resynchronization period, the VMs at the backup site continue to operate normally while the DRBD driver performs online storage resynchronization to bring the peer node up to date.
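The combination of the two block sets described above can be sketched as follows. This is an illustrative model, not DRBD's implementation: the quick-sync bitmap and activity log are reduced to Python sets of block indices, and disks to dictionaries.

```python
# Illustrative model of one-way resynchronization after failback.

def blocks_to_resync(backup_dirty, primary_unfinished):
    # Blocks written at the backup after failover (quick-sync bitmap),
    # plus blocks the primary wrote in its final, unfinished checkpoint
    # (activity log): both must be overwritten with the backup's data.
    return backup_dirty | primary_unfinished

def resync(backup_disk, primary_disk, backup_dirty, primary_unfinished):
    # Copy every block in the combined set from backup to primary,
    # restoring the primary to the backup's (authoritative) state.
    for block in blocks_to_resync(backup_dirty, primary_unfinished):
        primary_disk[block] = backup_disk.get(block)
```

The second set is what makes the primary-side write log essential: blocks it wrote speculatively before the crash are stale even though the backup never dirtied them after failover.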
Once the resynchronization is complete, SecondSite replication can be restarted very easily, either from the primary site after live migrating the VMs back to it, or by reversing the roles of the primary and backup sites.

3.4 Evaluation on a WAN Testbed

We have deployed SecondSite to provide continual protection for servers running in their home location at UBC (University of British Columbia) in Vancouver, BC (British Columbia), backed up to a site 260 kilometers away at Thompson Rivers University in Kamloops, BC. The two sites are connected by a 1-Gigabit link with an average round trip time of 5ms. Figure 3.2 shows the topology of our configuration. The primary and backup servers are equipped with 8-core Intel Xeon processors and 16GB of RAM, and run Xen Linux 2.6.18-8 on Xen 4.0.0. SecondSite was configured to checkpoint the VMs on the primary host at 50ms epoch intervals, for a checkpoint frequency of 20 checkpoints per second.

Test Name        Workload                                   Test Goal
MemoryStress     Continuous malloc, dirty and free.         Trigger memory corruption bugs
                 Check integrity of allocated memory        in compression logic.
                 upon failover.
PagetableStress  Parameterized fork bomb.                   Trigger memory corruption bugs
                                                            not caught by MemoryStress.
IOStress         Reads and writes on a large file in        Trigger bugs in disk
                 different I/O modes (buffered, direct).    replication logic.

Table 3.3: The Regression Test Suite

3.4.1 Regression Tests

Using a high-latency routed link for replication applies stress to the checkpointing, failure detection and recovery systems that is not experienced over an Ethernet link. We felt the best way to validate our system was to run continuous regression tests under a real deployment between two distinct sites. The test suite consisted of a cluster of VMs (each with 512MB RAM and 1 vCPU) protected by SecondSite. The VMs ran synthetic workloads that stressed the disk, memory and network subsystems, as described in Table 3.3.
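The MemoryStress check in Table 3.3 can be sketched as follows. This is an illustrative reimplementation, not the original test program: a checksum over deterministically dirtied buffers stands in for the integrity verification performed after failover.

```python
# Sketch of the MemoryStress integrity check: dirty memory with known
# contents, then verify it survived a (simulated) failover intact.
import hashlib
import random

def make_buffers(count, size, seed=0):
    # Allocate and dirty buffers with deterministic pseudo-random bytes,
    # recording a digest of everything written.
    rng = random.Random(seed)
    bufs = [bytes(rng.randrange(256) for _ in range(size)) for _ in range(count)]
    digest = hashlib.sha256(b"".join(bufs)).hexdigest()
    return bufs, digest

def verify(bufs, digest):
    # After failover, recompute the digest; a mismatch indicates memory
    # corruption, e.g. a stale page restored from the compression cache.
    return hashlib.sha256(b"".join(bufs)).hexdigest() == digest
```

The IOStress test applies the same idea to file contents rather than anonymous memory.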
As they ran, we regularly triggered site failures. On failover, the test programs verified the integrity of their memory and files to ensure that data was not corrupted. When a crashed site came back online, its disks were resynchronized while online using DRBD (Distributed Replicated Block Device). Once this phase completed, protection restarted with the recovered primary host acting as the new backup. The VMs continued to run their workload without interruption throughout this process.

The regression test suite turned up some interesting corner cases:

• Caching pagetable pages. Occasionally, the VM running the PagetableStress workload suffered fatal crashes in the guest kernel, pointing to random processes as the fault origin. Once it became evident that the crash was caused by memory corruption, we were able to trace the culprit to stale entries in the LRU page cache of the compression module. During live migration, a guest VM's pagetable pages are canonicalized before being transmitted to the target host, such that all references to the host's machine frames are replaced by their corresponding guest physical frames. On the receiving end, this process is reversed. The compression code only cached normal pages and transmitted pagetable pages as is, without updating the cache. Thus, when a normal page became a pagetable page, the cache ended up holding a stale version of the page with the valid bit still set. When the page became a normal page again, a page delta would be taken against the wrong version of the page, resulting in memory corruption at the receiving end. Simply evicting the cached copy of the pagetable page fixed the issue.

• Write-after-write race in disk replication. Write-heavy workloads sometimes experienced disk data corruption on failover. However, we could not reproduce the bug when we repeated the experiment several times on a different set of machines. We finally traced the problem to request re-ordering at the disk controller level.
When the backup flushes a disk checkpoint, it simply queues up the writes to disk, merging the checkpoint flush with any ongoing one. When there are overlapping writes in adjacent checkpoint flushes, the disk does not guarantee the order of these writes. Inserting a disk I/O barrier request at the end of every checkpoint flush fixed the issue.

3.4.2 Protecting a Cluster of VMs

Our goal in this section is to evaluate the cost and effectiveness of SecondSite in a production environment. To that end, we have provisioned an aggressive mix of macrobenchmark workloads on a single server and enabled concurrent protection on all of them. Specifically, we have installed 2 web servers and 2 databases with different physical resource allocations onto a host with 8 physical CPUs. Domain-0 is configured with a vCPU for each physical CPU. Two VMs run the Apache web server with 1G of RAM and 2 vCPUs each, serving dynamic web pages. The SPECweb 2005 Ecommerce benchmark [81] is used to generate the web server workloads. The remaining two VMs run the MySQL database server, with 2G of RAM and 2 vCPUs each. The DVD-Store OLTP benchmark [101] is used to generate the database workloads. The VMs are assigned IP addresses from a /24 CIDR block that is protected by SecondSite using the BGP failover mechanism described earlier. Each workload has 100 concurrent clients accessing the services via their protected IP addresses. The benchmarks were run with a 10 minute warm-up interval and 30 minutes of measurement. The OLTP (online transaction processing) benchmark's mean user think time was set to 10s.

Figure 3.3: SecondSite Failover and Failback of a Set of Heterogeneous Workloads
We present the benchmark results in terms of operations per minute, where an operation represents a database transaction for OLTP or an HTTP (hypertext transfer protocol) page request for SPECweb.

Figure 3.3 shows the throughput in terms of operations per minute observed by the clients over a 50 minute measurement period. A failure was injected at the primary site after 15 minutes of execution. With the quorum timeout set to 10 seconds, the backup site took 13 seconds to fully recover all the services and re-establish network connectivity. Note that while an outage of this duration might cause some applications to take recovery actions, in general it should not cause any interruption to the TCP connections from the clients, since TCP keepalives are disabled by default, and the default timeout when keepalives are enabled is 2 hours in at least Linux, FreeBSD, and Windows.

Restarting HA

SecondSite takes advantage of DRBD to resynchronize storage with the recovering primary site, without interrupting the execution of VMs on the backup site. Once storage resynchronization completes, VM replication can be resumed from the backup site (now acting as primary) to the primary site (now acting as backup). If the primary site is the preferred home for the VMs, they may optionally be migrated back at this point.

After a 15 minute outage period, the primary host was brought online. Resynchronization of each VM's disk was initiated in parallel. Resynchronization took 52s to complete, after which HA was restarted sequentially for each service, with the backup site now acting as the new primary. All VMs were fully protected after 88 seconds, with a constant time overhead of 20 seconds per VM to achieve steady-state replication. The throughput increases after failover because the VMs are running in an unprotected mode, and therefore do not incur any checkpointing overhead.
The drop in overall throughput begins during the disk resynchronization period and settles back to pre-failure levels after all VMs are protected.

Disk resynchronization

In order to restart HA for a VM, its disk has to be resynchronized with the recovered primary site. Our current recovery approach is to perform DRBD resynchronization of all VMs in parallel. This is a simple approach that is easy to validate and moves the system through recovery stages in lock step. However, it can produce contention on a shared physical disk that reduces the overall throughput of resynchronization. We have also experimented with resynchronizing VMs sequentially, so that each VM can be re-protected as soon as its individual synchronization is complete. This approach reduces the total resynchronization time and dramatically reduces the median unprotected window. In a test similar to the workload in Figure 3.3, the total resynchronization time is reduced from 37 seconds to 29 seconds, and median unprotected time falls from 23 seconds to 7 seconds.

3.4.3 DRBD Resynchronization Delays

Figure 3.4: OLTP Disk Resynchronization Costs (26M, 155M, 286M, and 367M of changed data for outage periods of 5m, 15m, 30m, and 1hr respectively)

Failback relies on DRBD to provide disk resynchronization, and cannot proceed until it completes. The time required for this process depends on the amount and locality of data written to disk while the VM is disconnected from its backup.
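The trade-off between parallel and sequential resynchronization noted above can be illustrated with a toy model (the parameters are illustrative assumptions, not the measured values): n VMs share one physical disk, and each VM needs d seconds' worth of resynchronization work.

```python
# Toy model of resynchronization scheduling (illustrative only): n VMs
# share one disk that completes one VM's worth of resync work in d
# seconds, with perfect bandwidth sharing in the parallel case.

def unprotected_windows(n, d, sequential):
    """Seconds each VM spends unprotected after the primary returns."""
    if sequential:
        # VM i is re-protected as soon as its own resync completes.
        return [(i + 1) * d for i in range(n)]
    # In parallel, the shared disk is divided n ways, so every VM
    # finishes (and is re-protected) at the same time, n * d.
    return [n * d] * n

def median(xs):
    ys = sorted(xs)
    mid = len(ys) // 2
    return ys[mid] if len(ys) % 2 else (ys[mid - 1] + ys[mid]) / 2

parallel = unprotected_windows(4, 10, sequential=False)
serial = unprotected_windows(4, 10, sequential=True)
print(max(parallel), median(parallel))  # 40 40.0
print(max(serial), median(serial))      # 40 25.0
```

Even under this idealized model, sequential resynchronization halves the median unprotected window while leaving the worst case unchanged; the reduction in total time observed in practice (37s to 29s) comes from avoiding seek contention on the shared disk, an effect the model does not capture.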
To give an idea of the costs, Figure 3.4 presents the amount of data changed by the OLTP workload (a mixture of sequential and random I/O) as a function of the outage period, along with the time required to resynchronize the disk.

Table 3.4: Resource Consumption in a Multi-VM Environment

    Resource                       Unprotected   Protected
    CPU Usage (Primary), Dom0           1.1%        4.0%
    CPU Usage (Primary), VMs           12.0%       11.6%
    CPU Usage (Backup), Dom0              --        8.4%
    Replication Bandwidth                 --   238.82 Mbps

Figure 3.5: Replication Bandwidth Consumption before Failover (4 VMs x 100 Clients/VM)

3.4.4 Resource Consumption & Throughput

Figure 3.5 shows the bandwidth consumption per VM when the entire site is protected by SecondSite. To get an idea of the CPU overhead incurred by SecondSite, we sampled the CPU utilization of all the domains on the physical host using the xentop [107] tool. All replication-related overheads are accounted to Domain-0, while other VMs are executing. The CPU was sampled every 3 seconds over a one hour execution period. Xentop reports an additive percentage utilization over all physical cores. We normalize this value to 100% and report the CPU utilization of Domain-0 and the aggregate CPU used by the four VMs in Table 3.4 for both unprotected and protected modes of operation. We also report the replication link bandwidth consumed by SecondSite during the one hour execution period in Table 3.4.

We also measure the resources consumed by SecondSite (CPU and replication bandwidth) as a function of varying application loads. We vary the user's "think time" parameter of the OLTP workload, which in turn determines the number of requests/minute arriving at the server. We define three representative loads: High, Medium and Low, corresponding to think time values of 5s, 10s and 15s respectively.
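As a rough sanity check on what these think-time settings mean in request terms, a closed-loop model (our simplification, which ignores service time) gives the offered load for each setting:

```python
# Closed-loop load model (a simplification that ignores service time):
# each of `clients` users issues one request and then waits `think_s`
# seconds, so the aggregate arrival rate is clients * 60 / think_s
# operations per minute.

def offered_load_per_min(clients, think_s):
    return clients * 60.0 / think_s

for label, think_s in [("High", 5), ("Medium", 10), ("Low", 15)]:
    print(label, offered_load_per_min(100, think_s))
# High 1200.0
# Medium 600.0
# Low 400.0
```

With 100 clients per VM, the three settings thus correspond to offered loads of roughly 1200, 600, and 400 operations per minute per VM before any server-side queueing.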
The aggregate CPU and network bandwidth consumed by Domain-0 are shown in Figure 3.6a and Figure 3.6b respectively.

Figure 3.6: Cost of HA as a Function of Application Load (OLTP Workload with 100 Concurrent Users); (a) Domain-0's CPU Utilization, (b) Bandwidth Usage on the Replication Channel

3.4.5 Throughput versus Replication Latency

For a latency-sensitive workload like SPECweb, throughput is limited by the latency between the clients and the server. Our earlier work [26] examined the impact of network output buffering delay on SPECweb throughput, but did not examine the effects of replication link latency (there was none). In SecondSite, higher replication latencies result in longer times to commit the checkpoint at the backup site. Since the outgoing packets from the protected VM are buffered until the commit acknowledgement is received at the primary site, an increase in replication latency increases the protected server's response time. For workloads like SPECweb, this increased response time results in lower throughput, as illustrated in Figure 3.7.

Figure 3.7: Impact of Replication Link Latency on Application Throughput (SPECweb with 100 Clients)

3.5 Related Work

3.5.1 Replication Overhead

Virtualization has been used to provide high availability for arbitrary applications running inside virtual machines, by replicating the entire virtual machine as it runs. Replication can be achieved either through event logging and execution replay or through whole machine checkpointing. While event logging requires much less bandwidth than whole machine checkpointing, it is not guaranteed to be able to reproduce machine state unless execution can be made deterministic.
Enforcing determinism on commodity hardware requires careful management of sources of non-determinism [12, 31], and becomes infeasibly expensive on shared-memory multiprocessor systems [5, 32, 97]. Respec [52] does provide deterministic execution recording and replay of multithreaded applications with good performance, by lazily increasing the level of synchronization it enforces depending on whether it observes divergence during replay, but it requires intricate modifications to the operating system. It also requires re-execution to be performed on a different core of the same physical system, making it unsuitable for HA applications. For these reasons, the replay-based HA systems of which we are aware support only uniprocessor VMs [78]. SecondSite, like Remus [26], uses whole machine checkpointing, so it supports multiprocessor VMs.

Checkpoint-based replication builds on live migration [22], the ability to migrate virtual machines from one physical host to another while they are running. The everRun VM [103] product, from Marathon Technologies, provides automated fault detection and failover for individual Windows VMs running over the Xen hypervisor. While it employs techniques similar to Remus, it is heavily tailored towards select Windows services like Exchange Server, SQL Server, etc. SecondSite, on the other hand, provides high availability through whole virtual machine replication with Xen [10] in an application and operating system agnostic manner, running over commodity hardware.

3.5.2 Minimizing Downtime

The established approach to disaster recovery in many environments today involves the synchronous or asynchronous replication of storage (e.g., SnapMirror [71], PipeCloud [95]). The Recovery Time Objective (RTO) and Recovery Point Objective (RPO) determine the degree of synchrony required and the overhead incurred during normal operation.
Generally, some support is expected from the application level in order to restore the disk data to a consistent state.

Live migration has been successfully exploited by many systems to minimize planned downtime when VMs are migrated across a WAN. Storage migration over WAN has been explored by Bradford et al. [11] and Hirofuchi et al. [43]. Svärd et al. [84] evaluate the advantages of page delta compression for live migration of large enterprise-class workloads.

CloudNet [96] demonstrated a viable cloud migration solution across data centers using storage migration, page delta compression, and custom VPN (virtual private network) concentrator components to maintain client connectivity during the migration. CloudNet primarily aims to tackle resource mobility and cloud burst scenarios for enterprise clouds. SecondSite could be integrated into CloudNet to achieve both dynamic resource pooling and fault tolerance across a WAN.

3.5.3 IP Migration across WAN

Harney et al. [41] use Mobile IPv6 to solve the IP address migration issue. This system requires ISP network-level support for Mobile IPv6, which is not yet widely available. Bradford et al. [11] propose the use of IP tunneling and dynamic DNS (domain name service) to maintain network connectivity during live migration of a VM across a WAN. Such a system is not resilient against site-wide failures or DNS caching at local proxies.

WOW [37] creates a VLAN (virtual local area network) on a P2P (peer to peer) overlay network in which connectivity to the migrating virtual machine is maintained by forwarding traffic to the new destination. In WOW, the connectivity problem is restricted in scope to maintaining access only among peers of the P2P network. VIOLIN [47] uses a similar technique to create a VLAN on an overlay network, with software routers and switches to route traffic to appropriate nodes in the network.
VIOLIN decouples the overlay virtual network from the underlying physical network, rendering changes in the underlying network topology transparent to the application. As with WOW, network transparency is only provided to participants in the overlay network.

VM Turntable [87] explores the feasibility of long-haul live migration over WANs and MANs (metropolitan area networks) across two continents. Clients are required to establish IP tunnels to the virtual machines, and the network infrastructure at the destination takes care of reconfiguring the tunnel endpoint when the VM migrates across network domains. In contrast, SecondSite provides location transparency without requiring cooperation from clients.

3.6 Conclusion

Netflix weathered the April 2011 AWS (Amazon Web Services) outage with very little impact on its business [24]. Some key aspects of its cloud architecture include completely stateless services and NoSQL-based data stores that sacrifice consistency for availability and durability. SecondSite may not be a good choice for such stateless services. However, building scalable stateless e-commerce services like Netflix's is hard, and it requires engineering skill that is generally not affordable for small and medium businesses. Application programmers use off-the-shelf solutions to rapidly develop and deploy web applications that often end up being stateful. We believe that SecondSite is a good fit for such applications, as it eliminates the burden of maintaining consistency while adding high availability that is transparent to the application.

Chapter 4

Strata: Scalable High-Performance Storage on Virtualized Non-volatile Memory

4.1 Introduction

Flash-based storage devices are fast, expensive and demanding: a single device is capable of saturating a 10Gb/s network link (even for random IO), consuming significant CPU resources in the process. That same device may cost as much as (or more than) the server in which it is installed¹.
The cost and performance characteristics of fast, non-volatile media have changed the calculus of storage system design and present new challenges for building efficient and high-performance datacenter storage.

This chapter describes the architecture of a commercial flash-based network-attached storage system, built using commodity hardware. In designing the system around PCIe flash, we begin with two observations about the effects of high-performance drives on large-scale storage systems. First, these devices are fast enough that in most environments, many concurrent workloads are needed to fully saturate them, and even a small degree of processing overhead will prevent full utilization. Thus, we must change our approach to the media from aggregation to virtualization. Second, aggregation is still necessary to achieve properties such as redundancy and scale. However, it must avoid the performance bottleneck that would result from the monolithic controller approach of a traditional storage array, which is designed around the obsolete assumption that media is the slowest component in the system. Further, to be practical in existing datacenter environments, we must remain compatible with existing client-side storage interfaces and support standard enterprise features like snapshots and deduplication.

¹Enterprise-class PCIe (peripheral component interconnect express) flash drives in the 1 TB (terabyte) capacity range currently carry list prices in the range of $3-5 thousand USD (United States dollars). Large-capacity, high-performance cards are available for list prices of up to $160K.

Figure 4.1: Strata Network Storage Architecture (layers: device virtualization via Network Attached Disks; global address space virtualization via delegated data paths; protocol virtualization via scalable protocol presentation, e.g., scalable NFSv3 integrated with an SDN switch over CLOS, the Coho Log-structured Object Store)

In this chapter we explore the implications of these two observations on the design of a scalable, high-performance NFSv3 (network file system version 3) implementation for the storage of virtual machine images. Our system is based on the building blocks of PCIe flash in commodity x86 servers connected by 10 gigabit switched Ethernet. We describe two broad technical contributions that form the basis of our design:

1. A delegated mapping and request dispatch interface from client data to physical resources through global data address virtualization, which allows clients
SDN (software defined networking)-assisted storage protocol virtualizationthat allows clients to address a single virtual protocol gateway (e.g., NFSserver) that is transparently scaled out across multiple real servers. We havebuilt a scalable NFS server using this technique, but it applies to other proto-cols such as iSCSI (internet small computer system interface), SMB (servermessage block), and FCoE (fibre channel over ethernet) as well.At its core, Strata uses device-level object storage and dynamic, global address-space virtualization to achieve a clean and efficient separation between controland data paths in the storage system. Flash devices are split into virtual addressspaces using an object storage-style interface, and clients are then allowed to di-rectly communicate with these address spaces in a safe, low-overhead manner. Inorder to compose richer storage abstractions, a global address space virtualizationlayer allows clients to aggregate multiple per-device address spaces with mappingsthat achieve properties such as striping and replication. These delegated addressspace mappings are coordinated in a way that preserves direct client communi-cations with storage devices, while still allowing dynamic and centralized controlover data placement, migration, scale, and failure response.Serving this storage over traditional protocols like NFS imposes a second scalabil-ity problem: clients of these protocols typically expect a single server IP address,which must be dynamically balanced over multiple servers to avoid being a perfor-mance bottleneck. In order to both scale request processing and to take advantageof full switch bandwidth between clients and storage resources, we developed ascalable protocol presentation layer that acts as a client to the lower layers of ourarchitecture, and that interacts with a software-defined network switch to scale theimplementation of the protocol component of a storage controller across arbitrar-ily many physical servers. 
By building protocol gateways as clients of the address virtualization layer, we preserve the ability to delegate scale-out access to device storage without requiring interface changes on the end hosts that consume the storage.

4.2 Architecture

The performance characteristics of emerging storage hardware demand that we completely reconsider storage architecture in order to build scalable, low-latency shared persistent memory. At the same time, the reality of deployed applications is that interfaces must stay exactly the same in order for a storage system to have relevance. Strata's architecture aims to take a step toward the first of these goals, while keeping a pragmatic focus on the second.

Figure 4.1 characterizes the three layers of Strata's architecture. The goals and abstractions of each layer of the system are in the left-hand column, and the concrete embodiment of these goals in our implementation is on the right. At the base, we make devices accessible over an object storage interface, which is responsible for virtualizing the device's address space and allowing clients to interact with individual virtual devices. This approach reflects our view that system design for these storage devices today is similar to that of CPU virtualization ten years ago: devices provide greater performance than is required by most individual workloads and so require a lightweight interface for controlled sharing in order to allow multi-tenancy. We implement a per-device object store that allows a device to be virtualized into an address space of 2^128 sparse objects, each of which may be up to 2^64 bytes in size.
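A minimal in-memory sketch of such a sparse-object interface follows; the method names are our own invention, not the actual NAD API. The essential properties are that 128-bit identifiers name independent sparse byte address spaces, and that reads of unwritten ranges return zeroes.

```python
# In-memory sketch of a sparse object store (illustrative names only):
# each object is a sparse 2^64-byte address space named by a 128-bit ID.

class SparseObjectStore:
    def __init__(self):
        self.objects = {}  # object id -> {byte offset: byte value}

    def create(self, oid):
        self.objects.setdefault(oid, {})

    def write(self, oid, offset, data):
        space = self.objects[oid]
        for i, b in enumerate(data):
            space[offset + i] = b

    def read(self, oid, offset, length):
        # Holes (unwritten bytes) read back as zeroes.
        space = self.objects[oid]
        return bytes(space.get(offset + i, 0) for i in range(length))

store = SparseObjectStore()
oid = 0x0123456789abcdef0123456789abcdef  # a 128-bit identifier
store.create(oid)
store.write(oid, 2**40, b"hello")    # sparse: far-apart offsets are cheap
print(store.read(oid, 2**40 - 2, 9)) # b'\x00\x00hello\x00\x00'
```

A real implementation backs these address spaces with a log-structured store on flash rather than a dictionary, but the interface contract (sparse addressing, zero-filled holes) is the same.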
Our implementation is similar in intention to the OSD (object storage device) specification, itself motivated by network attached secure disks [38]. While not broadly deployed to date, device-level object storage is receiving renewed attention today through pNFS (parallel network file system)'s use of OSD as a backend, the NVMe (non-volatile memory express) namespace abstraction, and emerging hardware such as Seagate's Kinetic drives [110]. Our object storage interface as a whole is not a significant technical contribution, but it does have some notable interface customizations, described in Section 4.4. We refer to this layer as a Network Attached Disk, or NAD.

The middle layer of our architecture provides a global address space that supports the efficient composition of IO processors that translate client requests on a virtual object into operations on a set of NAD-level physical objects. We refer to the graph of IO processors for a particular virtual object as its data path, and we maintain the description of the data path for every object in a global virtual address map. Clients use a dispatch library to instantiate the processing graph described by each data path and perform direct IO on the physical objects at the leaves of the graph. The virtual address map is accessed through a coherence protocol that allows central services to update the data paths for virtual objects while they are in active use by clients. More concretely, data paths allow physical objects to be composed into richer storage primitives, providing properties such as striping and replication.
The goal of this layer is to strike a balance between scalability and efficiency: it supports direct client access to device-level objects, without sacrificing central management of data placement, failure recovery, and more advanced storage features such as deduplication and snapshots.

Finally, the top layer performs protocol virtualization to allow clients to access storage over standard protocols (such as NFS) without losing the scalability of direct requests from clients to NADs. The presentation layer is tightly integrated with a 10Gb software-defined Ethernet switching fabric, giving external clients the illusion of connecting to a single TCP endpoint, while transparently and dynamically balancing traffic to that single IP address across protocol instances on all of the NADs. Each protocol instance is a thin client of the layer below, and may communicate with other protocol instances to perform any additional synchronization required by the protocol (e.g., to maintain NFS namespace consistency).

The mapping of these layers onto the hardware that our system uses is shown in Figure 4.2. Requests travel from clients into Strata through an OpenFlow-enabled switch, which dispatches them according to load to the appropriate protocol handler running on a MicroArray (µArray): a small host configured with flash devices and enough network and CPU to saturate them, containing the software stack representing a single NAD. For performance, each of the layers is implemented as a library, allowing a single process to handle the flow of requests from client to media. The NFSv3 implementation acts as a client of the underlying dispatch layer, which transforms requests on virtual objects into one or more requests on physical objects, issued through function calls to local physical objects and by RPC to remote objects.
While the focus of the rest of this chapter is on this concrete implementation of scale-out NFS, it is worth noting that the design is intended to allow applications the opportunity to link directly against the same data path library that the NFS implementation uses, resulting in a multi-tenant, multi-presentation storage system with a minimum of network and device-level overhead.

Figure 4.2: Hardware View of a Strata Deployment (VMware ESX hosts connect through a 10Gb SDN switch to a virtual NFS server, which is scaled out across microArrays each running an NFS instance, libDataPath, and CLOS)

Scope of this Work

There are three aspects of our design that are not considered in detail in this presentation. First, we only discuss NFS as a concrete implementation of protocol virtualization. Strata has been designed to host and support multiple protocols and tenants, but our initial product release is specifically NFSv3 for VMware clients, so we focus on this type of deployment in describing the implementation. Second, Strata was initially designed to be a software layer co-located on the same physical servers that host virtual machines. We have moved to a separate physical hosting model in which we build directly on dedicated hardware, but there is nothing that prevents the system from being deployed in a more co-located (or "converged") manner. Finally, our full implementation incorporates a tier of spinning disks on each of the storage nodes to allow cold data to be stored more economically behind the flash layer.
However, in this chapter we configure and describe a single-tier, all-flash system to simplify the exposition.

In the next sections we discuss three relevant aspects of Strata (address space virtualization, dynamic reconfiguration, and scalable protocol support) in more detail. We then describe some specifics of how these three components interact in our NFSv3 implementation for VM image storage, before providing a performance evaluation of the system as a whole.

4.3 Data Paths

Strata provides a common library interface to data that underlies the higher-level, client-specific protocols described in Section 4.6. This library presents a notion of virtual objects, which are available cluster-wide and may comprise multiple physical objects bundled together for parallel data access, fault tolerance, or other reasons (e.g., data deduplication). The library provides a superset of the object storage interface provided by the NADs (Section 4.4), with additional interfaces to manage the placement of objects (and ranges within objects) across NADs, to maintain data invariants (e.g., replication levels and consistent updates) when object ranges are replicated or striped, and to coordinate both concurrent access to data and concurrent manipulation of the virtual address maps describing their layout.

To avoid IO bottlenecks, users of the data path interface (which may be native clients or protocol gateways such as our NFS server) access data directly. To do so, they map requests from virtual objects to physical objects using the virtual address map. This is not simply a pointer from a virtual object (id, range) pair to a set of physical object (id, range) pairs. Rather, each virtual range is associated with a particular processor for that range, along with processor-specific context.
Strata uses a dispatch-oriented programming model in which a pipeline of operations is performed on requests as they are passed from an originating client, through a set of transformations, and eventually to the appropriate storage device(s). Our model borrows ideas from packet processing systems such as X-Kernel [46], Scout [62], and Click [50], but adapts them to a storage context, in which modules along the pipeline perform translations through a set of layered address spaces, and may fork and/or collect requests and responses as they are passed.

The dispatch library provides a collection of request processors, which can stand alone or be combined with other processors. Each processor takes a storage request (e.g., a read or write request) as input and produces one or more requests to its children. NADs expose isolated sparse objects; processors perform translations that allow multiple objects to be combined for some functional purpose, and present them as a single object, which may in turn be used by other processors. The idea of request-based address translation to build storage features has been used in other systems [55, 59, 109], often as the basis for volume management; Strata disentangles it from the underlying storage system and treats it as a first-class dispatch abstraction.

The composition of dispatch modules bears similarity to Click [50], but the application in a storage domain carries a number of differences. First, requests are generally acknowledged at the point that they reach a storage device, and so they differ from packet forwarding logic in that they travel both down and then back up through a dispatch stack; processors contain logic to handle both requests and responses. Second, it is common for requests to be split or merged as they traverse a processor. For example, a replication processor may duplicate a request and issue it to multiple nodes, and then collect all responses before passing a single response back up to its parent.
Finally, while processors describe fast, library-based request dispatching logic, they typically depend on additional facilities from the system. Strata allows processor implementations access to APIs for shared, cluster-wide state, which may be used on a control path to, for instance, store replica configuration. It additionally provides facilities for background functionality such as NAD failure detection and response. The intention of the processor organization is to allow dispatch decisions to be pushed out to client implementations and made with minimal performance impact, while still benefiting from common system-wide infrastructure for maintaining the system and responding to failures. The responsibilities of the dispatch library are described in more detail in the following subsections.

4.3.1 The Virtual Address Map

    /objects/112:
        type=regular
        dispatch={object=111 type=dispatch}
    /objects/111:
        type=dispatch
        stripe={stripecount=8 chunksize=524288
            0={object=103 type=dispatch}
            1={object=104 type=dispatch}}
    /objects/103:
        type=dispatch
        rpl={policy=mirror storecount=2
            {storeid=a98f2... state=in-sync}
            {storeid=fc89f... state=in-sync}}

Figure 4.3: Virtual Object to Physical Object Range Mapping

Figure 4.3 shows the relevant information stored in the virtual address map for a typical object. Each object has an identifier, a type, some type-specific context, and may contain other metadata such as cached size or modification time information (which is not canonical, for reasons discussed below).

The entry point into the virtual address map is a regular object. This contains no location information on its own, but delegates to a top-level dispatch object.
In Figure 4.3, object 112 is a regular object that delegates to a dispatch processor whose context is identified by object 111 (the IDs are in reverse order here because the dispatch graph is created from the bottom up, but traversed from the top down). Thus when a client opens file 112, it instantiates a dispatcher using the data in object 111 as context. This context informs the dispatcher that it will be delegating IO through a striped processor, using 8 stripes for the object and a stripe width of 512K. The dispatcher in turn instantiates 8 processors (one for each stripe), each configured with the information stored in the object associated with each stripe (e.g., stripe 0 uses object 103). Finally, when the stripe dispatcher performs IO on stripe 0, it will use the context in the object descriptor for object 103 to instantiate a replicated processor, which mirrors writes to the NADs listed in its replica set, and issues reads to the nearest in-sync replica (where distance is currently simply local or remote).

In addition to the striping and mirroring processors described here, the map can support other more advanced processors, such as erasure coding, or byte-range mappings to arbitrary objects (which supports, among other things, data deduplication).

4.3.2 Dispatch

IO requests are handled by a chain of dispatchers, each of which has some common functionality. Dispatchers may have to fragment requests into pieces if they span the ranges covered by different subprocessors, or clone requests into multiple subrequests (e.g., for replication), and they must collect the results of subrequests and deal with partial failures.

The replication and striping modules included in the standard library are representative of the ways processors transform requests as they traverse a dispatch stack. The replication processor allows a request to be split and issued concurrently to a set of replica objects.
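The fragment-and-clone behaviour of these processors can be sketched as follows. This is a simplified, synchronous model with invented names; the real dispatch library issues subrequests asynchronously and handles partial failures.

```python
# Sketch of a dispatch chain (illustrative, synchronous): each processor
# takes a request, may fragment or clone it, forwards it to its children,
# and collects the responses on the way back up.

class Store:                      # leaf: one physical object on one NAD
    def __init__(self):
        self.data = {}
    def write(self, off, data):
        for i, b in enumerate(data):
            self.data[off + i] = b
    def read(self, off, length):
        return bytes(self.data.get(off + i, 0) for i in range(length))

class Replicator:                 # clone: mirror writes to every replica
    def __init__(self, replicas):
        self.replicas = replicas
    def write(self, off, data):
        for r in self.replicas:   # acknowledge only after all replicas
            r.write(off, data)
    def read(self, off, length):
        return self.replicas[0].read(off, length)  # prefer first replica

class Striper:                    # fragment: split on stripe boundaries
    def __init__(self, children, stripe):
        self.children, self.stripe = children, stripe
    def _split(self, off, length):
        # Yield (child, child-local offset, chunk size) per fragment.
        while length > 0:
            chunk = min(length, self.stripe - off % self.stripe)
            block = off // self.stripe
            child = self.children[block % len(self.children)]
            coff = block // len(self.children) * self.stripe + off % self.stripe
            yield child, coff, chunk
            off, length = off + chunk, length - chunk
    def write(self, off, data):
        done = 0
        for child, coff, chunk in self._split(off, len(data)):
            child.write(coff, data[done:done + chunk])
            done += chunk
    def read(self, off, length):
        return b"".join(child.read(coff, chunk)
                        for child, coff, chunk in self._split(off, length))

# Two stripes, each mirrored across two stores, with a 4-byte stripe width:
path = Striper([Replicator([Store(), Store()]) for _ in range(2)], stripe=4)
path.write(2, b"abcdef")          # the request crosses stripe boundaries
print(path.read(0, 10))           # b'\x00\x00abcdef\x00\x00'
```

The write above is fragmented by the striper into per-stripe requests, each of which is then cloned by a replicator to both of its stores; responses are collected at each level before a single acknowledgement propagates back to the caller.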
The request address remains unchanged within each object, and responses are not returned until all replicas have acknowledged a request as complete. The processor prioritizes reading from local replicas, but forwards requests to remote replicas in the event of a failure (either an error response or a timeout). It imposes a global ordering on write requests and streams them to all replicas in parallel. It also periodically commits a lightweight checkpoint to each replica's log to maintain a persistent record of synchronization points; these checkpoints are used for crash recovery (Section 4.5.1).

The striping processor distributes data across a collection of sparse objects. It is parameterized to take a stripe size (in bytes) and a list of objects to act as the ordered stripe set. In the event that a request crosses a stripe boundary, the processor splits that request into a set of per-stripe requests and issues those asynchronously, collecting the responses before returning. Static, address-based striping is a relatively simple load balancing and data distribution mechanism compared to placement schemes such as consistent hashing [48]. Our experience has been that the approach is effective, because data placement tends to be reasonably uniform within an object address space, and because using a reasonably large stripe size (we default to 512KB) preserves locality well enough to keep request fragmentation overhead low in normal operation.

4.3.3 Coherence

Strata clients also participate in a simple coordination protocol in order to allow the virtual address map for a virtual object to be updated even while that object is in use.
Online reconfiguration provides a means for recovering from failures, responding to capacity changes, and even moving objects in response to observed or predicted load (on a per-device basis; this is distinct from client load balancing, which we also support through a switch-based protocol described in Section 4.6.2).

The virtual address maps are stored in a distributed, synchronized configuration database implemented over Apache ZooKeeper, which is also available for any low-bandwidth synchronization required by services elsewhere in the software stack. The coherence protocol is built on top of the configuration database. It is currently optimized for a single writer per object, and works as follows: when a client wishes to write to a virtual object, it first claims a lock for it in the configuration database. If the object is already locked, the client requests that the holder release it so that the client can claim it. If the holder does not voluntarily release it within a reasonable time, the holder is considered unresponsive and fenced from the system using the mechanism described in Section 4.6.2. This is enough to allow movement of objects, by first creating new, out of sync physical objects at the desired location, then requesting a release of the object's lock holder if there is one. The user of the object will reacquire the lock on the next write, and in the process discover the new out of sync replica and initiate resynchronization. When the new replica is in sync, the same process may be repeated to delete replicas that are at undesirable locations.

4.4 Network Attached Disks

The unit of storage in Strata is a Network Attached Disk (NAD), consisting of a balanced combination of CPU, network and storage components. In our current hardware, each NAD has two 10 gigabit Ethernet ports, two PCIe flash cards capable of 10 gigabits of throughput each, and a pair of Xeon processors that can keep up with request load and host additional services alongside the data path.
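The single-writer locking protocol of Section 4.3.3 can be modelled in a few lines. This sketch substitutes an in-memory dict for the ZooKeeper-backed configuration database, and all names here are invented for illustration.

```python
# Toy model of the single-writer coherence protocol: claim a lock,
# ask the current holder to release it, and fence holders that do not
# respond (Section 4.6.2). Not Strata's actual implementation.

locks = {}      # object id -> current lock holder
fenced = set()  # holders cut off from the network by the switch

def acquire(obj, client, holder_responsive=True):
    holder = locks.get(obj)
    if holder is None or holder == client:
        locks[obj] = client
        return "acquired"
    # Ask the holder to release the lock voluntarily.
    if holder_responsive:
        locks[obj] = client
        return "acquired-after-release"
    # Holder is unresponsive: fence it from the system, then take over.
    fenced.add(holder)
    locks[obj] = client
    return "acquired-after-fencing"
```

Object movement falls out of this protocol: create out-of-sync replicas at the new location, force a lock release, and the next writer's reacquisition discovers the new replicas and starts resync.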
Each NAD provides two distinct services. First, it efficiently multiplexes the raw storage hardware across multiple concurrent users, using an object storage protocol. Second, it hosts applications that provide higher level services over the cluster. Object rebalancing (Section 4.5.2) and the NFS protocol interface (Section 4.6.1) are examples of these services.

At the device level, we multiplex the underlying storage into objects, named by 128-bit identifiers and consisting of sparse 2^64-byte data address spaces. These address spaces are currently backed by a garbage-collected log-structured object store, but the implementation of the object store is opaque to the layers above and could be replaced if newer storage technologies made different access patterns more efficient. We also provide increased capacity by allowing each object to flush low priority or infrequently used data to disk, but this is again hidden behind the object interface. The details of disk tiering, garbage collection, and the layout of the file system are beyond the scope of this thesis.

The physical object interface is for the most part a traditional object-based storage device [110, 111] with a CRUD (create, read, update, delete) interface for sparse objects, as well as a few extensions to assist with our clustering protocol (Section 4.5.1). It is significantly simpler than existing block device interfaces, such as the SCSI command set, but is also intended to be more direct and general purpose than even narrower interfaces such as those of a key-value store. Providing a low-level hardware abstraction layer allows the implementation to be customized to accommodate best practices of individual flash implementations, and also allows more dramatic design changes at the media interface level as new technologies become available.

Network Integration

As with any distributed system, we must deal with misbehaving nodes.
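A minimal sketch of such a sparse-object CRUD interface, substituting a flat dictionary for the real log-structured, garbage-collected store:

```python
# Toy sparse object store: 128-bit object IDs, each naming a sparse
# 2^64-byte address space with a CRUD interface. Purely illustrative;
# the real store is log-structured and garbage collected.

class ObjectStore:
    def __init__(self):
        self.objects = {}  # object id -> {byte offset: byte value}

    def create(self, oid):
        assert 0 <= oid < 2 ** 128  # IDs are 128-bit
        self.objects[oid] = {}

    def update(self, oid, offset, data):
        obj = self.objects[oid]
        for i, b in enumerate(data):
            obj[offset + i] = b

    def read(self, oid, offset, length):
        obj = self.objects[oid]
        # Unwritten regions of the sparse address space read as zero.
        return bytes(obj.get(offset + i, 0) for i in range(length))

    def delete(self, oid):
        del self.objects[oid]
```

Because the address space is sparse, a write at a very large offset costs only the bytes actually written, which is what makes striped and byte-range mappings over these objects cheap.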
We address this problem by tightly coupling with managed Ethernet switches, which we discuss at more length in Section 4.6.2. This approach borrows ideas from systems such as Sane [16] and Ethane [15], in which a managed network is used to enforce isolation between independent endpoints. The system integrates with both OpenFlow-based switches and software switching at the VMM to ensure that Strata objects are only addressable by their authorized clients.

Our initial implementation used Ethernet VLANs, because this form of hardware-supported isolation is in common use in enterprise environments. In the current implementation, we have moved to OpenFlow, which provides a more flexible tunneling abstraction for traffic isolation.

We also expose an isolated private virtual network for out-of-band control and management operations internal to the cluster. This allows NADs themselves to access remote objects for peer-wise resynchronization and reorganization under the control of a cluster monitor.

4.5 Online Reconfiguration

There are two broad categories of events to which Strata must respond in order to maintain its performance and reliability properties. The first category includes faults that occur directly on the data path. The dispatch library recovers from such faults immediately and automatically by reconfiguring the affected virtual objects on behalf of the client. The second category includes events such as device failures and load imbalance. These are handled by a dedicated cluster monitor which performs large-scale reconfiguration tasks to maintain the health of the system as a whole.
In all cases, reconfiguration is performed online and has minimal impact on client availability.

4.5.1 Object Reconfiguration

A number of error recovery mechanisms are built directly into the dispatch library. These mechanisms allow clients to quickly recover from failures by reconfiguring individual virtual objects on the data path.

IO Errors

The replication IO processor responds to read errors in the obvious way: by immediately resubmitting failed requests to different replicas. In addition, clients maintain per-device error counts; if the aggregated error count for a device exceeds a configurable threshold, a background task takes the device offline and coordinates a system-wide reconfiguration (Section 4.5.2).

IO processors respond to write errors by synchronously reconfiguring virtual objects at the time of the failure. This involves three steps. First, the affected replica is marked out of sync in the configuration database. This serves as a global, persistent indication that the replica may not be used to serve reads because it contains potentially stale data. Second, a best-effort attempt is made to inform the NAD of the error so that it can initiate a background task to resynchronize the affected replica. This allows the system to recover from transient failures almost immediately. Finally, the IO processor allocates a special patch object on a separate device and adds this to the replica set. Once a replica has been marked out of sync, no further writes are issued to it until it has been resynchronized; patches prevent device failures from impeding progress by providing a temporary buffer to absorb writes under these degraded conditions. With the patch object allocated, the IO processor can continue to meet the replication requirements for new writes while out of sync replicas are repaired in the background.
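The three-step write-error path can be sketched as follows. Data structures and names are invented for illustration, and step two (notifying the NAD so it can start background resync) is elided:

```python
# Sketch of synchronous write-error reconfiguration: mark the replica
# out of sync, then allocate a patch object on another device to
# absorb writes while the replica is repaired. Illustrative only.

def handle_write_error(config_db, replica_set, failed_replica, alloc_patch):
    # 1. Persistently mark the replica out of sync so it is never
    #    used to serve reads of potentially stale data.
    config_db[failed_replica] = "out-of-sync"
    # 2. (Best effort, elided here) tell the NAD to start a
    #    background resync of the affected replica.
    # 3. Allocate a patch object on a separate device and add it to
    #    the replica set so new writes still meet the replication
    #    requirement.
    patch = alloc_patch()
    replica_set.append(patch)
    return patch

config_db = {}
replica_set = ["r1", "r2"]
patch = handle_write_error(config_db, replica_set, "r2",
                           alloc_patch=lambda: "patch-1")
```

The key invariant is ordering: the out-of-sync mark is made durable before the write is acknowledged, which is what makes the crash-recovery rollback described later safe.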
A replica set remains available as long as an in sync replica, or an out of sync replica and all of its patches, are available.

Resynchronization

In addition to providing clients direct access to devices via virtual address maps, Strata provides a number of background services to maintain the health of individual virtual objects and the system as a whole. The most fundamental of these is the resync service, which provides a background task that can resynchronize objects replicated across multiple devices.

Resync is built on top of a special NAD resync API (application program interface) that exposes the underlying log structure of the object stores. NADs maintain a Log Serial Number (LSN) with every physical object in their stores; when a record is appended to an object's log, its LSN is monotonically incremented. The IO processor uses these LSNs to impose a global ordering on the changes made to physical objects that are replicated across stores and to verify that all replicas have received all updates.

If a write failure causes a replica to go out of sync, the client can request the system to resynchronize the replica. It does this by invoking the resync RPC (remote procedure call) on the NAD which hosts the out of sync replica. The server then starts a background task which streams the missing log records from an in sync replica and applies them to the local out of sync copy, using the LSN to identify which records the local copy is missing.

During resync, the background task has exclusive write access to the out of sync replica because all clients have been reconfigured to use patches.
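The LSN comparison at the heart of resync can be sketched as follows, representing each log as a list of (LSN, record) pairs. The real API exposes the log structure of the NAD object store rather than Python lists; this is only a model of the logic.

```python
# Sketch of LSN-driven resynchronization: stream to the out-of-sync
# replica exactly the log records it is missing, in order.

def resync(source_log, target_log):
    """Append to target_log every record it is missing.

    Logs are lists of (lsn, record) pairs. Because LSNs increase
    monotonically as records are appended, the target's highest LSN
    identifies exactly which suffix of the source log to stream.
    """
    have = target_log[-1][0] if target_log else 0
    for lsn, record in source_log:
        if lsn > have:
            target_log.append((lsn, record))
    return target_log
```

Since clients keep writing through patches during resync, the task can chase the tail of the in-sync log this way until the final stop-and-copy phase described below.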
Thus the resync task can chase the tail of the in sync object's log while clients continue to write. When the bulk of the data has been copied, the resync task enters a final stop-and-copy phase in which it acquires exclusive write access to all replicas in the replica set, finalizes the resync, applies any client writes received in the interim, marks the replica as in sync in the configuration database, and removes the patch.

It is important to ensure that resync makes timely progress to limit vulnerability to data loss. Very heavy client write loads may interfere with resync tasks and, in the worst case, result in unbounded transfer times. For this reason, when an object is under resync, client writes are throttled and resync requests are prioritized.

Crash Recovery

Special care must be taken in the event of an unclean shutdown. On a clean shutdown, all objects are released by removing their locks from the configuration database. Crashes are detected when replica sets are discovered with stale locks (i.e., locks identifying unresponsive IO processors). When this happens, it is not safe to assume that replicas marked in sync in the configuration database are truly in sync, because a crash might have occurred midway through a configuration database update; instead, all the replicas in the set must be queried directly to determine their states.

In the common case, the IO processor retrieves the LSN for every replica in the set and determines which replicas, if any, are out of sync. If all replicas have the same LSN, then no resynchronization is required. If different LSNs are discovered, then the replica with the highest LSN is designated as the authoritative copy, all other replicas are marked out of sync, and resync tasks are initiated.

If a replica cannot be queried during the recovery procedure, it is marked as diverged in the configuration database and the replica with the highest LSN from the remaining available replicas is chosen as the authoritative copy.
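The recovery decision just described (query every replica's LSN and take the highest reachable copy as authoritative) can be sketched as follows; the state names mirror the text, but the function itself is invented for illustration:

```python
# Sketch of crash-recovery replica classification. Unreachable
# replicas are marked diverged; among reachable ones, the highest
# LSN wins and lower LSNs are marked out of sync.

def recover(replica_lsns):
    """replica_lsns maps replica id -> LSN, or None if unreachable."""
    reachable = {r: lsn for r, lsn in replica_lsns.items()
                 if lsn is not None}
    authority = max(reachable, key=reachable.get)
    states = {}
    for r, lsn in replica_lsns.items():
        if lsn is None:
            states[r] = "diverged"
        elif lsn == reachable[authority]:
            states[r] = "in-sync"
        else:
            states[r] = "out-of-sync"
    return authority, states
```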
In this case, writes may have been committed to the diverged replica that were not committed to any others. If the diverged replica becomes available again some time in the future, these extra writes must be discarded. This is achieved by rolling the replica back to its last checkpoint and starting a resync from that point in its log. Consistency in the face of such rollbacks is guaranteed by ensuring that objects are successfully marked out of sync in the configuration database before writes are acknowledged to clients. Thus write failures are guaranteed to either mark replicas out of sync in the configuration database (and create corresponding patches) or propagate back to the client.

4.5.2 System Reconfiguration

Strata also provides a highly-available monitoring service that watches over the health of the system and coordinates system-wide recovery procedures as necessary. Monitors collect information from clients, SMART diagnostic tools, and NAD RPCs to gauge the status of the system. Monitors build on the per-object reconfiguration mechanisms described above to respond to events that individual clients do not address, such as load imbalance across the system, stores nearing capacity, and device failures.

Rebalance

Strata provides a rebalance facility which is capable of performing system-wide reconfiguration to repair broken replicas, prevent NADs from filling to capacity, and improve load distribution across NADs. This facility is in turn used to recover from device failures and expand onto new hardware.

Rebalance proceeds in two stages. In the first stage, the monitor retrieves the current system configuration, including the status of all NADs and the virtual address map of every virtual object. It then constructs a new layout for the replicas according to a customizable placement policy.
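One plausible greedy placement policy of this kind can be sketched as follows, using two of the stated criteria: replicas of the same object must not share a fault domain, and capacity imbalance should be minimized. The specific algorithm and its weighting are assumptions; the thesis does not specify them at this level of detail.

```python
# Toy greedy replica placement: for each replica, pick the
# least-full NAD whose fault domain is not already used by this
# replica set. Migration-cost criteria are omitted.

def place_replicas(nads, n_replicas):
    """nads: list of dicts with 'name', 'used', and 'fault_domain'."""
    chosen, used_domains = [], set()
    for _ in range(n_replicas):
        candidates = [n for n in nads
                      if n["fault_domain"] not in used_domains]
        best = min(candidates, key=lambda n: n["used"])
        chosen.append(best["name"])
        used_domains.add(best["fault_domain"])
        best["used"] += 1  # account for the replica just placed
    return chosen
```

Running such a policy over every replica set yields the rebalance plan that the second stage then executes via resync tasks.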
This process is scriptable and can be easily tailored to suit specific performance and durability requirements for individual deployments (see Section 4.7.3 for some analysis of the effects of different placement policies). The default policy uses a greedy algorithm that considers a number of criteria designed to ensure that replicated physical objects do not share fault domains, capacity imbalances are avoided as much as possible, and migration overheads are kept reasonably low. The new layout is formulated as a rebalance plan describing what changes need to be applied to individual replica sets to achieve the desired configuration.

In the second stage, the monitor coordinates the execution of the rebalance plan by initiating resync tasks on individual NADs to effect the necessary data migration. When replicas need to be moved, the migration is performed in three steps:

1. A new replica is added to the destination NAD
2. A resync task is performed to transfer the data
3. The old replica is removed from the source NAD

This requires two reconfiguration events for the replica set, the first to extend it to include the new replica, and the second to prune the original after the resync has completed. The monitor coordinates this procedure across all NADs and clients for all modified virtual objects.

Device Failure

Strata determines that a NAD has failed either when it receives a hardware failure notification from a responsive NAD (such as a failed flash device or excessive error count) or when it observes that a NAD has stopped responding to requests for more than a configurable timeout. In either case, the monitor responds by taking the NAD offline and initiating a system-wide reconfiguration to repair redundancy.

The first thing the monitor does when taking a NAD offline is to disconnect it from the data path VLAN.
This is a strong benefit of integrating directly against an Ethernet switch in our environment: prior to taking corrective action, the NAD is synchronously disconnected from the network for all request traffic, avoiding the distributed systems complexities that stem from things such as overloaded components appearing to fail and then returning long after a timeout in an inconsistent state. Rather than attempting to use purely end-host mechanisms such as watchdogs to trigger reboots, or agreement protocols to inform all clients of a NAD's failure, Strata disables the VLAN and requires that the failed NAD reconnect on the (separate) control VLAN in the event that it returns to life in the future.

From this point, the recovery logic is straightforward. The NAD is marked as failed in the configuration database and a rebalance job is initiated to repair any replica sets containing replicas on the failed NAD.

Elastic Scale Out

Strata responds to the introduction of new hardware much in the same way that it responds to failures. When the monitor observes that new hardware has been installed, it uses the rebalance facility to generate a layout that incorporates the new devices. Because replication is generally configured underneath striping, we can migrate virtual objects at the granularity of individual stripes, allowing a single striped file to exploit the aggregated performance of many devices. Objects, whether whole files or individual stripes, can be moved to another NAD even while the file is online, using the existing resync mechanism. New NADs are populated in a controlled manner to limit the impact of background IO on active client workloads.

4.6 Storage Protocols

Strata supports legacy protocols by providing an execution runtime for hosting protocol servers. Protocols are built as thin presentation layers on top of the dispatch interfaces; multiple protocol instances can operate side by side.
Implementations can also leverage SDN-based protocol scaling to transparently spread multiple clients across the distributed runtime environment.

4.6.1 Scalable NFS

Strata is designed so that application developers can focus primarily on implementing protocol specifications without worrying much about how to organize data on disk. We expect that many storage protocols can be implemented as thin wrappers around the provided dispatch library. Our NFS implementation, for example, maps very cleanly onto the high-level dispatch APIs, providing only protocol-specific extensions like RPC marshalling and NFS-style access control. It takes advantage of the configuration database to store mappings between the NFS namespace and the backend objects, and it relies exclusively on the striping and replication processors to implement the data path. Moreover, Strata allows NFS servers to be instantiated across multiple backend nodes, automatically distributing the additional processing overhead across backend compute resources.

4.6.2 SDN Protocol Scaling

Scaling legacy storage protocols can be challenging, especially when the protocols were not originally designed for a distributed back end. Protocol scalability limitations may not pose significant problems for traditional arrays, which already sit behind relatively narrow network interfaces, but they can become a performance bottleneck in Strata's distributed architecture.

A core property limiting the access bandwidth of conventional IP storage protocols is that storage servers are presented behind a single IP address. Fortunately, emerging "software defined" network (SDN) switches provide interfaces that allow applications to take more precise control over packet forwarding through Ethernet switches than has traditionally been possible.

Using the OpenFlow protocol, a software controller is able to interact with the switch by pushing flow-specific rules onto the switch's forwarding path.
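Conceptually, such flow rules are wildcarded match/action entries evaluated in order. The toy matcher below illustrates the idea only; real OpenFlow rules match many more header fields and are installed on (and evaluated by) the switch hardware, not a host:

```python
# Illustrative match/action flow rules: None fields are wildcards
# that match any value; the first matching rule's action applies.

def make_rule(src=None, dst=None, action="drop"):
    return {"src": src, "dst": dst, "action": action}

def apply_rules(rules, packet):
    """Return the action of the first matching rule (default-deny)."""
    for rule in rules:
        if rule["src"] not in (None, packet["src"]):
            continue
        if rule["dst"] not in (None, packet["dst"]):
            continue
        return rule["action"]
    return "drop"  # unauthorized clients cannot reach Strata objects

# Example: only client-1 may reach nad-3 (names invented).
rules = [make_rule(src="client-1", dst="nad-3", action="forward")]
```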
OpenFlow rules are effectively wildcarded packet filters and associated actions that tell a switch what to do when a matching packet is identified. SDN switches (our implementation currently uses an Arista Networks 7050T-52) interpret these flow rules and push them down onto the switch's TCAM (ternary content-addressable memory) or L2/L3 (layer 2/layer 3) forwarding tables.

By manipulating traffic through the switch at the granularity of individual flows, Strata protocol implementations are able to present a single logical IP address to multiple clients. Rules are installed on the switch to trigger a fault event whenever a new NFS session is opened, and the resulting exception path determines which protocol instance to forward that session to initially. A service monitors network activity and migrates client connections as necessary to maintain an even workload distribution.

The protocol scaling API wraps and extends the conventional socket API, allowing a protocol implementation to bind to and listen on a shared IP address across all of its instances. The client load balancer then monitors the traffic demands across all of these connections and initiates flow migration in response to overload on any individual physical connection.

In its simplest form, client migration is handled entirely at the transport layer. When the protocol load balancer observes that a specific NAD is overloaded, it updates the routing tables to redirect the busiest client workload to a different NAD. Once the client's traffic is diverted, it receives a TCP RST from the new NAD and establishes a new connection, thereby transparently migrating traffic to the new NAD.

Strata also provides hooks for situations where application layer coordination is required to make migration safe.
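The transport-layer migration described above can be sketched as follows. Here the switch's routing state is just a dict mapping clients to NADs, and per-NAD load numbers are supplied directly; both are stand-ins invented for the example.

```python
# Sketch of load-balancer driven client migration: when one NAD is
# overloaded, redirect a client from it to the least-loaded NAD.

def rebalance_flows(flows, load, threshold):
    """flows: client -> NAD serving it; load: NAD -> offered load."""
    hot = max(load, key=load.get)
    if load[hot] <= threshold:
        return None                      # no NAD is overloaded
    cold = min(load, key=load.get)
    # Pick a client currently routed to the hot NAD and redirect it.
    victim = next(c for c, n in flows.items() if n == hot)
    flows[victim] = cold
    # In the real system, the redirected client now receives a TCP
    # RST from the new NAD and transparently reconnects there.
    return victim, cold
```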
For example, our NFS implementation registers a pre-migration routine with the load balancer, which allows the source NFS server to flush any pending, non-idempotent requests (such as create or remove) before the connection is redirected to the destination server.

4.7 Evaluation

In this section we evaluate our system both in terms of effective use of flash resources, and as a scalable, reliable provider of storage for NFS clients. First, we establish baseline performance over a traditional NFS server on the same hardware. Then we evaluate how performance scales as nodes are added and removed from the system, using VM-based workloads over the legacy NFS interface, which is oblivious to cluster changes. In addition, we compare the effects of load balancing and object placement policy on performance. We then test reliability in the face of node failure, which is a crucial feature of any distributed storage system. We also examine the relation between CPU power and performance in our system as a demonstration of the need to balance node power between flash, network and CPU.

4.7.1 Test Environment

Evaluation was performed on a cluster of the maximum size allowed by our 48-port switch: 12 NADs, each of which has two 10 gigabit Ethernet ports, two 800 GB Intel 910 PCIe flash cards (with advertised performance of 180,000 IOPS or 2000 MBps at 100% read workloads, and 75,000 IOPS or 1000 MBps for 100% writes), six 3 TB SATA drives, 64 GB of RAM, and 2 Xeon E5-2620 processors at 2 GHz with 6 cores/12 threads each; and 12 clients, in the form of Dell PowerEdge R420 servers running ESXi 5.0, with two 10 gigabit ports each, 64 GB of RAM, and 2 Xeon E5-2470 processors at 2.3 GHz with 8 cores/16 threads each.

Server   Read IOPS   Write IOPS
Strata   40287       9960
KNFS     23377       5796

Table 4.1: Random IO Performance on Strata versus KNFS
We configured the deployment to maintain two replicas of every stored object, without striping (since it unnecessarily complicates placement comparisons and has little benefit for symmetric workloads). Garbage collection is active, and the deployment is in its standard configuration with a disk tier enabled, but the workloads have been configured to fit entirely within flash, as the effects of cache misses to magnetic media are not relevant to this thesis.

4.7.2 Baseline Performance

To provide some performance context for our architecture versus a typical NFS implementation, we compare two minimal deployments of NFS over flash. We set Strata to serve a single flash card, with no replication or striping, and mounted it loopback. We ran a fio [108] workload with a 4K IO size 80/20 read-write mix at a queue depth of 128 against a fully allocated file. We then formatted the flash card with ext4, exported it with the linux kernel NFS server, and ran the same test. The results are in Table 4.1. As the table shows, we offer good NFS performance at the level of individual devices. In the following section we proceed to evaluate scalability.

4.7.3 Scalability

In this section we evaluate how well performance scales as we add NADs to the cluster. We begin each test by deploying 96 VMs (8 per client) into a cluster of 2 NADs. We choose this number of VMs because ESXi limits the queue depth for a VM to 32 outstanding requests, but we do not see maximum performance until a queue depth of 128 per flash card. The VMs are each configured to run the same fio workload for a given test. In the following set of figures, each color band represents the IO of a single NFS connection from an ESXi client, of which there are two per host (one for each NIC).

Figure 4.4: IOPS over Time, Read-Only Workload
In Figure 4.4, fio generates 4K random reads to focus on IOPS (input/output operations per second) scalability. In Figure 4.5, fio generates an 80/20 mix of reads and writes at 128K block size in a Pareto distribution such that 80% of requests go to 20% of the data. This is meant to be more representative of real VM workloads, but with enough offered load to completely saturate the cluster. To simplify presentation, we have fixed the block size for the tests in this section. Therefore, bandwidth graphs would be identical to IOPS graphs aside from the absolute numbers on the Y axis, and so we have omitted them.

Figure 4.5: IOPS Over Time, 80/20 Read/Write Workload

As the tests run, we periodically add NADs, two at a time, up to a maximum of twelve². When each pair of NADs comes online, a rebalancing process automatically begins to move data across the cluster so that the amount of data on each NAD is balanced. This is visible as the variable performance between two steady state performance steps in this set of graphs. When it completes, we run in a steady state for two minutes and then add the next pair. In both figures, the periods where rebalancing is in progress are reflected by a temporary drop in performance (as the rebalance process competes with client workloads for resources), followed by a rapid increase in overall performance when the new nodes are marked available, triggering the switch to load-balance clients to them.
A cluster of 12 NADs achieves over 1 million IOPS in the IOPS test, and 10 NADs achieve 70,000 IOPS (representing more than 9 gigabytes/second of throughput) in the 80/20 test.

² Ten for the read/write test, due to an unfortunate test harness problem.

We also test the effect of placement and load balancing on overall performance. If the location of a workload source is unpredictable (as in a VM data center with virtual machine migration enabled), we need to be able to migrate clients quickly in response to load. However, if the configuration is more static or can be predicted in advance, we may benefit from attempting to place clients and data together to reduce the network overhead incurred by remote IO requests. As discussed in Section 4.5.2, the load-balancing and data migration features of Strata make both approaches possible. Figure 4.4 is the result of an aggressive local placement policy, in which data is placed on the same NAD as its clients, and both are moved as the number of devices changes. This achieves the best possible performance at the cost of considerable data movement. In contrast, Figure 4.6 shows the performance of an otherwise identical test configuration when data is placed randomly (while still satisfying fault tolerance and even distribution constraints), rather than being moved according to client requests. The Pareto workload (Figure 4.5) is also configured with the default random placement policy, which is the main reason that it does not scale linearly: as the number of nodes increases, so does the probability that a request will need to be forwarded to a remote NAD.

Figure 4.6: IOPS Over Time, Read-Only Workload with Random Placement

4.7.4 Node Failure

As a counterpoint to the scalability tests run in the previous section, we also tested the behaviour of the cluster when a node is lost.
We configured a 10 NAD cluster with 10 clients hosting 4 VMs each, running the 80/20 Pareto workload described earlier. Figure 4.7 shows the behaviour of the system during this experiment. After the VMs had been running for a short time, we powered off one of the NADs by IPMI (Intelligent Platform Management Interface), waited 60 seconds, then powered it back on. During the node outage, the system continued to run uninterrupted but with lower throughput. When the node came back up, it spent some time resynchronizing its objects to restore full replication to the system, and then rejoined the cluster. The client load balancer shifted clients onto it and throughput was restored (within the variance resulting from the client load balancer's placement decisions).

Figure 4.7: Aggregate Bandwidth for 80/20 Clients during Failover and Recovery

CPU         IOPS          Freq (Cores)   Price
E5-2620     127K          2 GHz (6)      $406
E5-2640     153K (+20%)   2.5 GHz (6)    $885
E5-2650v2   188K (+48%)   2.6 GHz (8)    $1166
E5-2660v2   183K (+44%)   2.2 GHz (10)   $1389

Table 4.2: Achieved IOPS on an 80/20 Random 4K Workload across 2 MicroArrays

4.7.5 Protocol Overhead

The benchmarks up to this point have all been run inside VMs whose storage is provided by a virtual disk that Strata exports by NFS to ESXi. This configuration requires no changes on the part of the clients to scale across a cluster, but does impose overheads. To quantify these overheads we wrote a custom fio engine that is capable of performing IO directly against our native dispatch interface (that is, the API by which our NFS protocol gateway interacts with the NADs). We then compared the performance of a single VM running a random 4k read fio workload (for maximum possible IOPS) against a VMDK (virtual machine disk) exported by NFS to the same workload run against our native dispatch engine.
In this experiment, the VMDK-based configuration produced an average of 50240 IOPS, whereas direct access achieved 54060 IOPS, an improvement of roughly 8%.

4.7.6 Effect of CPU on Performance

A workload running at full throttle with small requests completely saturates the CPU. This remains true despite significant development effort in performance debugging, and a great many improvements to minimize data movement and contention. In this section we report the performance improvements resulting from faster CPUs. These results are from random 4K NFS requests in an 80/20 read-write mix at 128 queue depth over four 10Gb links to a cluster of two NADs, each equipped with 2 physical CPUs.

Table 4.2 shows the results of these tests. In short, it is possible to "buy" additional storage performance under full load by upgrading the CPUs into a more "balanced" configuration. The wins are significant and carry a non-trivial increase in the system cost. As a result of this experimentation, we elected to use a higher performance CPU in the shipping version of the product.

4.8 Related Work

Strata applies principles from prior work in server virtualization, both in the form of hypervisor [10, 94] and lib-OS [34] architectures, to solve the problem of sharing and scaling access to fast non-volatile memories among a heterogeneous set of clients. Our contributions build upon the efforts of existing research in several areas.

Recently, researchers have begun to investigate a broad range of system performance problems posed by storage class memory in single servers [7], including current PCIe flash devices [89], next generation PCM [4], and byte addressability [25]. Moneta [17] proposed solutions to an extensive set of performance bottlenecks over the PCIe bus interface to storage, and others have investigated improving the performance of storage class memory through polling [98], and avoiding system call overheads altogether [18].
We draw from this body of work to optimize the performance of our dispatch library, and use this baseline to deliver a high performance scale-out network storage service. In many cases, we would benefit further from these efforts; for example, our implementation could be optimized to offload per-object access control checks, as in Moneta-D [18]. There is also a body of work on efficiently using flash as a caching layer for slower, cheaper storage in the context of large file hosting. For example, S-CAVE [54] optimizes cache utilization on flash for multiple virtual machines on a single VMware host by running as a hypervisor module. This work is largely complementary to ours; we support using flash as a caching layer and would benefit from more effective cache management strategies.

Prior research into scale-out storage systems, such as FAWN [6] and Corfu [8], has considered the impact of a range of NV memory devices on cluster storage performance. However, to date these systems have been designed around lightweight processors paired with simple flash devices. It is not clear that this balance is the correct one, as evidenced by the tendency to evaluate these same designs on significantly more powerful hardware platforms than those on which they are intended to operate [8]. Strata is explicitly designed for dense virtualized server clusters backed by performance-dense PCIe-based nonvolatile memory. In addition, like older commodity disk-oriented systems including Petal [53, 86] and FAB [77], prior storage systems have tended to focus on building aggregation features at the lowest level of their designs, and then adding a single presentation layer on top. Strata, in contrast, isolates and shares each powerful PCIe-based storage class memory device as its underlying primitive. This has allowed us to present a scalable runtime environment in which multiple protocols can coexist as peers without sacrificing the raw performance that today's high performance memory can provide.
Many scale-out storage systems, including NV-Heaps [23], Ceph/RADOS [93], and even pNFS [42], are unable to support the legacy formats in enterprise environments. Our agnosticism to any particular protocol is similar to the approach used by Ursa Minor [1], which also boasted a versatile client library protocol to share access to a cluster of magnetic disks. Strata does not attempt to provide storage for datacenter-scale environments, unlike systems including Azure [14], FDS [66], or Bigtable [21]. Storage systems in this space differ significantly in their intended workload, as they emphasize high-throughput linear operations. Strata’s managed network would also need to be extended to support datacenter-sized scale-out. We also differ from in-RAM approaches such as RAMCloud [69] and memcached [36], which offer a different class of durability guarantee and cost.

4.9 Conclusion

Storage system design faces a sea change resulting from the dramatic increase in the performance density of its component media. Distributed storage systems composed of even a small number of network-attached flash devices are now capable of matching the offered load of traditional systems that would have required multiple racks of spinning disks.

Strata is an enterprise storage architecture that responds to the performance characteristics of PCIe storage devices. Using building blocks of well-balanced flash, compute, and network resources, and then pairing the design with the integration of SDN-based Ethernet switches, Strata provides an incrementally deployable, dynamically scalable storage system.

Strata’s initial design is specifically targeted at enterprise deployments of VMware ESX, which is one of the dominant drivers of new storage deployments in enterprise environments today. The system achieves high performance and scalability for this specific NFS environment while allowing applications to interact directly with virtualized, network-attached flash hardware over new protocols.
This is achieved by cleanly partitioning our storage implementation into an underlying, low-overhead virtualization layer and a scalable framework for implementing storage protocols. Over the next year, we intend to extend the system to provide general-purpose NFS support by layering a scalable and distributed metadata service and small object support above the base layer of coarse-grained storage primitives.

Chapter 5

Conclusion

In this thesis we used virtualization to allow individual virtual devices to transparently make use of multiple physical devices. In this way, the virtual hardware can be freed of some of the physical limitations of the devices beneath it, such as vulnerability to failure or resource exhaustion. We have developed this thesis through three successive projects that explore different ways to aggregate physical hardware and different virtualization models.

Each project described in this thesis has been rewarding, not just as a development of the theme of the thesis, but also as a standalone contribution. As the oldest project, Remus has had time to be explored by other researchers, as well as industrial development. Since its publication, Remus has been influential both academically and in industry. It has been cited more than 500 times according to Google Scholar, and I have it on good authority that it influenced VMware to convert from deterministic replay to checkpoints in its own commercial product. I think that one of the reasons it was successful is that it was able to create a very simple abstraction for what was previously a very subtle and complex feature. I also spent time getting the project merged into the upstream Xen project, and this has made it very fruitful for research teams at other institutions, not just because they have easy access to the code, but also because it is maintained and because upstream publication required it to be polished beyond the level required of most academic code.
Finally, it has been personally gratifying to see Remus included in the syllabuses of several graduate courses.

SecondSite demonstrates both how far we can go in spanning physical devices within virtual hardware, and also some of its limitations. The technical improvements to state replication were, I believe, quite successful, but the final system is complex to operate, due primarily to the way it interacts with internet routing through BGP. The problem of making virtual machines location-independent over the wide area is still an area of active research [80]. I think the essential difficulty is that network addresses are still too tightly coupled to their physical providers: the BGP autonomous system model of address assignment ties IP addresses to infrastructure. None of the various ways of multihoming seem capable of enough dynamism to allow a network address to be truly location-independent.

Virtualization is usually taken to mean whole-machine virtualization, but we have also demonstrated that the same principles work at finer granularity, by building aggregate virtual disks as well as virtual machines. In Strata we transferred the virtualization approach used for processors to storage devices in order to make better use of their rapidly increasing power, and we found that in addition to making more efficient use of hardware, virtualization was a simple, effective way to build fault tolerance, load balancing, and resource scaling.

5.1 Future Work

The projects described in this thesis demonstrate some of the benefits of aggregating devices through virtualization, and they also suggest directions for further research. As one possibility, start with the observation that the relationship between virtualization and hardware is bidirectional. When hardware first becomes powerful enough to warrant virtualization, software approaches develop that are largely dictated by the interface provided by the hardware.
But because the hardware was not designed for it, there are inevitably parts of its interface that are difficult or expensive to virtualize. Eventually, after virtualization has demonstrated itself to be prevalent enough to be worth explicitly supporting, the hardware adapts, providing new or modified interfaces specifically designed for the virtualization layer. This happened with CPUs, with the development of virtualization extensions such as hardware-assisted paging. Over time, hardware support has grown from narrow changes to support particularly painful or performance-sensitive spots in an ABI (application binary interface) into a much more general design. For example, PCI devices developed a standard called Single Root I/O Virtualization (SR-IOV), which divides a device into a privileged interface that can configure it, and multiple virtual devices that are isolated from each other. This resembles a hypervisor model applied to individual devices.

If, as this thesis argues, it is indeed valuable for virtual devices to be decoupled from physical devices, then it may become worthwhile for hardware to adapt to support this goal. It is interesting to consider what kinds of hardware changes might be helpful. As one possibility, perhaps hardware could make device state easier to move between devices. Basic virtualization entails encapsulating the state of a device, so that it can be saved and restored as different virtual machines become active. This is often hardware-assisted, but when it is, the state is often opaque, encapsulated by the hardware and thus tied to a specific device. If the hardware could not only preserve virtual state but also allow it to be exported and imported, then replicating or relocating that state could perhaps be done more simply and efficiently, allowing it to support a wider variety of applications and devices.

Another direction I would like to explore is rack-scale computing.
Much of this work has been concerned with managing increasingly powerful individual devices within the framework of a classical server consisting of some static allocation of physical resources like RAM, CPU, and network and storage devices. In this model, accessing a device remotely is much slower and more expensive than accessing it locally. Although we have worked very hard to overcome this performance gap in each of the projects above, it does limit how seamlessly we can combine physical devices into virtual ones. Recently, hardware improvements have made it possible to construct rack-sized systems in which dozens or hundreds of processors, network devices, and storage devices communicate with each other over a shared high-speed backplane. These devices are still vulnerable to failure and resource exhaustion, but the physical cost of device aggregation is largely eliminated. This may make it possible to develop more powerful virtual devices than those we have already built, as well as making them faster and more transparent.

Bibliography

[1] ABD-EL-MALEK, M., COURTRIGHT, II, W. V., CRANOR, C., GANGER, G. R., HENDRICKS, J., KLOSTERMAN, A. J., MESNIER, M., PRASAD, M., SALMON, B., SAMBASIVAN, R. R., SINNAMOHIDEEN, S., STRUNK, J. D., THERESKA, E., WACHS, M., AND WYLIE, J. J. Ursa minor: Versatile cluster-based storage. In Proceedings of the 4th USENIX Conference on File and Storage Technologies (Berkeley, CA, USA, 2005), FAST’05, USENIX Association, pp. 5–5. → pages 101
[2] AGUILERA, M. K., CHEN, W., AND TOUEG, S. Heartbeat: A timeout-free failure detector for quiescent reliable communication. Tech. rep., Ithaca, NY, USA, 1997. → pages 51
[3] AGUILERA, M. K., CHEN, W., AND TOUEG, S. Using the heartbeat failure detector for quiescent reliable communication and consensus in partitionable networks. Theor. Comput. Sci. 220 (June 1999), 3–30. → pages 51
[4] AKEL, A., CAULFIELD, A. M., MOLLOV, T. I., GUPTA, R. K., AND SWANSON, S. Onyx: a protoype phase change memory storage array.
In Proceedings of the 3rd USENIX conference on Hot topics in storage and file systems (Berkeley, CA, USA, 2011), HotStorage’11, USENIX Association, pp. 2–2. → pages 100
[5] ALTEKAR, G., AND STOICA, I. ODR: output-deterministic replay for multicore debugging. In SOSP ’09: Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles (New York, NY, USA, 2009), ACM, pp. 193–206. → pages 70
[6] ANDERSEN, D. G., FRANKLIN, J., KAMINSKY, M., PHANISHAYEE, A., TAN, L., AND VASUDEVAN, V. Fawn: a fast array of wimpy nodes. In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles (2009), SOSP ’09, pp. 1–14. → pages 101
[7] BAILEY, K., CEZE, L., GRIBBLE, S. D., AND LEVY, H. M. Operating system implications of fast, cheap, non-volatile memory. In Proceedings of the 13th USENIX conference on Hot topics in operating systems (Berkeley, CA, USA, 2011), HotOS’13, USENIX Association, pp. 2–2. → pages 100
[8] BALAKRISHNAN, M., MALKHI, D., PRABHAKARAN, V., WOBBER, T., WEI, M., AND DAVIS, J. D. Corfu: a shared log design for flash clusters. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation (2012), NSDI’12. → pages 101
[9] BARAK, A., AND WHEELER, R. Mobility. ACM Press/Addison-Wesley Publishing Co., New York, NY, USA, 1999, ch. MOSIX: An Integrated Multiprocessor UNIX, pp. 41–53. → pages 37
[10] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In SOSP ’03: Proceedings of the nineteenth ACM symposium on Operating systems principles (New York, NY, USA, 2003), ACM Press, pp. 164–177. → pages 13, 45, 70, 100
[11] BRADFORD, R., KOTSOVINOS, E., FELDMANN, A., AND SCHIÖBERG, H. Live wide-area migration of virtual machines including local persistent state. In VEE ’07: Proceedings of the 3rd international conference on Virtual execution environments (New York, NY, USA, 2007), ACM Press, pp. 169–179. → pages 36, 71
[12] BRESSOUD, T.
C., AND SCHNEIDER, F. B. Hypervisor-based fault-tolerance. In Proceedings of the Fifteenth ACM Symposium on Operating System Principles (December 1995), pp. 1–11. → pages 10, 12, 36, 70
[13] BROOKS, C. Heroku learns the hard way from amazon ec2 outage. http://searchcloudcomputing.techtarget.com/news/1378426/Heroku-learns-from-Amazon-EC2-outage. Visited November 2016. → pages 42
[14] CALDER, B., WANG, J., OGUS, A., NILAKANTAN, N., SKJOLSVOLD, A., MCKELVIE, S., XU, Y., SRIVASTAV, S., WU, J., SIMITCI, H., HARIDAS, J., UDDARAJU, C., KHATRI, H., EDWARDS, A., BEDEKAR, V., MAINALI, S., ABBASI, R., AGARWAL, A., HAQ, M. F. U., HAQ, M. I. U., BHARDWAJ, D., DAYANAND, S., ADUSUMILLI, A., MCNETT, M., SANKARAN, S., MANIVANNAN, K., AND RIGAS, L. Windows azure storage: a highly available cloud storage service with strong consistency. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (2011), SOSP ’11, pp. 143–157. → pages 101
[15] CASADO, M., FREEDMAN, M. J., PETTIT, J., LUO, J., MCKEOWN, N., AND SHENKER, S. Ethane: Taking control of the enterprise. In SIGCOMM Computer Comm. Rev. (2007). → pages 85
[16] CASADO, M., GARFINKEL, T., AKELLA, A., FREEDMAN, M. J., BONEH, D., MCKEOWN, N., AND SHENKER, S. Sane: a protection architecture for enterprise networks. In Proceedings of the 15th conference on USENIX Security Symposium - Volume 15 (Berkeley, CA, USA, 2006), USENIX-SS’06, USENIX Association. → pages 85
[17] CAULFIELD, A. M., DE, A., COBURN, J., MOLLOW, T. I., GUPTA, R. K., AND SWANSON, S. Moneta: A high-performance storage array architecture for next-generation, non-volatile memories. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (2010), MICRO ’43, pp. 385–395. → pages 100
[18] CAULFIELD, A. M., MOLLOV, T. I., EISNER, L. A., DE, A., COBURN, J., AND SWANSON, S. Providing safe, user space access to fast, solid state disks.
In Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems (2012), ASPLOS XVII, pp. 387–400. → pages 100
[19] CHANDRA, S., AND CHEN, P. M. The impact of recovery mechanisms on the likelihood of saving corrupted state. In ISSRE ’02: Proceedings of the 13th International Symposium on Software Reliability Engineering (ISSRE’02) (Washington, DC, USA, 2002), IEEE Computer Society, p. 91. → pages 15
[20] CHANDRA, T. D., AND TOUEG, S. Unreliable failure detectors for reliable distributed systems. J. ACM 43 (March 1996), 225–267. → pages 51
[21] CHANG, F., DEAN, J., GHEMAWAT, S., HSIEH, W. C., WALLACH, D. A., BURROWS, M., CHANDRA, T., FIKES, A., AND GRUBER, R. E. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst. 26, 2 (June 2008), 4:1–4:26. → pages 101
[22] CLARK, C., FRASER, K., HAND, S., HANSEN, J. G., JUL, E., LIMPACH, C., PRATT, I., AND WARFIELD, A. Live migration of virtual machines. In Proceedings of the 2nd conference on Symposium on Networked Systems Design & Implementation (Berkeley, CA, USA, 2005), USENIX Association. → pages 10, 16, 18, 36, 46, 70
[23] COBURN, J., CAULFIELD, A. M., AKEL, A., GRUPP, L. M., GUPTA, R. K., JHALA, R., AND SWANSON, S. Nv-heaps: making persistent objects fast and safe with next-generation, non-volatile memories. In Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems (New York, NY, USA, 2011), ASPLOS XVI, ACM, pp. 105–118. → pages 101
[24] COCKROFT, A., HICKS, C., AND ORZELL, G. Lessons Netflix Learned from the AWS Outage. http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html, April 2011. Visited November 2016. → pages 72
[25] CONDIT, J., NIGHTINGALE, E. B., FROST, C., IPEK, E., LEE, B., BURGER, D., AND COETZEE, D. Better i/o through byte-addressable, persistent memory.
In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles (New York, NY, USA, 2009), SOSP ’09, ACM, pp. 133–146. → pages 100
[26] CULLY, B., LEFEBVRE, G., MEYER, D., FEELEY, M., HUTCHINSON, N., AND WARFIELD, A. Remus: high availability via asynchronous virtual machine replication. In NSDI’08: Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation (Berkeley, CA, USA, 2008), USENIX Association, pp. 161–174. → pages iii, 44, 45, 69, 70
[27] CULLY, B., AND WARFIELD, A. Secondsite: disaster protection for the common server. In HOTDEP’06: Proceedings of the 2nd conference on Hot Topics in System Dependability (Berkeley, CA, USA, 2006), USENIX Association. → pages 39
[28] CULLY, B., WIRES, J., MEYER, D., JAMIESON, K., FRASER, K., DEEGAN, T., STODDEN, D., LEFEBVRE, G., FERSTAY, D., AND WARFIELD, A. Strata: Scalable high-performance storage on virtualized non-volatile memory. In Proceedings of the 12th USENIX Conference on File and Storage Technologies (Berkeley, CA, USA, 2014), FAST’14, USENIX Association, pp. 17–31. → pages iii
[29] DOLEV, D., DWORK, C., AND STOCKMEYER, L. On the minimal synchronism needed for distributed consensus. J. ACM 34 (January 1987), 77–97. → pages 51
[30] DUNLAP, G. Execution Replay for Intrusion Analysis. PhD thesis, University of Michigan, 2006. → pages 12, 37
[31] DUNLAP, G. W., KING, S. T., CINAR, S., BASRAI, M. A., AND CHEN, P. M. Revirt: Enabling intrusion analysis through virtual-machine logging and replay. In Proceedings of the 5th Symposium on Operating Systems Design & Implementation (OSDI 2002) (2002). → pages 37, 70
[32] DUNLAP, G. W., LUCCHETTI, D. G., FETTERMAN, M. A., AND CHEN, P. M. Execution replay of multiprocessor virtual machines. In VEE ’08: Proceedings of the fourth ACM SIGPLAN/SIGOPS international conference on Virtual execution environments (New York, NY, USA, 2008), ACM, pp. 121–130. → pages 70
[33] DWORK, C., LYNCH, N., AND STOCKMEYER, L.
Consensus in the presence of partial synchrony. J. ACM 35 (April 1988), 288–323. → pages 51
[34] ENGLER, D. R., KAASHOEK, M. F., AND O’TOOLE, JR., J. Exokernel: an operating system architecture for application-level resource management. In Proceedings of the fifteenth ACM symposium on Operating systems principles (1995), SOSP ’95, pp. 251–266. → pages 100
[35] FETZER, C., RAYNAL, M., AND TRONEL, F. An adaptive failure detection protocol. In Proceedings of the 2001 Pacific Rim International Symposium on Dependable Computing (Washington, DC, USA, 2001), PRDC ’01, IEEE Computer Society, pp. 146–. → pages 51
[36] FITZPATRICK, B. Distributed caching with memcached. Linux J. 2004, 124 (Aug. 2004), 5–. → pages 101
[37] GANGULY, A., AGRAWAL, A., BOYKIN, P., AND FIGUEIREDO, R. WOW: Self-Organizing Wide Area Overlay Networks of Virtual Workstations. In International Symposium on High-Performance Distributed Computing (2006), pp. 30–42. → pages 71
[38] GIBSON, G. A., AMIRI, K., AND NAGLE, D. F. A case for network-attached secure disks. Tech. Rep. CMU-CS-96-142, Carnegie-Mellon University, Pittsburgh, PA, 1996. → pages 76
[39] GIFFORD, D. K. Weighted voting for replicated data. In Proceedings of the seventh ACM symposium on Operating systems principles (New York, NY, USA, 1979), SOSP ’79, ACM, pp. 150–162. → pages 52
[40] GUPTA, D., YOCUM, K., MCNETT, M., SNOEREN, A. C., VAHDAT, A., AND VOELKER, G. M. To infinity and beyond: time warped network emulation. In SOSP ’05: Proceedings of the twentieth ACM symposium on Operating systems principles (2005). → pages 33
[41] HARNEY, E., GOASGUEN, S., MARTIN, J., MURPHY, M., AND WESTALL, M. The efficacy of live virtual machine migrations over the internet. In Proceedings of the 2nd international workshop on Virtualization technology in distributed computing (New York, NY, USA, 2007), VTDC ’07, ACM, pp. 8:1–8:7. → pages 71
[42] HILDEBRAND, D., AND HONEYMAN, P. Exporting storage systems in a scalable manner with pnfs.
In Proceedings of the 22nd IEEE/13th NASA Goddard Conference on Mass Storage Systems and Technologies (MSST) (2005). → pages 101
[43] HIROFUCHI, T., NAKADA, H., OGAWA, H., ITOH, S., AND SEKIGUCHI, S. A live storage migration mechanism over wan and its performance evaluation. In Proceedings of the 3rd international workshop on Virtualization technologies in distributed computing (New York, NY, USA, 2009), VTDC ’09, ACM, pp. 67–74. → pages 71
[44] HOWARD, J. H., KAZAR, M. L., MENEES, S. G., NICHOLS, D. A., SATYANARAYANAN, M., SIDEBOTHAM, R. N., AND WEST, M. J. Scale and performance in a distributed file system. ACM Transactions on Computer Systems 6, 1 (1988), 51–81. → pages 38
[45] HP. NonStop Computing. http://h20223.www2.hp.com/non-stopcomputing/cache/76385-0-0-0-121.aspx. Visited August 2007. → pages 9
[46] HUTCHINSON, N. C., AND PETERSON, L. L. The x-kernel: An architecture for implementing network protocols. IEEE Trans. Softw. Eng. 17, 1 (Jan. 1991), 64–76. → pages 80
[47] JIANG, X., AND XU, D. VIOLIN: Virtual Internetworking on Overlay Infrastructure. In ISPA (2004), pp. 937–946. → pages 72
[48] KARGER, D., LEHMAN, E., LEIGHTON, T., PANIGRAHY, R., LEVINE, M., AND LEWIN, D. Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In Proceedings of the twenty-ninth annual ACM symposium on Theory of computing (1997), STOC ’97, pp. 654–663. → pages 83
[49] KING, S. T., DUNLAP, G. W., AND CHEN, P. M. Debugging operating systems with time-traveling virtual machines. In ATEC’05: Proceedings of the USENIX Annual Technical Conference 2005 (Berkeley, CA, USA, 2005), USENIX Association. → pages 37
[50] KOHLER, E., MORRIS, R., CHEN, B., JANNOTTI, J., AND KAASHOEK, M. F. The click modular router. ACM Trans. Comput. Syst. 18, 3 (Aug. 2000), 263–297. → pages 80
[51] LABOVITZ, C., AHUJA, A., BOSE, A., AND JAHANIAN, F. Delayed internet routing convergence. In Proc. ACM SIGCOMM (2000), pp.
175–187. → pages 56
[52] LEE, D., WESTER, B., VEERARAGHAVAN, K., NARAYANASAMY, S., CHEN, P. M., AND FLINN, J. Respec: efficient online multiprocessor replay via speculation and external determinism. In ASPLOS ’10: Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems (New York, NY, USA, 2010), ACM, pp. 77–90. → pages 70
[53] LEE, E. K., AND THEKKATH, C. A. Petal: distributed virtual disks. In Proceedings of the seventh international conference on Architectural support for programming languages and operating systems (1996), ASPLOS VII, pp. 84–92. → pages 101
[54] LUO, T., MA, S., LEE, R., ZHANG, X., LIU, D., AND ZHOU, L. S-cave: Effective ssd caching to improve virtual machine storage performance. In Parallel Architectures and Compilation Techniques (2013), PACT ’13, pp. 103–112. → pages 100
[55] Lvm2 resource page. https://www.sourceware.org/lvm2/. Visited November 2016. → pages 38, 80
[56] MARQUES, D., BRONEVETSKY, G., FERNANDES, R., PINGALI, K., AND STODGHILL, P. Optimizing checkpoint sizes in the c3 system. In 19th International Parallel and Distributed Processing Symposium (IPDPS 2005) (April 2005). → pages 10
[57] MCHARDY, P. Linux imq. http://www.linuximq.net/. Visited August 2007. → pages 22
[58] MEYER, D., AGGARWAL, G., CULLY, B., LEFEBVRE, G., HUTCHINSON, N., FEELEY, M., AND WARFIELD, A. Parallax: Virtual disks for virtual machines. In EuroSys ’08: Proceedings of the ACM SIGOPS/EuroSys European Conference on Computer Systems 2008 (New York, NY, USA, 2008), ACM. → pages 40
[59] MEYER, D. T., CULLY, B., WIRES, J., HUTCHINSON, N. C., AND WARFIELD, A. Block mason. In Proceedings of the First conference on I/O virtualization (2008), WIOV’08. → pages 80
[60] MILLER, R. Car crash triggers amazon power outage. http://www.datacenterknowledge.com/archives/2010/05/13/car-crash-triggers-amazon-power-outage/. Visited May 2010. → pages 42
[61] MINHAS, U.
F., RAJAGOPALAN, S., CULLY, B., ABOULNAGA, A., SALEM, K., AND WARFIELD, A. Remusdb: Transparent high availability for database systems. PVLDB 4, 11 (2011), 738–748. → pages 48, 50
[62] MOSBERGER, D., AND PETERSON, L. L. Making paths explicit in the scout operating system. In Proceedings of the second USENIX symposium on Operating systems design and implementation (1996), OSDI ’96, pp. 153–167. → pages 80
[63] MULLENDER, S. J., VAN ROSSUM, G., TANENBAUM, A. S., VAN RENESSE, R., AND VAN STAVEREN, H. Amoeba: A distributed operating system for the 1990s. Computer 23, 5 (1990), 44–53. → pages 37
[64] netem. http://linux-net.osdl.org/index.php/Netem. Visited August 2007. → pages 31
[65] NIGHTINGALE, E. B., CHEN, P. M., AND FLINN, J. Speculative execution in a distributed file system. In SOSP ’05: Proceedings of the twentieth ACM symposium on Operating systems principles (New York, NY, USA, 2005), ACM Press, pp. 191–205. → pages 38
[66] NIGHTINGALE, E. B., ELSON, J., FAN, J., HOFMANN, O., HOWELL, J., AND SUZUE, Y. Flat datacenter storage. In Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2012), OSDI’12, USENIX Association, pp. 1–15. → pages 101
[67] NIGHTINGALE, E. B., VEERARAGHAVAN, K., CHEN, P. M., AND FLINN, J. Rethink the sync. In USENIX’06: Proceedings of the 7th conference on USENIX Symposium on Operating Systems Design and Implementation (Berkeley, CA, USA, 2006), USENIX Association. → pages 38, 46
[68] OSMAN, S., SUBHRAVETI, D., SU, G., AND NIEH, J. The design and implementation of zap: a system for migrating computing environments. SIGOPS Oper. Syst. Rev. 36, SI (2002), 361–376. → pages 37
[69] OUSTERHOUT, J., AGRAWAL, P., ERICKSON, D., KOZYRAKIS, C., LEVERICH, J., MAZIÈRES, D., MITRA, S., NARAYANAN, A., ONGARO, D., PARULKAR, G., ROSENBLUM, M., RUMBLE, S. M., STRATMANN, E., AND STUTSMAN, R. The case for ramcloud. Commun. ACM 54, 7 (July 2011), 121–130. → pages 101
[70] OUSTERHOUT, J. K., CHERENSON, A.
R., DOUGLIS, F., NELSON, M. N., AND WELCH, B. B. The sprite network operating system. Computer 21, 2 (1988), 23–36. → pages 37
[71] PATTERSON, R. H., MANLEY, S., FEDERWISCH, M., HITZ, D., KLEIMAN, S., AND OWARA, S. SnapMirror: File-System-Based Asynchronous Mirroring for Disaster Recovery. In FAST ’02: Proceedings of the 1st USENIX Conference on File and Storage Technologies (Berkeley, CA, USA, 2002), USENIX Association, p. 9. → pages 71
[72] PENG, G. Distributed checkpointing. Master’s thesis, University of British Columbia, 2007. → pages 39
[73] RAJAGOPALAN, S., CULLY, B., O’CONNOR, R., AND WARFIELD, A. Secondsite: Disaster tolerance as a service. In Proceedings of the 8th ACM SIGPLAN/SIGOPS Conference on Virtual Execution Environments (New York, NY, USA, 2012), VEE ’12, ACM, pp. 97–108. → pages iii
[74] RASHID, R. F., AND ROBERTSON, G. G. Accent: A communication oriented network operating system kernel. In SOSP ’81: Proceedings of the eighth ACM symposium on Operating systems principles (New York, NY, USA, 1981), ACM Press, pp. 64–75. → pages 37
[75] REISNER, P., AND ELLENBERG, L. Drbd v8 – replicated storage with shared disk semantics. In Proceedings of the 12th International Linux System Technology Conference (October 2005). → pages 38, 60
[76] RUSSELL, R. Netfilter. http://www.netfilter.org/. Visited November 2016. → pages 22
[77] SAITO, Y., FRØLUND, S., VEITCH, A., MERCHANT, A., AND SPENCE, S. Fab: building distributed enterprise disk arrays from commodity components. In Proceedings of the 11th international conference on Architectural support for programming languages and operating systems (New York, NY, USA, 2004), ASPLOS XI, ACM, pp. 48–58. → pages 101
[78] SCALES, D. J., NELSON, M., AND VENKITACHALAM, G. The design and evaluation of a practical system for fault-tolerant virtual machines. Tech. Rep. VMWare-RT-2010-001, VMWare, Inc., Palo Alto, CA 94304, May 2010. → pages 70
[79] SCHINDLER, J., AND GANGER, G. Automated disk drive characterization. Tech. Rep.
CMU-CS-99-176, Carnegie Mellon University, December 1999. → pages 24
[80] SHEN, Z., JIA, Q., SELA, G.-E., RAINERO, B., SONG, W., VAN RENESSE, R., AND WEATHERSPOON, H. Follow the sun through the clouds: Application migration for geographically shifting workloads. In Proceedings of the Seventh ACM Symposium on Cloud Computing (New York, NY, USA, 2016), SoCC ’16, ACM, pp. 141–154. → pages 104
[81] Specweb2005. http://www.spec.org/web2005/. Visited November 2016. → pages 63
[82] STELLNER, G. CoCheck: Checkpointing and Process Migration for MPI. In Proceedings of the 10th International Parallel Processing Symposium (IPPS ’96) (Honolulu, Hawaii, 1996). → pages 38
[83] STROM, R., AND YEMINI, S. Optimistic recovery in distributed systems. ACM Trans. Comput. Syst. 3, 3 (1985). → pages 46, 50
[84] SVÄRD, P., HUDZIA, B., TORDSSON, J., AND ELMROTH, E. Evaluation of delta compression techniques for efficient live migration of large virtual machines. In Proceedings of the 7th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments (New York, NY, USA, 2011), VEE ’11, ACM, pp. 111–120. → pages 50, 71
[85] SYMANTEC CORPORATION. Veritas Cluster Server for VMware ESX. http://eval.symantec.com/mktginfo/products/Datasheets/High Availability/vcs22vmware datasheet.pdf, 2006. Visited November 2016. → pages 16
[86] THEKKATH, C. A., MANN, T., AND LEE, E. K. Frangipani: a scalable distributed file system. In Proceedings of the sixteenth ACM symposium on Operating systems principles (1997), SOSP ’97, pp. 224–237. → pages 101
[87] TRAVOSTINO, F., DASPIT, P., GOMMANS, L., JOG, C., DE LAAT, C., MAMBRETTI, J., MONGA, I., VAN OUDENAARDE, B., RAGHUNATH, S., AND WANG, P. Y. Seamless live migration of virtual machines over the MAN/WAN. Future Gener. Comput. Syst. 22 (October 2006), 901–907. → pages 72
[88] VAN RENESSE, R., MINSKY, Y., AND HAYDEN, M. A gossip-style failure detection service.
In Proceedings of the IFIP International Conference on Distributed Systems Platforms and Open Distributed Processing (London, UK, 1998), Middleware ’98, Springer-Verlag, pp. 55–70. → pages 51
[89] VASUDEVAN, V., KAMINSKY, M., AND ANDERSEN, D. G. Using vector interfaces to deliver millions of iops from a networked key-value storage server. In Proceedings of the Third ACM Symposium on Cloud Computing (New York, NY, USA, 2012), SoCC ’12, ACM, pp. 8:1–8:13. → pages 100
[90] VMWARE, INC. Vmware high availability (ha). http://www.vmware.com/products/vi/vc/ha.html, 2007. Visited August 2007. → pages 16
[91] WARFIELD, A. Virtual Devices for Virtual Machines. PhD thesis, University of Cambridge, 2006. → pages 24
[92] WARFIELD, A., ROSS, R., FRASER, K., LIMPACH, C., AND HAND, S. Parallax: managing storage for a million machines. In HOTOS’05: Proceedings of the 10th conference on Hot Topics in Operating Systems (Berkeley, CA, USA, 2005), USENIX Association. → pages 38
[93] WEIL, S. A., WANG, F., XIN, Q., BRANDT, S. A., MILLER, E. L., LONG, D. D. E., AND MALTZAHN, C. Ceph: A scalable object-based storage system. Tech. rep., 2006. → pages 101
[94] WHITAKER, A., SHAW, M., AND GRIBBLE, S. D. Denali: A scalable isolation kernel. In Proceedings of the Tenth ACM SIGOPS European Workshop (2002). → pages 100
[95] WOOD, T., LAGAR-CAVILLA, H. A., RAMAKRISHNAN, K. K., SHENOY, P., AND VAN DER MERWE, J. Pipecloud: using causality to overcome speed-of-light delays in cloud-based disaster recovery. In Proceedings of the 2nd ACM Symposium on Cloud Computing (New York, NY, USA, 2011), SOCC ’11, ACM, pp. 17:1–17:13. → pages 71
[96] WOOD, T., RAMAKRISHNAN, K. K., SHENOY, P., AND VAN DER MERWE, J. CloudNet: dynamic pooling of cloud resources by live WAN migration of virtual machines. In Proceedings of the 7th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments (New York, NY, USA, 2011), VEE ’11, ACM, pp. 121–132. → pages 50, 59, 71
[97] XU, M., BODIK, R., AND HILL, M. D.
A ”flight data recorder” for enabling full-system multiprocessor deterministic replay. In ISCA ’03: Proceedings of the 30th annual international symposium on Computer architecture (New York, NY, USA, 2003), ACM Press, pp. 122–135. → pages 37, 70
[98] YANG, J., MINTURN, D. B., AND HADY, F. When poll is better than interrupt. In Proceedings of the 10th USENIX conference on File and Storage Technologies (Berkeley, CA, USA, 2012), FAST’12, USENIX Association, pp. 3–3. → pages 100
[99] YANG, Q., XIAO, W., AND REN, J. Trap-array: A disk array architecture providing timely recovery to any point-in-time. In ISCA ’06: Proceedings of the 33rd annual international symposium on Computer Architecture (Washington, DC, USA, 2006), IEEE Computer Society, pp. 289–301. → pages 33
[100] Amazon EC2 Spot Instances. http://aws.amazon.com/ec2/spot-instances/. Visited November 2016. → pages 43
[101] Dell DVD Store Database Test Suite. http://www.delltechcenter.com/page/DVD+Store. Visited November 2016. → pages 63
[102] Google app engine. http://code.google.com/appengine/. Visited November 2016. → pages 53
[103] Marathon Technologies: everRun DR. http://www.marathontechnologies.com/. Visited November 2016. → pages 70
[104] Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region. http://aws.amazon.com/message/65648/. Visited November 2016. → pages 43
[105] VMware KB: Configuring Split-Brain Avoidance in a WAN. http://kb.vmware.com/kb/1008606. Visited November 2016. → pages 52
[106] Xen Blktap2 Driver. http://wiki.xensource.com/xenwiki/blktap2. Visited February 2012. → pages 59
[107] Xentop. http://linux.die.net/man/1/xentop. Visited November 2016. → pages 67
[108] Flexible io tester. http://git.kernel.dk/?p=fio.git;a=summary. Visited November 2016. → pages 94
[109] Linux device mapper resource page. http://sourceware.org/dm/. Visited November 2016.
→ pages 80
[110] Seagate kinetic open storage documentation. https://developers.seagate.com/display/KV/Kinetic Open Storage Documentation Wiki. Visited November 2016. → pages 76, 84
[111] Scsi object-based storage device commands - 2. http://www.incits.org/scopes/1729.htm, 2011. Visited February 2014. → pages 84

