Open Collections

UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Generalized high availability via virtual machine replication Cully, Brendan 2007

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.

Item Metadata


831-ubc_2007-0373.pdf [ 4.53MB ]
JSON: 831-1.0052067.json
JSON-LD: 831-1.0052067-ld.json
RDF/XML (Pretty): 831-1.0052067-rdf.xml
RDF/JSON: 831-1.0052067-rdf.json
Turtle: 831-1.0052067-turtle.txt
N-Triples: 831-1.0052067-rdf-ntriples.txt
Original Record: 831-1.0052067-source.json
Full Text

Full Text

Generalized High Availability via Virtual Machine Replication by Brendan Cully B . S c , New York University, 2001 A THESIS S U B M I T T E D IN P A R T I A L F U L F I L M E N T O F T H E R E Q U I R E M E N T S F O R T H E D E G R E E O F Master of Science in The Faculty of Graduate Studies (Computer Science) The University of British Columbia October, 2007 © Brendan Cully 2007 11 Abstract Allowing applications to survive hardware failure is a expensive undertaking, which generally involves re-engineering software to include complicated recov-ery logic as well as deploying special-purpose hardware; this represents a severe barrier to improving the dependability of large or legacy applications. We de-scribe the construction of a general and transparent high-availability service that allows existing, unmodified software to be protected from the failure of the physical machine on which it runs. Remus provides an extremely high degree of fault tolerance, to the point that a running system can transparently continue execution on an alternate physical host in the face of failure with only seconds of downtime, completely preserving host state such as active network connections. We describe our approach, which encapsulates protected software in a virtual machine, asynchronously propagates V M state to a backup host at frequencies as high as forty times a second, and uses speculative execution to concurrently run the active V M slightly ahead of the replicated system state. i i i Contents A b s t r a c t i i Conten ts • i i i L i s t o f F igures v Acknowledgemen t s y i 1 I n t r o d u c t i o n 1 1.1 Goals 3 1.2 Approach 4 1.2.1 Vir tual machine replication 5 1.2.2 Speculative execution 6 1.2.3 Asynchronous replication 7 2 D e s i g n 8 2.1 Failure model 11 2.2 Pipelined checkpoints 12 2.3 Master control program 13 2.4 Memory and C P U 13 2.4.1 Migration enhancements 14 2.4.2 Checkpoint support 18 2.4.3 Asynchronous transmission 19 2.4.4 Guest modifications 19 2.5 Network buffering 20 Contents iv 2.6 Disk buffering 2 2 2.7 Detecting failure 2 6 3 E v a l u a t i o n 2 ^ 3.1 Test environment 2 8 3.2 Failover performance 29 3.3 Benchmarks 29 3.3.1 Kernel compilation 2 9 3.3.2 SPECweb2005 31 3.3.3 Postmark 33 3.4 Potential optimizations 36 3.4.1 Deadline scheduling 36 3.4.2 Page compression 37 3.4.3 Copy-on-write checkpoints 41 4 R e l a t e d W o r k 4 2 4.1 Vir tual machine migration 42 4.2 Vir tual machine logging and replay 43 4.3 Operating system replication 44 4.4 Library approaches 44 4.5 Replicated storage • 44 4.6 Speculative execution 45 5 Fu tu r e W o r k 46 6 C o n c l u s i o n 49 B i b l i o g r a p h y 51 V List of Figures 1.1 Speculative execution and asynchronous replication 5 2.1 High-level architecture 10 2.2 Normal suspend path 17 2.3 Accelerated suspend protocol 18 2.4 Network buffering 21 2.5 Disk write buffering 23 3.1 Kernel build times at various checkpoint frequencies 30 3.2 SPECweb performance at various checkpoint frequencies 32 3.3 The effect of network latency on SPECweb performance 34 3.4 The effect of disk replication on Postmark performance 35 3.5 SPECweb2005 checkpoint latency versus pages dirtied per epoch. 38 3.6 Checkpoint latency versus dirty rate 39 3.7 Effectiveness of page compression 40 vi Acknowledgements I am deeply indebted to my supervisors, Mike Feeley and Andrew Warfield. Their wisdom, humour and encouragement were an inspiration, as was the seem-ingly boundless generosity with which they offered their time. I would also like to extend my sincere thanks to Norm Hutchinson for his enthusiasm and sup-port, and for being my second reader. Thanks as well to the distributed systems group, and in particular Geoffrey Lefebvre, Dutch Meyer and K a n Cai . Chapter 1 i Introduct ion Highly available systems are the purview of the very rich and the very scared. However, the desire for reliability is pervasive, even among system designers with modest resources. Unfortunately, high availability is hard — it requires that systems be con-structed with redundant components and that they be able to both detect com-ponent failure and seamlessly activate its backup. Commercial high availability systems that aim to protect modern servers generally use specialized hardware, customized software, or both (e.g., [13]). In each case, the technology required in order to be able to transparently survive failure is complex and expensive enough that it prohibits deployment on common servers. To better support the increasing use of network-based applications and systems, it is desirable to deliver the level of availability traditionally associated only with higher-end installations to those based on commodity hardware and software. This thesis describes Remus, an entirely software-based system that pro-vides high availability on commodity hardware without operating system or application-specific knowledge. Its approach capitalizes on the demonstrated ability of virtualization to migrate active, running virtual machines between physical hosts [7], extending that technique to replicate snapshots of an entire running system at very high frequencies — as often as once every 25ms — be-tween a pair of physical machines. Using this technique, our system discretizes the execution of a virtual machine into a series of replicated step-wise snapshots. Externally-visible events, in particular transmitted outbound network packets, are not released until the system state responsible for their generation has been Chapter 1. Introduction 2 replicated. The contribution of this thesis is a practical one. Whole-system replication is a well-known approach to providing high availability. However, it is com-monly considered to be significantly more expensive than checkpointing tech-niques that use application-specific knowledge in order to replicate only relevant data [16]. Unfortunately, such approaches require careful coding for every pro-tected application, making them unsuitable for rapid development or for the use of off-the-shelf applications. Attempts have been made to implement somewhat more general protocols in library code, but application developers must be care-ful to observe the constraints of the framework they choose. When they don't, replication may silently fail or worse, the replica may become inconsistent. Op-erating system-level techniques such as process migration solve some of these problems, but they are only possible when the operating system source code is available. They are difficult to implement and must be rigorously analyzed for correctness with each revision of the underlying operating system. For these reasons, we have revisited machine-level replication. In this thesis, we demon-strate a system constructed on commodity hardware which uses whole-system replication to transparently survive host failures without requiring modifications to application or operating system code. This approach may be used to bring high availability "to the masses" as a service offered to virtual machines by the virtualization platform. Virtualization makes it possible to create a copy of a running machine, but it does not guarantee that the process will be efficient. Synchronously propagat-ing state at every change is impractical: it effectively reduces the throughput of local state to that of the network device performing replication. Therefore, rather than running two hosts in lock-step [4] we allow a single host to execute speculatively and then checkpoint and replicate its state asynchronously. Exter-nally visible state is not made visible until a checkpoint has been committed on a backup host — we achieve high-speed replicated performance by effectively running the system tens of milliseconds in the past. In spite of the hardware and software constraints under which we have delib-Chapter 1. Introduction 3 erately designed this system, it is able to provide protection equal to or better than significantly more expensive commercial offerings. Many existing systems only actively mirror persistent storage, requiring applications to perform re-covery from crash-consistent persistent state. In contrast, Remus ensures that regardless of the moment at which the primary fails, no externally-visible state is ever lost. The remainder of the thesis presents the architecture of our system and dis-cusses the challenges that were involved in capturing high-frequency checkpoints of running V M s on commodity hardware. The resulting system is completely functional and allows high availability to be offered as a service by the virtual-ization platform, retrofitting dependability onto existing software. 1.1 Goals When designing Remus, we first considered the characteristics of an idealized high availability solution targeting mid- to low-end systems. We believe that such a system must focus on the following guiding principles: G e n e r a l i t y - Performing customization of hardware, operating systems, libraries and applications is costly and difficult. For resource-constrained or-ganizations, it can be prohibitively expensive. Therefore, any modification re-quired to common off-the-shelf and open-source software and hardware should be minimized, or if possible, eliminated entirely. Transparency - The natural granularity at which to provide redundancy is at the level of whole-machine state. Operating at finer granularity is (from a purely functional perspective) both unnecessary and complicated, because it forces designers to determine what aspects of whole-system state (e.g., file descriptors) must additionally be restored in the event of an error, and to both capture and recreate this state wherever it may reside in the host. This requires intimate cooperation between the provider of redundancy and the operator of the system to be protected. It would be preferable to be able to separate these two entities into separate administrative domains, making high availability as Chapter 1. Introduction 4 simple to add to an existing server as, for example, a new hard drive. Reasonable performance - The system should not consume excessive resources, and should run on commodity equipment. While it is unrealistic to expect a system to provide high availability without some overhead, the additional resources required to provide performance equivalent to unprotected operation should be no more than some small factor of those of the protected system. In short, the system should be deployable for real workloads. M u l t i p r o c e s s o r suppor t - Concurrent multi-processors are now the norm for data centers. Therefore approaches in which overheads increase in proportion to the degree of multiprocessing should be considered detrimental to the success of any practical high availability solution. Seamless failure recovery - No externally visible state should ever be lost in the case of single-host failure. Furthermore, failure recovery should proceed rapidly enough that it appears as nothing more than temporary packet loss from the perspective of external users. Established T C P connections should not be lost or reset. 1.2 Approach Remus is able to achieve the above goals by replicating a host in a primary-backup fashion. We employ three major techniques in order to overcome the difficulties traditionally associated with this approach. First, we base our system on a virtualized infrastructure to facilitate whole-system replication. Second, we increase system performance through speculative execution, which decouples external output from synchronization points. Finally, asynchronous replication enables the primary server to remain productive, even as synchronization with the replicated server is occurring. The basic stages of operation in Remus are given in Figure 1.1. Chapter 1. Introduction 5 Primary Host 1 Checkpoint 3 Sync 2 Transmit 4 Release Completed Execution 1 Speculative Execution Backup Host i State Buffer 3 Client's View , 4 r Committed State Figure 1.1: Speculative execution and asynchronous replication. 1.2.1 Virtual machine replication Hypervisors have been considered as a mechanism for building high availability systems in the past [4]. In this previous work, virtualization has been used to run a pair of systems in lock-step, ensuring that virtual machines on a pair of physical hosts follow an identical, deterministic path of execution. Machine state changes very rapidly; attempting to synchronously propagate every change to local memory and CPU state between hosts would be wildly impractical. To reduce the required bandwidth between the primary and the backup server to manageable levels, some systems choose to replicate only the input events them-selves, and then to replay them on the backup host. This design leverages the natural bandwidth and latency symmetry between input and output channels. Unfortunately, it is not easy to log events with enough precision that they will produce repeatable output when replayed - it is much easier to virtualize a system than to make it execute deterministically. Interrupt jitter may change the order in which processes are scheduled. Multiple processors may access shared memory in different order due to clock skew. It is a difficult task (and one which must be performed again at almost every hardware revision) even Chapter 1. Introduction 6 to log the exact point in an execution stream at which a given external input is delivered; it is harder still to reproduce that exact moment during replay. There have been some impressive attempts to do this [10, 14], but they do not replay in real time. Very few have attempted the considerably harder problem of replaying an S M P system [9], and none that we know of have achieved it without overheads that often make multiprocessor performance worse than uniprocessor. Systems based upon event replay also require that the backup virtual ma-chine be continuously active: all computation performed by the primary host must be executed again on the backup in order to bring it to the same state. Replicating the state itself avoids this requirement, allowing the backup to oper-ate as a passive state recorder which consumes few resources. This permits the construction of n-to-1 redundant systems, allowing the system designer a high degree of freedom to balance resource expenditure against failure protection. 1.2.2 Speculative execution Due to the difficulties involved in performing deterministic machine replay discussed in the previous section, particularly in the case of multi-processor systems, Remus has been designed to tolerate non-deterministic computation. There is a very real possibility that the output produced by a system after a given checkpoint will be different if the system is rolled back to that check-point and its input is replayed. Therefore, Remus requires that the backup be synchronized with the primary at all times, from the perspective of an external viewer. This means that the state of the replica must be synchronized with that of the primary only when the output of the primary has become externally vis-ible. Instead of letting the normal output stream dictate when synchronization must occur, we can buffer the output until a more convenient time, performing computation speculatively ahead of synchronization points. This allows a favor-able trade-off to be made between output latency and runtime overhead, the degree of which may be controlled by the administrator. Remus is particularly concerned with network and disk output. Other de-Chapter 1. Introduction 7 vices, such as the console or the serial port, are presumed to be used for local administration and therefore would not require buffering. However, nothing intrinsically prevents these devices from being buffered as well. 1.2.3 Asynchronous replication Buffering output at the primary server allows replication to be performed asyn-chronously. The primary host can resume execution at the instant that a con-sistent checkpoint of its machine state has been captured, without waiting for acknowledgment from the remote end. Any output associated with the check-point is simply queued until it has committed. Overlapping normal execution with the replication process yields substantial performance benefits, and per-mits reasonably efficient operation even when checkpointing at intervals on the order of tens of milliseconds. Chapter 2 D e s i g n Figure 2.1 shows a high-level view of our system. We begin by encapsulating the machine to be protected within a virtual machine. Our implementation is based on the Xen virtual machine monitor [2], and extends Xen's support for live migration to provide fine grained checkpoints. A n initial subset of our checkpointing support has been accepted into the upstream Xen source. Remus achieves high availability by propagating frequent checkpoints of an active virtual machine to a backup physical host. On the backup, the virtual machine image is resident in memory and may begin execution immediately if failure of the active system is detected. Because the backup is only periodically consistent with the primary, all network output must be buffered until state is synchronized on the backup. When a complete, consistent image of the host has been received, this buffer is released to external clients. The checkpoint, buffer, and release cycle can be performed very frequently. We have tested it against heavy workloads at frequencies as high as forty times a second, representing a whole-machine checkpoint of network and on-disk state every 25 milliseconds. Unlike transmitted network traffic, disk state is not externally visible. It must, however, be propagated to the remote host as part of a complete and consistent snapshot. In order to achieve replication of block data, all writes to the primary disk are transmitted asynchronously to the backup, where they are buffered in R A M until the corresponding memory checkpoint has arrived. At that point, the complete checkpoint is acknowledged to the primary, which then releases outbound network traffic, and the disk writes are applied from the buffer to the backup disk. Chapter 2. Design 9 It is worth emphasizing that the virtual machine does not actually execute on the backup host until a failure occurs. It simply acts as a receptacle for check-points of the active virtual machine. As a consequence, only a small amount of resources are used on the backup host, allowing it to concurrently protect virtual machines running on multiple physical hosts in an n-to-l-style redun-dant configuration. This provides administrators with the flexibility to balance redundancy and resource costs when provisioning high availability. (Other) Active Hosts Protec ted V M Repl ica t ion Eng ine Heartbeat Memory External Devices V M M external network " 0 = Active Host Repl ica t ion Server Heartbeat Memory Storage Backup Host Backup V M V M M Figure 2.1: High-level architecture. Chapter 2: Design 11 2.1 Failure model This system provides the following properties: 1. The fail-stop failure of any single host is tolerable. 2. Should both the primary and backup hosts fail concurrently, the protected system's data will be left in a crash-consistent state. 3. No state generated in speculation will be made externally visible until it has been acknowledged as committed to the backup. Remus aims to provide completely transparent recovery from fail-stop fail-ures of a single physical host. The compelling aspect of this system is that high-availability may be easily retrofitted onto existing software running on commodity hardware. The system uses a pair of commodity host machines, connected over redundant gigabit Ethernet connections, and survives the fail-ure of any one of these components. It does not require the use of expensive, shared, network-attached storage for disk images, because it includes block de-vice requests in the state replication stream. We do not aim to recover from software errors or non-fail-stop conditions. As observed in [5], the methodology we have chosen provides complete system state capture and replication, and as such wil l propagate application errors to the backup! This is a necessary consequence of providing both transparency and generality. Our failure model is identical to that of commercial high availability prod-ucts, which provide protection for virtual machines today [31, 32]. However, the degree of protection offered by these products is substantially less than that provided by Remus: existing commercial products respond to the failure of a physical host by simply rebooting the virtual machine on another host from its crash-consistent disk state. Our approach survives failure on time frames similar to those of live migration, and leaves the virtual machine running and network connections intact. Exposed transient state is not lost and file system consistency is preserved. Chapter 2. Design 12 2.2 Pipelined checkpoints Checkpointing a running virtual machine at rates of up to forty times a second places extreme demands on the checkpoint system. Remus addresses this by aggressively pipelining the checkpoint operation. We use an epoch-based system in which execution of the active virtual machine is bounded by brief pauses in execution during which dirty pages are written to a local buffer. Concurrently with the execution of the virtual machine, this buffer is replicated to the backup host, which acknowledges a checkpoint when the buffer has been received in its entirety. Finally, the backup may apply new memory and C P U state to the local virtual machine image, and write out any disk blocks associated with committed checkpoints. Referring back to Figure 1.1, this procedure can be divided into four stages: (1) Once per epoch, pause the running V M and copy any changed state into a buffer. This process is effectively the stop-and-copy stage of live migration [7], but as described later in this section it has been dramatically optimized for high-frequency checkpoints. W i t h state changes preserved in a buffer, the virtual machine is unpaused and speculative execution resumes. (2) Buffered state is transmitted and stored in memory on the backup host. (3) Once the complete set of state has been received, the checkpoint is acknowledged to the primary. Finally, (4) buffered network output is released. The result of this approach is that execution is effectively discretized at checkpoint boundaries; the acknowledgment of a completed checkpoint by the backup triggers the release of network traffic that has been buffered and repre-sents an atomic transition into the new epoch. If failure of the primary node is detected, execution resumes on the backup from the point of the most recently committed checkpoint. Likewise, if the backup fails, then the primary node disables replication and reverts to unpro-tected execution. The following sections describe the replication control process, and then describe in more detail the nuances of managing the memory, network and disk subsystems. Chapter 2. Design 13 2.3 Master control program Virtual machine failure protection is initiated by a command-line tool called protect, which first synchronizes initial disk state between the primary and the backup, and then starts the primary virtual machine, engaging the network and disk buffering subsystems described in the following sections. After opening a control channel to the disk buffer, it spawns a customized version of the live migration process, and then enters a loop in which it requests checkpoints at an interval requested by the operator. When the migration process receives a checkpoint request, it suspends the guest domain and then notifies protect, which then issues a checkpoint message to the disk buffer as described in Section 2.6. When the disk buffer has completed the checkpoint it notifies the control process, which then sends a message to the network buffer to release packets queued from the now-completed epoch. This cycle is repeated until the domain terminates or a failure event is detected. 2.4 Memory and C P U Checkpointing is implemented above Xen's existing machinery for performing live migration [7]. Live migration is a technique by which a virtual machine is relocated to another physical host with only slight interruption in service. To do this, memory is copied to the new location while the virtual machine con-tinues to run at the original location. During migration, writes to memory are intercepted, and dirtied pages are copied to the new location in rounds. After a specified number of rounds, or when no forward progress is being made (because the virtual machine is writing to memory at least as fast as the migration pro-cess can copy it out), the guest is suspended and the remaining dirty memory is copied out along with the current C P U state. At this point the image on the new location is activated. Total downtime depends on the amount of memory remaining to be copied when the guest is suspended, but is typically under 100 milliseconds. Total migration time is a function of the amount of memory in Chapter 2: Design 14 use by the guest and the size of its writable working set [7]. During live migration, writes to virtual memory are tracked by the hyper-visor, using a mechanism called shadow page tables. When this mode is acti-vated, the virtual machine monitor maintains a private ("shadow") version of the guest's page tables and exposes these to the hardware memory management unit. Hardware page protection is used to trap guest access to its internal ver-sion of page tables. This allows the hypervisor to track page table updates, which it propagates to the shadowed versions as appropriate before returning control to the guest. To provide migration, this technique is extended to transparently (to the guest) mark all virtual machine memory as read only. The hypervisor is then able to trap all writes that a virtual machine makes to memory in order to maintain a map of pages that have been dirtied since the previous round. Each round, the migration process atomically reads and resets this dirty map, and the iterative migration process involves chasing dirty pages until progress can no longer be made. As mentioned above, the live migration process eventually suspends execution of the virtual machine and enters a final "stop-and-copy" round, during which any remaining pages are transmitted before execution is resumed on the destination host. Remus implements checkpointing as repeated executions of the final stage of live migration: at the end of each epoch, the guest is paused and updated memory and C P U state are copied to a buffer. The guest then resumes execution on the current host, rather than on the destination. Several modifications to the migration process are required in order to provide sufficient performance and to ensure that a consistent image is always available at the remote location. These are described below. 2.4.1 Migration enhancements In live migration, guest memory is iteratively copied over a number of rounds and may consume minutes of execution time; the brief service interruption Chapter 2. Design 15 caused by the singular stop-and-copy phase is not a significant component of total migration overhead. This is not the case when capturing frequent virtual machine checkpoints: every checkpoint is just the final stop-and-copy phase of migration, and so this represents a critical point of optimization in reducing checkpoint overheads. A n examination of Xen's checkpoint code revealed that the majority of the time spent while the guest is in the suspended state is lost to scheduling, largely due to inefficiencies in the implementation of the xenstore daemon that provides administrative communication between guest virtual ma-chines and domain 0. For the final round of memory copying, the guest must be suspended to create a consistent image. Paravirtual guests perform several steps in addition to pausing themselves before the final round 1 , including disconnecting devices and the connection to the Xenstore registry, and preparing a suspend record for use during restore, including information about the location of pages shared with Xen and the state of the guest's V C P U s . Thus suspending the guest is a cooperative process in which the migration process requests that the guest suspend itself, then awaits notification that it has done so. This procedure, depicted in Figure 2.2, is surprisingly complicated. 1. The migration process sends a message to xend via stdout, asking it to suspend the guest. 2. Xend writes an entry into the "control/shutdown" node in the guest's portion of xenstore, on which the guest has registered a watch. 3. Xenstore fires the watch in the guest via an event channel. 4. The guest reads the contents of the watch and then calls its suspend routine. 1 Paravirtual Xen guests contain code specifically for suspend requests that are responsible for cleaning up Xen-related state such as shared memory mappings used by virtual devices. In the case of hardware-virtualized (e.g., Windows) VMs, this state is completely encapsulated by Xen's device model, and these in-guest changes are unnecessary. Chapter 2. Design 16 5. At the end of the suspend function, the guest issues a suspend hypercall, which will not return control to the guest until domO issues a hypercall requesting it. 6. After the monitor has paused the domain, it sends a notification to xen-store that a domain has changed state. 7. Xenstore iterates through each active domain until it finds one that has changed state. It then writes this information into a xenstore node on which xend has registered a watch, and fires the watch. 8. Xend reads the watch, discovers that the domain has been suspended, and notifies the migration process via stdout. At this point it may also perform other suspend-related activities, such as invoking external scripts to checkpoint the state of attached disks or other devices. This process requires cooperation from several independent processes. Some of these processes (particularly xenstore, but also xend) can introduce large amounts of delay, and the signalling between them can cause significant jitter due to scheduling artifacts in domain 0. Although these delays are not a serious problem for low-frequency, heavyweight operations like domain save, they form a significant portion of the downtime during live migration, and dwarf most of the other costs of frequent checkpointing. This convoluted process can take a nearly arbitrary amount of time - we frequently measured latencies in the range of 30 to 40 milliseconds, but saw delays as long as 500 milliseconds in some cases. Remus's optimized suspend code streamlines this process by creating an event channel in the guest specifically for receiving suspend requests, which the migration process can invoke directly. Additionally, a new hypercall is provided to allow processes to register an event channel for callbacks notifying them of the completion of virtual machine suspension. The streamlined suspend process is depicted in Figure 2.3. In concert, these two notification mechanisms reduce the time required to suspend a virtual machine to about one hundred microseconds Chapter 2. Design 17 domain 0 Figure 2.2: Normal suspend path. Chapter 2. Design 18 domain 0 Figure 2.3: Accelerated suspend protocol. - an improvement of two orders of magnitude over the previous implementation. 2.4.2 Checkpoint support Providing checkpoint support in Xen required two primary changes to the ex-isting suspend-to-disk and live migration code. First, support was added for resuming execution of a domain after it had been suspended; Xen previously did not allow "live checkpoints" and instead destroyed the virtual machine in the process of writing its state out. Second, the suspend program was converted from a one-shot procedure into a daemon process. This allows checkpoint rounds after the first to copy only newly-dirty memory. Supporting resumption requires two basic changes. The first is a new hyper-call to mark the domain as schedulable again (Xen removes suspended domains from scheduling consideration, because the only action performed on suspended domains previously was to destroy them after their state had been replicated). A similar operation is necessary in order to re-arm watches in xenstore. Chapter 2. Design 19 2.4.3 Asynchronous transmission One of the guiding principles of our design is that checkpointing should be as asynchronous as possible. At the frequency with which we perform checkpoints, any pause in the operation of the guest, however slight, will have an impact on its throughput 2. Therefore, in order to allow the guest to resume operation as quickly as possible, the migration process was modified to copy touched pages to a staging buffer rather than delivering them directly to the network while the domain is paused. This results in a significant throughput increase: the time required for the kernel build benchmark discussed in Section 3.3.1 was reduced by approximately 10% at 20 checkpoints per second. While buffering improves performance by reducing the amount of time that a machine must be paused, it also highlights a limitation of our current imple-mentation. Although the guest can continue to run while the buffer is being propagated, subsequent checkpoints cannot begin until the current checkpoint has completed. A larger buffer will naturally take longer to transmit; an exces-sively large buffer may delay the next checkpoint beyond the requested check-point frequency. The optimal size of the buffer is approximately BW/(CF + T), where BW is the bandwidth of the replication channel, CF is the checkpoint frequency, and T is the time required to copy out state to the buffer. For exam-ple, when checkpointing every 40ms over a 1 Gbps link, and given a conservative copy-out time of 10ms, the checkpoint process should use a buffer no larger than 50Mb or 6 M B . 2.4.4 Guest modifications As discussed above, paravirtual guests in Xen contain a suspend handler that cleans up device state upon receipt of a suspend request. In addition to the notification optimizations described earlier in this section, the suspend request 2 T h i s is, in fact, the only part of the replication process which is t ruly speed-critical. Because the remaining stages of replication are performed asynchronously, latencies of a few milliseconds are tolerable. For this reason our current implementation performs reasonably well in spite of the fact that it is wri t ten in Py thon . Chapter 2. Design 20 handler has also been modified to reduce the amount of work done prior to suspension. In the original code, suspension entailed disconnecting all devices and unplugging all but one C P U . This is unnecessary work when the domain is simply being checkpointed, and so it was removed. These modifications were applied to Xen for use in the case that a migration attempt or a save operation fails, and they have been available since version 3.1.0. These changes are not strictly required for correctness, but they do improve the performance of the checkpoint noticeably, and involve very local modifica-tions to the guest kernel. Total changes were under 100 lines of code in the paravirtual suspend handler. As mentioned earlier, these modifications are not necessary in the case of non-paravirtualized virtual machines. 2.5 Network buffering Most networks cannot be counted on for reliable data delivery. Therefore, net-worked applications either accept packet loss, duplication and reordering, or they use a high-level protocol such as T C P which provides stronger service guarantees. This fact simplifies the network buffering problem considerably: transmitted packets do not require replication: their loss will appear as a tran-sient network failure and not affect the correctness of the' protected system. However, it is crucial that packets queued for transmission be held until the checkpointed state of the epoch in which they were generated is committed to the backup; if the primary fails, these generated packets reflect speculative state that is discarded, and so they must not become visible in this circumstance. Network buffering is implemented as a linux queueing discipline, which per-forms buffering on the network interface representing the virtual machine in domain 0. Its operation is depicted in Figure 2.4. A l l outbound packets are queued until an external commit message arrives via the linux netlink protocol. When the commit message is received, the tail of the queue is marked, and pack-ets are released to the network. When the previously-marked tail packet has been delivered, the device returns to buffering mode. This protocol allows the Chapter 2. Design 21 C l i e n t VM > Figure 2.4: Network buffering. flush to proceed asynchronously, with the guest permitted to use the interface while the commit proceeds. There are two minor wrinkles in this implementation. The first is that in linux, queueing disciplines only operate on outgoing traffic. Under Xen, the network interface consists of two separate devices, one in the guest and one in domain 0. Outbound traffic from the guest appears as inbound traffic on the bridge device in domain 0. Therefore in order to queue the traffic, we must convert the inbound traffic to outbound by routing it through a special device called an intermediate queueing device [17]. This module is designed to work at the IP layer via iptables [28], but it was not difficult to extend it to work at the bridging layer used to provide virtual machine network access in our implementation. The second wrinkle is due to Xen's implementation of network packet deliv-ery between virtual machines. For performance, the memory used by outbound networking traffic is not copied between guest domains and domain 0, but rather shared through grant table mappings managed by Xen. The hypervisor only Chapter 2. Design 22 permits a small number of pages to be shared at any one time. If messages are in transit between a guest and domain 0 for only a brief time, this limitation is not noticeable. However, the network output buffer can result in messages being in flight for a significant amount of time, with the result that the guest network device becomes blocked after a small amount of traffic has been sent. Therefore when messages are queued, the driver first copies them into local memory and then releases the local mappings to shared data. 2.6 Disk buffering Disks present a rather different challenge than do networks in providing consis-tent state replication. While Remus is designed to recover from a single host failure, it must preserve the consistency semantics that the guest expects of persistent storage even if both hosts should fail. Moreover, the goal of providing a general-purpose system precluded the use of expensive mirrored storage hard-ware designed for high-availability applications. Instead, Remus maintains a complete mirror of the active virtual machine's disks on the backup host. Prior to engaging the protection system, the current state of the disk on the primary is mirrored to the backup host. Once protection has been engaged, writes to persistent storage are tracked and checkpointed similarly to updates to memory. Figure 2.5 gives a high-level overview of the disk replication mechanism Much like memory updates, writes to disk from the speculative virtual ma-chine are treated as write-through: they are immediately applied to the local disk image, and concurrently mirrored to an in-memory buffer on the backup. This approach provides two direct benefits: First, it ensures that the local disk image remains crash-consistent at all times; in the case of that both hosts fail, the local disk wil l reflect the crashed state of the active virtual machine at the time of failure. Second, writing directly to disk accurately represents the latency and throughput required to talk to the physical device. This obvious-seeming property is of considerable value; Remus first began with an in-memory buffer which encountered some difficulty accurately characterizing disk responsiveness. Chapter 2. Design 23 S e c o n d a r y H o s t C i f P r i m a r y H o s t 12 1 Disk writes issued directly to local disk 2 Simultaneously, writes sent to backup buffer 3 Writes released to disk after checkpoint Figure 2.5: Disk write buffering. Earlier implementations either buffered writes, under-representing the time re-quired to commit data to disk and allowing the speculating virtual machine to race ahead of real persistent storage, or conservatively overestimated write latencies which resulted in a loss of performance. Modeling disk response time is notoriously challenging [29]; Remus avoids this by using the disk itself as a model. At the time the backup acknowledges a checkpoint as committed, disk up-dates reside completely in memory. The backup must not overwrite the on-disk state produced by any speculative execution, because this would prevent it from recovering the state of the disk at the time of the last committed checkpoint. Once a checkpoint commit has been received, the disk request buffer may be ap-plied to disk. In the event of a failure event at the primary host, Remus will wait until all buffered writes have been applied before resuming execution. Although the backup could begin execution immediately using the request buffer as an overlay on the physical disk, this would violate the disk semantics presented to Chapter 2-. Design 24 the protected virtual machine: A crash of the backup after such a resumption and before data had completed writing out to disk could potentially result in inconsistent on-disk state. Only one of the two disk mirrors managed by Remus is valid at any given time. In the case that both the primary and the backup hosts fail, this point becomes critical. This property is provided through the use of a persistent on-disk assertion, which indicates at all times which physical disk is active. After flushing buffered writes to disk and before resuming execution of the protected V M , Remus writes a timestamped activation record asserting that the local copy of the disk is current and consistent. When recovering from the failure of both the primary and the backup, this assertion may be used to identify which disk contains the valid, crash-consistent image. The disk buffer is implemented as a client of the block tap device [33, 34]. The Xen block tap device allows a userspace process to interpose on the block request/response stream between a virtual machine and its underlying device. Remus uses a custom tap module which interposes on disk write requests from the protected virtual machine and asynchronously mirrors them to the backup as they are written to the local disk. To provide checkpoint synchronization, this device opens two control FIFOs at startup, allowing the protection service to insert checkpoint records into the disk stream at the appropriate time, and to receive acknowledgment from the backup before releasing queued outbound network traffic. On the primary host, checkpoint requests are simply passed along to the replica, and acknowledgments are likewise passed from the replica to the notification F I F O . On the replica, the tapdisk module maintains state in two separate R A M disks. One is the speculation buffer, which holds write requests associated with speculative execution on the primary host. The other is the write-out buffer, which contains the state of the most recent checkpoint as it writes itself to disk. Whenever it receives a checkpoint message, any state currently in the speculation buffer is moved into the write-out buffer, superseding any data at Chapter 2. Design 25 the same sector addresses which may already be in the write-out buffer3, and a commit receipt is immediately reported to the primary: If the primary fails, it is sufficient to have the contents of the disk associated with the most recent checkpoint in R A M , as long as the replica is not activated until that state has been applied to disk. In the case of failure, the speculation buffer is discarded and execution at the backup is permitted once the write-out buffer has been entirely drained. The contents of this buffer are applied to disk according to the following algorithm: 1. Sort outstanding writes by sector address. 2. Merge contiguous sectors into single requests. 3. Submit as many requests as the underlying driver will accept. 4. Wait for a completion callback from the underlying driver. When the completion callback is received, repeat step 1. 5. When the final request has completed, the disk state is consistent. The sectors are sorted for two reasons: first, it is very likely to improve phys-ical locality for the underlying disk, whose performance is generally dominated by the number of seeks it must perform; second, this allows the checkpoint process to aggressively merge requests. Although the number of outstanding requests is bounded by the tapdisk interface, the size of each request is not. Merged requests provide a substantial performance benefit. It should be noted that the write ordering produced by this algorithm does not necessarily match that of the guest. It is possible that an interrupted write-out process will leave the disk in a non-crash-consistent state. The disk validity assertion described above protects the system as a whole from relying on such a disk. 3 In normal operation this will never happen — the protection service waits for the previous checkpoint commit to be acknowledged before issuing a new one. But it would be possible to receive a new checkpoint before the last one had committed, and the driver is designed with this in mind. Chapter 2. Design 26 2.7 Detecting failure Remus's focus is on demonstrating that advanced high availability may be pro-vided in a general and transparent way using commodity hardware and without modifying the applications that it protects. We currently use a simple failure detector that is directly integrated in the checkpointing stream: a timeout wait-ing for the backup to respond to commit requests will result in the primary assuming that the backup has crashed and disabling protection. Similarly, a timeout during transmission of checkpoint data from the primary will result in the backup assuming that the primary has crashed and resuming execution from the most recent checkpoint. The system is configured to use a pair of bonded network interfaces, and the two physical hosts are connected using a pair of Ethernet crossover cables (or independent switches) on the protection NICs. Should both of these network paths fail, Remus does not currently provide mechanism to fence execution. Traditional techniques for resolving partitioning (i.e. quorum protocols) are notoriously difficult to apply in two host configurations. It is the responsibility of the system administrator to ensure that external network connections share fate with the replication channel. \ Chapter 3 27 Evaluation Remus has been designed with the primary objective of making high availabil-ity sufficiently generic and transparent that it may be deployed on today's com-modity hardware. We believe that the system we have implemented meets these goals admirably. But to be practically useful, the overhead it entails must not be prohibitive. In the following section, we evaluate our implementation against three very different workloads in order to reveal its performance characteristics across a wide range of possible operating environments. Before evaluating the overhead imposed by our high availability service, we first measure the behaviour of the system when a single host fails. In case of failure at the primary, we find that the backup becomes fully active within approximately 1.2 seconds, without any loss of visible state - established T C P connections continue without interruption. We then evaluate the overhead of the system on application performance. We do this by measuring the performance of a protected system compared to an unprotected system for a few common workloads. In order to better un-derstand these results we present a set of microbenchmarks which pinpoint the mechanisms most affecting performance. We find that a general-purpose task such as kernel compilation performs at somewhat better than half native speed when checkpointed 20 times per second, while network-dependent workloads as represented by SPECweb perform at somewhat more than one quarter native speed. The additional overhead in this case is largely due to added network latency. Based on this analysis, we conclude that although Remus is efficient at state Chapter 3. . Evaluation 28 replication, it does introduce significant network latency, particularly for appli-cations that exhibit poor locality in memory writes. Thus, applications that are very sensitive to network latency may not be well-suited to this type of high availability service. However, it is worth mentioning that our implemen-tation currently lacks a number of optimizations which have the potential to significantly reduce network latency, some of which are discussed in more detail following the benchmark results. It should also be pointed out that SPECweb latency is especially high due to its sustained memory burn rate; bursty appli-cations wil l generally experience less latency. More general purpose workloads, as represented by the kernel compilation benchmark, can be protected with an overhead of roughly 100%. 3.1 Test environment Unless otherwise stated, all tests were run on I B M eServer x306 systems, con-sisting of a 3.2 GHz Pentium 4 processor with hyperthreading enabled, 1 G B of R A M , 3 Intel elOOO G b E network interfaces, and an 80 G B S A T A hard drive. The hypervisor is Xen 3.1.0 modified as described in Section 2.4, and the op-erating system for all virtual machines was linux 2.6.18 as distributed in Xen 3.1.0, with the modifications described in Section 2.4.4. Each domain is pro-vided one V C P U . Domain 0 is pinned to the first physical hyperthread, and the guest domain to the second. The first network interface connects the primary and secondary physical hosts via a private switch. It is used for the replication protocol. The second interface is publicly routed, and is used for administra-tive access. The third interface is used for application data. Vir tual machine networking is provide in bridged mode on the third interface. Vir tual disks are provided from disk image files on local S A T A drives, exported to the guests using the tapdisk AIO driver. Although we did not have S M P equipment available for testing mul t i -CPU checkpointing, the only additional overhead this should entail is a small addi-tional operation during guest suspension in order to quiesce other CPUs . This Chapter 3. Evaluation 29 overhead is trivial relative to that required to stop and copy the complete set of changed state per round, and therefore highly unlikely to have a visible effect on checkpointing performance. 3.2 Failover performance To evaluate failover performance, we logged into the primary host via SSH and began a kernel compilation. We simultaneously started a "ping" process to measure round-trip time from an external host to the protected server every 200ms. We then disconnected power from the primary host at arbitrary times during the build, and observed that (1) the SSH session to the primary host remained unbroken after power failure and recovery, and (2) the kernel build continued to successful completion. Using the ping test, we recorded the elapsed time during which the server was unavailable to be approximately 1.2 seconds. 3.3 Benchmarks In the following section, we evaluate the performance of our system using a variety of macrobenchmarks which are meant to be representative of a range of real-world workload mixtures. The primary workloads we examine are a kernel compilation test, the SPECweb2005 benchmark, and the Postmark disk bench-mark. Kernel compilation is a balanced workload which stresses the virtual memory system, the disk and the C P U . SPECweb primarily exercises network-ing performance and memory throughput. Postmark focuses exclusively on disk performance, which is not a significant component of the overhead of the other workloads. 3.3.1 Kernel compilation The kernel compile test measures the wall-clock time required to build linux kernel version 2.6.18 using the default configuration and the bzlmage target. Compilation uses G C C version 4.1.2, and make version 3.81. The measured Chapter 3. Evaluation 30 0 10 20 30 40 Checkpoints per second Figure 3.1: Kernel build times at various checkpoint frequencies. times are preceded by an untimed build followed by a make clean in order to reduce cold cache effects. This is a balanced workload that tests C P U , memory and disk performance. Figure 3.1 shows protection overhead when configured to checkpoint at rates of 10, 20, 30 and 40 times per second. Overhead scales linearly with checkpoint frequency. At a frequency of 20 times per second, total measured overhead is 84%. We believe this represents a realistic overhead for general-purpose systems. Chapter 3. Evaluation 31 3.3.2 SPECweb2005 The SPECweb benchmark is composed of at least three separate systems: a web server, an application server, and one or more web client simulators. We configure these as three V M s on distinct physical machines. The application server and the client are configured with 640 M B out of 1024 M B total available R A M . The web server and backup are provisioned with 2048 M B of R A M , of which 1024 is allocated to the web server V M , which is the system under test. Although this leaves domain 0 with more than 900 M B of R A M , the replication engine only requires 20-30 M B . The SPECweb scores we mention in this section are the highest results we achieved with the SPECweb e-commerce test maintaining 95% "good" and 99% "tolerable" times. Figure 3.2 shows SPECweb performance at various checkpoint frequencies relative to an unprotected server. These scores are primarily a function of the la-tency imposed by the network buffer between the server and the client. Although they are configured for a range of frequencies, SPECweb touches memory rapidly enough that the time required to propagate the memory dirtied between check-points sometimes exceeds 100ms, regardless of checkpoint frequency. Because the network buffer cannot be released until the checkpoint state is propagated, the effective network latency is significantly higher than the configured check-point frequency. The size of the checkpoint buffer constrains to some degree how far ahead of replication the domain may run, but it is only a coarse-grained throttle. Because the effective checkpoint frequency is lower than the configured frequency, and network latency dominates the SPECweb score, performance is relatively flat across the range of configured frequencies. A t configured rates of 10, 20, 30 and 40 checkpoints per second, the average checkpoint rates achieved were 9.98, 16.38, 20.25 and 23.34 respectively, for average latencies of 100ms, 61ms, 49ms and 43ms respectively. To confirm that SPECweb performance was due to network latency, we performed the tests again with network buffering disabled. In this case the overhead, primarily due to the stop-and-copy phase of checkpointing, was 44%. Chapter 3. Evaluation 32 190 - i 180 170 -160 -150 -140 -O 130 & 120 -g 110 -O 100 --g 90 -5 80 -a 70 60 H 50 -40 -30 -20 10 o H with netbuf no netbuf Checkpoints per second Figure 3.2: SPECweb performance at various checkpoint frequencies, with and without network output buffering. Native score: 305. Chapter 3. Evaluation 33 SPECweb is a RAM-hungry workload which is also very dependent on net-work latency. This makes it a poor fit for our current implementation, which trades network latency for memory throughput. Figure 3.3 demonstrates the dramatic effect of latency between the client V M and the web server. We used the linux netem [19] queueing discipline to add varying degrees of delay to the outbound link from the web server (virtualized with 1024 M B R A M , but not running under our protection service). For comparison, we also show protection overhead when network buffering is disabled, to better isolate network latency from other forms of checkpoint overhead (again, the flat profile is due to the ef-fective checkpoint rate being lower than the configured rate). Deadline schedul-ing, discussed in Section 3.4.1, and page compression, discussed in Section 3.4.2 are two possible techniques for reducing checkpoint latency and transmission time. Either or both would reduce checkpoint latency, and therefore be likely to increase SPECweb performance considerably. 3 . 3 . 3 P o s t m a r k The previous sections have characterized network and memory performance un-der protection, but the benchmarks used put only moderate load on the disk subsystem. In order to better understand the effects of the disk buffering mech-anism, we ran the Postmark disk benchmark (version 1.51). This benchmark is sensitive to both disk response time and throughput. To isolate the cost of disk replication, we did not engage memory or network protection during these tests. Configuration was identical to an unprotected system, with the exception that the virtual disk was provided by the tapdisk replication module. Figure 3.4 shows the total time required to perform 10000 postmark transactions with no disk replication, and with a replicated disk committing at frequencies of 10, 20, 30 and 40 times per second. The results indicate that replication has no significant impact on disk performance. Chapter 3. 'Evaluation 34 ft 60 T 70 I 80 I 90 100 Network latency (ms) Figure 3.3: The effect of network latency on SPECweb performance. Chapter 3. Evaluation 35 350 4 0 10 20 30 40 Checkpoints per second Figure 3 .4 : The effect of disk replication on Postmark performance. Chapter 3. Evaluation 36 3.4 Potential optimizations Although we believe that the performance overhead as measured earlier in this chapter is reasonable for the level of protection that it provides, we are eager to reduce it further, particularly for latency-sensitive workloads. In addition to simply streamlining the existing code (which is largely unoptimized python in our current implementation), we believe the following techniques have the potential to greatly increase performance. 3.4.1 Deadline scheduling Remus allows checkpoint requests at any frequency greater than the scheduler quantum of the domain 0 operating system (typically about 10ms), but the time required to perform a checkpoint and propagate it to the backup may exceed the interval between checkpoints, depending on the amount of state which must be copied. Figure 3.5 illustrates the strong correlation between the number of pages dirtied per epoch and the amount of time required to propagate the checkpoint, based on samples from a SPECweb benchmark in which the requested checkpoint frequency was 20 checkpoints per second. Although it is theoretically possible to take new checkpoints before the last checkpoint has completely propagated, this would not reduce effective latency: network output release must wait until propagation. Because memory bandwidth is far higher than disk bandwidth, epochs in which memory is heavily written to may delay subsequent checkpoints beyond their scheduled time. Figure 3.6 illustrates the effective checkpoint rate during kernel compilation compared to the requested rate. At higher frequencies, the protected system has less time in which to touch memory per epoch, but there is also less time to propagate it. At lower frequencies, the time available for both activities is increased, and furthermore multiple writes to the same memory location are coalesced during checkpointing. Theoretically, checkpoint frequency can be guaranteed at the interval of M/B where M is the total amount of physical memory available to the guest and B is the bandwidth of the replication link. This rate is far too low to support Chapter 3. Evaluation 37 realistic high availability deployment, but may be of use in other applications, such as periodic checkpoints for restartable computations. It would be desirable to provide stricter scheduling guarantees. One possible solution would be to adjust the rate at which the guest operates [11], deliber-ately slowing it down between checkpoints depending on the number of pages it touches. Applications which prioritize latency over throughput, such as those modeled by the SPECweb benchmark discussed in Section 3.3.2, could enable such throttling for improved performance. In order to perform this type of op-eration, the shadow page handler may be extended to pause the guest when the number of pages currently dirty exceeds a high water mark. It would also be possible for Xen to invoke a callback by an event channel when the num-ber of dirty pages exceeds some high water mark, allowing a user-level process to apply more sophisticated scheduling adjustments which may be configured through Xen's extensible domain scheduler. 3.4.2 Page compression Writes to memory rarely change every byte on a page. It has been observed that disk writes typically only alter 5-20% of a data block [37]. This fact can be exploited to reduce the amount of data transmitted by sending only the delta from a previous transmission of the same page. To evaluate the potential benefits of compressing the replication stream, we have implemented a proof-of-concept compression engine. Before transmitting a page, this system checks for its presence in an address-indexed L R U cache of previously transmitted pages. On a cache hit, the page is X O R e d with the previous version and the differences are run-length encoded. This provides ef-fective compression when page writes do not change the majority of the page. Although this is true for much of the data stream, there remains a significant fraction of pages that have been modified to the point where X O R compression is not effective. In these general-purpose algorithm such as that used by gzip may achieve a higher degree of compression. Chapter 3. Evaluation 38 139 - | o u ^ O L n o i n i n i n o L n o i o i f i O L n o i n L n o i o i X 5 T - c D C M i ^ c \ i r ^ c o o o - * c r ) - * O L n ' > - c D - i - r ^ c v i ' - C O ' t l O N O l O W n i n f f i l B O T - O ^ t D N O ) i - - r - T - i - i - i - C M C M ( N C M < M < M C V I 4K pages replicated Figure 3.5: SPECweb2005 checkpoint latency versus pages dirtied per epoch. Chapter 3. Evaluation 39 2730 2520 •o 2 3 1 0 g 2100 '1 1890 ^ 1680 2 1470 w 1260 H " 1050 3 840 * 630 420 210 0 Requested checkpoint interval: every 100ms 4K blocks dirtied ..checkpoint time.... 7i r | i -j.||ji!i||..,(i 2730 2520 T3 2 3 1 0 g 2100 | 1890 g 1680 2 1470 M 1260 o 1050 5 840 * 630 420 210 0 Requested checkpoint interval: every 50ms 'SVil'i--iv-' TTilTii iiii it! i ii 2730 . 2520 -| -o 2 3 1 0 g 2100 g 1890 . £ 1680 -| 5 1470 t> 1260 8 1 0 5 0 I 5 840 -| * 630 , "* 420 -| 210 0 J Requested checkpoint interval: every 33ms 2730 2520 -o 2 3 1 0 g 2100 g 1890 £ 1680 2 1470 m 1260 g 1050 5 840 * 630 420 210 0 Requested checkpointjrrtejyaj^eyery 25ms Figure 3.6: Checkpoint latency relative to the number of pages dirtied per epoch during kernel compilation. Chapter 3. Evaluation 4 0 Figure 3.7: Comparison of the effectiveness of various page compression mech-anisms. We found that by using a hybrid approach, in which each page is prefer-entially XOR-compressed, but falls back to gzip compression if the XOR com-pression ratio falls below 5:1 or the previous page is not present in the cache, we could obtain a typical compression ratio of 10:1 on the replication stream. Figure 3.7 shows the bandwidth in megabytes per second for 60 seconds of a kernel build test on a protected domain with 1GB of R A M . The cache size was 8192 pages and the average cache hit rate was 99%. Replication stream compression will consume additional memory and CPU Chapter 3. Evaluation 41 resources on the replicating host, but lightweight schemes such as the X O R com-pression technique mentioned above, which may be trivially vectorized to exe-cute efficiently on modern processors, should easily pay for themselves through a reduction in bandwidth required for replication, and its consequent improve-ment in network buffering latency. 3.4.3 Copy-on-write checkpoints In order to capture a consistent checkpoint, Remus in its current implementation must pause the domain for an amount of time linear to the number of pages which have been dirtied since the last checkpoint. This overhead could be mitigated by marking dirty pages as copy-on-write and resuming the domain immediately. This would reduce the time during which the domain must be paused to a fixed small cost proportional to the total amount of R A M available to the guest. We intend to implement copy-on-write by providing the Xen shadow paging code with a buffer of domain 0 memory into which it could copy touched pages before restoring read-write access. The replication process could then extract any pages marked as copied from the C O W buffer instead of directly mapping guest domain memory. When it has finished replicating pages, their space in the buffer can be marked for reuse by the Xen C O W module. If the buffer becomes full, the guest may simply be paused, resulting in a graceful degradation of service from C O W to stop-and-copy operation. Chapter 4 42 Related Work State replication may be performed at several levels, each of which balances efficiency and generality differently. At the lowest level, hardware-based repli-cation is potentially the most robust solution. Hardware, however, is much more expensive to develop than software and thus hardware replication is at a significant economic disadvantage. Virtualization-layer replication has many of the advantages of the hardware approach, but comes at lower cost because it is implemented in software. Like hardware, however, the virtualization layer has no semantic understanding of the operating-system and application state it replicates. As a result it can be less flexible than process checkpointing in the operating system, in application libraries or in applications themselves, because it must replicate the entire system instead of individual processes. It can also be less efficient, because it may replicate unnecessary state. The challenge for these higher-level approaches, however, is that interdependencies among state elements that comprise a checkpoint are insidiously difficult to identify and un-tangle from the rest of the system and thus these checkpointing mechanisms are significantly more complex than checkpointing in the virtualization layer. 4.1 Virtual machine migration As described earlier, Remus is built on top of the Xen support for live migra-tion [7], extended significantly to support frequent, remote checkpointing. Brad-ford et al. extended Xen's live migration support in another direction: migrating persistent state along with the migrating guest so that it can be restarted on a remote node that does not share network storage with the originating system[3]. Chapter 4. Related Work 43 Similar to Remus, other projects have used virtual machines to provide high availability. The closest to our work is Bressoud and Schneider's [4]. They use the virtual machine monitor to forward the input events seen by a primary system to a backup system where they are deterministically replayed to replicate the primary's state. Deterministic replay requires much stricter constraints on the target architecture than simple virtualization and it requires an architecture-specific implementation in V M M . Another significant drawback of deterministic replay exemplified by Bressoud and Schneider's work is that it does not easily extend to multi-core CPUs . The problem is that it is necessary, but difficult, to determine the order in which cores accesses the shared memory. There have been attempts to address this problem. For example, Flight Data Recorder [36] is a hardware module that sniffs cache coherency traffic in order to record the order in which multiple processors access shared memory. Similarly, Dunlap introduces a software approach in which the C R E W protocol (concurrent read, exclusive write) is imposed on shared memory via page protection [9]. While these approaches do make S M P deterministic replay possible, it is not clear if they make it feasible due to their high overhead, which increases at least linearly with the degree of concurrency. Our work sidesteps this problem entirely because it does not require deterministic replay. 4.2 Vi r tua l machine logging and replay Virtual machine logging has been used for purposes other than high availability. For example, in ReVirt [10], virtualization is used to provide a secure layer for logging state changes in the target system in order to provide better forensic evidence for intrusion detection systems. The replayed system is a read-only copy of the original system, which is not meant to be run except in order to recreate the events involved in a system compromise. Logging has also been used to build a time-travelling debugger [14] that, Like ReVirt , replays the system for forensics only. Chapter 4. Related Work 44 4.3 Operating system replication There are many research operating systems that support process migration, primarily for load balancing. Some examples are Accent [26], Amoeba [18], MOSIX [1], Sprite [24] and V [6]. The main challenge with using process migration for failure recovery is that migrated processes typically leave residual dependencies to the system from which they were migrated. Eliminating these dependencies is necessary to tolerate the failure of the primary host, but the solution has proved to be elusive due to the complexity of the system and the structure of these dependencies. Some attempts have been made to replicate applications at the operating system level. Zap [23] attempts to introduce a virtualization layer within the linux kernel. This approach must be rebuilt for every operating system, and carefully maintained across versions. 4.4 Library approaches Some application libraries provide support for process migration and checkpoint-ing. This support is common for parallel applications for example CoCheck [30]. Typically process migration is used for load balancing and checkpointing is used to recover an entire distributed application in the event of failure. 4.5 Replicated storage There has also been a large amount of work on checkpointable storage for dis-aster recovery as well as forensics. The Linux Logical Volume Manager [15] provides a limited form of copy-on-write snapshots of a block store. Paral-lax [35] significantly improves on this design by providing limitless lightweight copy-on-write snapshots at the block level. The Andrew File System [12] allows one snapshot at a time to exist for a given volume. Other approaches include RSnapshot, which runs on top of a file system to create snapshots via a series of Chapter 4. Related Work 45 hardlinks, and a wide variety of backup software. DRBD [27] is a software ab-straction over a block device which transparently replicates it to another server. Because these approaches replicate only persistent state, they cannot provide true high availability. It would be straightforward to replace Remus' native disk replication subsystem with other block-level approaches. Checkpointable file systems could also be used, with some loss of generality. 4.6 Speculative execution Using speculative execution to isolate I /O processing from computation has been explored by other systems. In particular, SpecNFS [21] and Rethink the Sync [22] use speculation in a manner similar to us in order to gain asynchrony of I /O processing. Remus is different from these systems in that the semantics of block I /O from the guest remain entirely unchanged: they are applied imme-diately to the local physical disk. Instead, our system buffers generated network traffic to isolate the externally visible effects of speculative execution until the associated state has been completely replicated. 46 Chapter 5 Future Work This chapter briefly discusses a number of directions that we intend to explore in order to improve performance and extend the service which Remus provides in its current implementation. As we have shown in the previous chapter, the overhead imposed by the high availability service is not unreasonable. However, the implementation described in this thesis is quite young. Consequently, sev-eral potential areas of optimization remain. Upon completion of the targeted optimizations discussed in Section 3.4, we intend to investigate more general optimizations and extensions, some of which are described below. In t rospec t ion op t imiza t ions . This implementation currently propagates more state than is strictly necessary. For example, buffer cache pages do not need to be replicated, since they can simply be read in from persistent storage again. To leverage this, the virtual disk device could log the addresses of buffers provided to it for disk reads, along with the associated disk addresses. The process responsible for copying memory could then skip over these pages if they had not been modified after the completion of the disk read. The remote end would be responsible for reissuing the reads from its copy of the disk in order to fill in the missing pages. For disk-heavy workloads, this should result in a substantial reduction in state propagation time. More advanced uses of introspection might infer information similar to that captured by Valgrind [20] in order to determine whether dirty memory is actually in use by guest applications (it may have been returned to a free page pool, for example) before propagating it to the backup. Chapter 5. Future Work 47 Output dependency analysis. For some classes of applications, the la-tency induced by network output buffering represents a significant loss of per-formance. Applications that perform their own checkpointing can reduce this overhead because they have access to dependency information between the state that they must checkpoint and the output that they produce, whereas Remus must assume that every outbound network packet is dependent on any uncheck-pointed state. It may be interesting to explore ways of exporting this depen-dency information to Remus in order to increase performance, in a cooperative manner reminiscent of the techniques Xen uses to paravirtualize guest operating systems. Hardware virtualization support. Due to the lack of equipment sup-porting hardware virtualization in our laboratory at the time of development, we have only implemented support for paravirtualized guest virtual machines. But we have examined the code required to support fully virtualized environ-ments, and the outlook is quite promising. In fact, it may be somewhat simpler than the paravirtual implementation due to the better encapsulation provided by virtualization-aware hardware. Cluster replication. It would be useful to extend the system to protect multiple interconnected hosts. While each host can be protected independently, coordinated protection would allow internal network communication to proceed without buffering. This has the potential to dramatically improve the through-put of distributed applications, including the three-tiered web application con-figuration prevalent in managed hosting environments. Support for cluster repli-cation could be provided by a distributed checkpointing protocol such as that which is described in our colleague Gang Peng's master's thesis [25], which used an early version of the checkpointing infrastructure provided by Remus in order to support restartable parallel computations. Disaster recovery. Remus was born from the SecondSite [8] project, whose aim was to provide geographically diverse mirrors of running systems in order survive physical disaster. We are in the process of planning a multi-site de-ployment of Remus in order to experiment with this sort of configuration. In a Chapter 5. Future Work 48 long distance deployment of Remus, network delay will be an even larger con-cern. Additionally, network reconfigurations will be required to redirect Internet traffic accordingly. Log - s t ruc tu r ed datacenters. We are interested in extending Remus to capture and preserve the complete execution history of protected V M s rather than just the most recent checkpoint. This class of very detailed execution data should be very useful in building advanced debugging and forensics tools. It may also provide a convenient mechanism for recovering from state corruption whether introduced by operator error or by malicious agents (viruses, worms, and so forth). 49 Chapter 6 Conclusion This thesis has presented Remus, a novel system by which high availability may be retrofitted onto existing software running on commodity hardware, without requiring significant changes to either. B y encapsulating an existing system with an virtual machine, it can capture whole-machine state which represents a checkpoint of execution at an instant in time. These checkpoints are propagated asynchronously to a backup host while the protected virtual machine continues to run speculatively. Although network output must be buffered between check-points in order to ensure consistency between primary and backup hosts, Remus performs checkpoints at such high frequency (we have tested it at up to 40 times per second) that the effective latency induced by buffering need not be higher than that experienced between hosts on the internet. Providing high availability is a challenging task and one that has tradition-ally required considerable cost and engineering effort. Remus commodities high availability by presenting it as a service at the virtualization platform layer: high availability may simply be "switched on" for specific virtual machines. As with any high availability system, protection does not come without a cost: The net-work buffering required to ensure consistent replication imposes a performance overhead on applications that require very low latency. Administrators must also deploy additional hardware, which may be used in N-to-1 configurations with a single backup protecting a number of active hosts. In exchange for this overhead, Remus completely eliminates the task of modifying individual appli-cations in order to provide high availability facilities, and it does so without requiring special-purpose hardware. Chapter 6. Conclusion 50 Remus represents a previously unexplored point in the design space of high availability for modern servers. The system allows protection to be simply and dynamically provided to running virtual machines at the push of a button. We feel that this model is particularly attractive for hosting providers, who desire to offer differentiated services to customers. While high availability is an interesting challenge in its own right, we believe that the high-frequency checkpointing mechanism we have engineered in support of Remus will have many other interesting applications, ranging from forensics and error recovery tools based on replayable history to software engineering applications such as concurrency-aware time-travelling debuggers. 51 Bibliography [1] Amnon Barak and Richard Wheeler. Mosix: an integrated multiprocessor unix. pages 41-53, 1999. [2] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, T i m Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the art of virtualization. In SOSP '03: Proceedings of the nineteenth ACM symposium on Operating systems principles, pages 164-177, New York, N Y , U S A , 2003. A C M Press. [3] Robert Bradford, Evangelos Kotsovinos, Anja Feldmann, and Harald Schioberg. Live wide-area migration of virtual machines including local persistent state. In VEE '07: Proceedings of the 3rd international confer-ence on Virtual execution environments, pages 169-179, New York, N Y , U S A , 2007. A C M Press. [4] Thomas C. Bressoud and Fred B . Schneider. Hypervisor-based fault-tolerance. In Proceedings of the Fifteenth ACM Symposium on Operating System Principles, pages 1-11, December 1995. [5] Subhachandra Chandra and Peter M . Chen. The impact of recovery mecha-nisms on the likelihood of saving corrupted state. In ISSRE '02: Proceedings of the 13th International Symposium on Software Reliability Engineering (ISSRE'02), page 91, Washington, D C , U S A , 2002. I E E E Computer Soci-ety. [6] David Cheriton. The v distributed system. Communications of the ACM, 31(3):314-333, 1988. Bibliography 52 [7] Christopher Clark, Keir Fraser, Steven Hand, Jacob Gorm Hansen, Eric Jul, Christian Limpach, Ian Pratt, and Andrew Warfield. Live migration of virtual machines. In Proceedings of the 2nd conference on Symposium on Networked Systems Design & Implementation, Berkeley, C A , U S A , 2005. U S E N I X Association. [8] Brendan Cully and Andrew Warfield. Secondsite: disaster protection for the common server. In HOTDEP'06: Proceedings of the 2nd conference on Hot Topics in System Dependability, Berkeley, C A , U S A , 2006. U S E N I X Association. [9] George Dunlap. Execution Replay for Intrusion Analysis. Ph D thesis, University of Michigan, 2006. [10] George W . Dunlap, Samuel T. King , Sukru Cinar, Murtaza A . Basrai, and Peter M . Chen. Revirt: Enabling intrusion analysis through virtual-machine logging and replay. In Proceedings of the 5th Symposium on Op-erating Systems Design & Implementation (OSDI 2002), 2002. [11] D. Gupta, K . Yocum, M . McNett, A . C. Snoeren, A . Vahdat, and G . M . Voelker. To infinity and beyond: time warped network emulation. In SOSP '05: Proceedings of the twentieth ACM symposium on Operating systems principles, 2005. [12] John H . Howard, Michael L . Kazar, Sherri G. Menees, David A . Nichols, M . Satyanarayanan, Robert N . Sidebotham, and Michael J . West. Scale and performance in a distributed file system. A CM Transactions on Computer Systems, 6(1):51-81, 1988. [13] HP. NonStop Computing. [14] Samuel T. King , George W . Dunlap, and Peter M . Chen. Debugging op-erating systems with time-traveling virtual machines. In ATEC'05: Pro-ceedings of the USENIX Annual Technical Conference 2005 on USENIX Bibliography 53 Annual Technical Conference, Berkeley, C A , U S A , 2005. U S E N I X Associ-ation. [15] Lvm2. [16] D. Marques, G . Bronevetsky, R. Fernandes, K . Pingali, and P. Stodghill. Optimizing checkpoint sizes in the c3 system. In 19th International Parallel and Distributed Processing Symposium (IPDPS 2005), Apr i l 2005. [17] Patrick McHardy. Linux imq. [18] Sape J . Mullender, Guido van Rossum, Andrew S. Tanenbaum, Robbert van Renesse, and Hans van Staveren. Amoeba: A distributed operating system for the 1990s. Computer, 23(5):44-53, 1990. [19] netem. [20] Nicholas Nethercote and Julian Seward. Valgrind: a framework for heavy-weight dynamic binary instrumentation. In PLDI '07: Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and im-plementation, pages 89-100, New York, N Y , U S A , 2007. A C M Press. [21] Edmund B . Nightingale, Peter M . Chen, and Jason Fl inn. Speculative execution in a distributed file system. In SOSP '05: Proceedings of the twentieth ACM symposium on Operating systems principles, pages 191— 205, New York, N Y , U S A , 2005. A C M Press. [22] Edmund B . Nightingale, Kaushik Veeraraghavan, Peter M . Chen, and Ja-son Fl inn. Rethink the sync. In USENIX'06: Proceedings of the 7th con-ference on USENIX Symposium on Operating Systems Design and Imple-mentation, Berkeley, C A , U S A , 2006. U S E N I X Association. [23] Steven Osman, Dinesh Subhraveti, Gong Su, and Jason Nieh. The design and implementation of zap: a system for migrating computing environ-ments. SIGOPS Oper. Syst. Rev., 36(SI):361-376, 2002. Bibliography 54 [24] John K . Ousterhout, Andrew R. Cherenson, Frederick Doughs, Michael N . Nelson, and Brent B . Welch. The sprite network operating system. Com-puter, 21(2):23-36, 1988. [25] Gang Peng. Distributed checkpointing. Master's thesis, University of British Columbia, 2007. [26] Richard F . Rashid and George G . Robertson. Accent: A communication oriented network operating system kernel. In SOSP '81: Proceedings of the eighth ACM symposium on Operating systems principles, pages 64-75, New York, N Y , U S A , 1981. A C M Press. [27] Philipp Reisner and Lars Ellenberg. Drbd v8 - replicated storage with shared disk semantics. In Proceedings of the 12th International Linux Sys-tem Technology Conference, October 2005. [28] Rusty Russell. Netfilter. [29] J . Schindler and G . Ganger. Automated disk drive characterization. Techni-cal Report C M U SCS Technical Report CMU-CS-99-176, Carnegie Mellon University, December 1999. [30] G . Stellner. CoCheck: Checkpointing and Process Migration for M P I . In Proceedings of the 10th International Parallel Processing Symposium (IPPS '96), Honolulu, Hawaii, 1996. [31] Symantec Corporation. Veritas Cluster Server for VMware E S X . /High_Availability/vcs22vmware_datasheet.pdf, 2006. [32] VMware, Inc. Vmware high availability (ha)., 2007. [33] Andrew Warfield. Virtual Devices for Virtual Machines. PhD thesis, Uni-versity of Cambridge, 2006. Bibliography 55 [34] Andrew Warfield, Steven Hand, Keir Fraser, and T i m Deegan. Facilitating the development of soft devices. In ATEC'05: Proceedings of the USENIX Annual Technical Conference 2005 on USENIX Annual Technical Confer-ence, Berkeley, C A , U S A , 2005. U S E N I X Association. [35] Andrew Warfield, Russ Ross, Keir Fraser, Christian Limpach, and Steven Hand. Parallax: managing storage for a million machines. In HOTOS'05: Proceedings of the 10th conference on Hot Topics in Operating Systems, Berkeley, C A , U S A , 2005. U S E N I X Association. [36] M i n X u , Rastislav Bodik, and Mark D. Hi l l . A "flight data recorder" for enabling full-system multiprocessor deterministic replay. In ISCA '03: Proceedings of the 30th annual international symposium on Computer ar-chitecture, pages 122-135, New York, N Y , U S A , 2003. A C M Press. [37] Qing Yang, Weijun Xiao, and Jin Ren. Trap-array: A disk array architec-ture providing timely recovery to any point-in-time. In ISCA '06: Proceed-ings of the 33rd annual international symposium on Computer Architecture, pages 289-301, Washington, D C , U S A , 2006. I E E E Computer Society. 


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items