UBC Theses and Dissertations


A garbage collector & free space compactor for the parallax storage system Abdul-Amir, Mohammad 2011

A Garbage Collector & Free Space Compactor for the Parallax Storage System

by Mohammad Abdul-Amir
B.Sc., Jordan University of Science and Technology, 2005

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty of Graduate Studies (Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)
April 2011
© Mohammad Abdul-Amir 2011

Abstract

Parallax is a distributed storage system for virtualized environments. It provides virtual disk abstractions for the consumption of virtual machines. Parallax's design has been driven by emerging trends in virtualized deployments. The system makes use of Copy-on-Write mechanisms to provide primitives for fast snapshotting and provisioning of disks in order to match the nimbleness and agility of virtual machines. However, using Copy-on-Write mechanisms in a distributed storage system complicates free space reclamation and leads to data fragmentation. Implementing a garbage collector and a disk defragmentation utility for Parallax has been attempted previously. The work described in this thesis presents a revised implementation of garbage collection for Parallax. It also introduces a utility for managing free space reclaimed from garbage collection, namely a fault-tolerant free space coalescing utility that aims at minimizing the occurrence of data fragmentation and its adverse effects in Parallax.

Preface

The work presented in this thesis builds upon earlier work that yielded the Parallax storage system. These two pieces of work are:

1. A. Warfield, R. Ross, K. Fraser, C. Limpach, and S. Hand. Parallax: Managing storage for a million machines. In Proceedings of the 10th Conference on Hot Topics in Operating Systems, Volume 10, page 4. USENIX Association, 2005. [17]

2. D.T. Meyer, G. Aggarwal, B. Cully, G. Lefebvre, M.J. Feeley, N.C. Hutchinson, and A. Warfield. Parallax: virtual disks for virtual machines.
In Proceedings of the 3rd ACM SIGOPS/EuroSys European Conference on Computer Systems 2008, pages 41–54. ACM, 2008. [12]

Chapter 2 serves as an introduction to the Parallax storage system, and the figures included in this chapter are copied from the second of these two aforementioned works [12]. The authors of the previous work hold copyrights to these figures and they are included in this work with their consent.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
List of Algorithms
Acknowledgements

1 Introduction
  1.1 Copy-on-Write Storage Systems
  1.2 Copy-on-Write and Parallax
  1.3 Mitigating Fragmentation Effects

2 Background - Parallax
  2.1 Design Principles
  2.2 System Structure
  2.3 Virtual Disk Images
    2.3.1 VDIs as Block Address Spaces
    2.3.2 Snapshots
  2.4 Blockstore
    2.4.1 Extent Based Access
    2.4.2 Lock Management
3 Design and Architecture
  3.1 Parallax Shortcomings
    3.1.1 Allocate-only Storage System
    3.1.2 Data Fragmentation
    3.1.3 Free Space Fragmentation
  3.2 Design Objectives
  3.3 Design Highlights
    3.3.1 Physical Address and Location Decoupling
    3.3.2 Minimal Interference with the I/O Path
    3.3.3 Fault-tolerant Free Space Compaction
  3.4 Architecture

4 Implementation
  4.1 Disk Object Layer
  4.2 Parallax RPC Server
  4.3 Parallax Instance Registry
  4.4 Free Space Reclamation
    4.4.1 Deleting VDIs and Snapshots
    4.4.2 Garbage Collection
  4.5 Free Space Compaction
    4.5.1 Offline Compactor
    4.5.2 Online Compactor

5 Evaluation
  5.1 Experimental Setup
  5.2 Garbage Collector Performance
    5.2.1 Live Data
    5.2.2 VDI and Snapshot Count
    5.2.3 Impact of Block Sharing
    5.2.4 Discussion
  5.3 Free Space Compactor Performance
  5.4 I/O Path Overhead
    5.4.1 Disk Object Layer I/O Path Overhead
    5.4.2 Compaction I/O Path Overhead

6 Comparison with Earlier Work
  6.1 Garbage Collector
  6.2 Data Defragmentation
  6.3 Fault Tolerance

7 Related Work
  7.1 Block-level Indirection Layer
  7.2 Garbage Collection and Free Space Management

8 Conclusion

9 Future Work

Bibliography

List of Tables

4.1 Parallax RPC Server Procedures.
4.2 Parallax Delete Tools.
4.3 Parallax Instance I/O Paths for Different Compaction States.
5.1 Average Free Space Compaction Execution Times for Different Synthetically Fragmented Blockstore Extents.

List of Figures

2.1 The Parallax Storage System Structure.
2.2 Virtual Disk Snapshot.
2.3 Parallax Blockstore Extents.
3.1 Simplified Demonstration of Free Space Resulting from Deleting a VDI's Snapshots.
3.2 Parallax Architecture with Free Space Reclamation and Management Utilities.
4.1 Address Translation Layers in Parallax before and after Adding the Disk Object Layer.
4.2 Extent Anatomy Showing the Disk Object Map.
4.3 Physical to Machine Block Address Translation.
4.4 Creating and Deleting Virtual Disk Images Sharing Common History.
5.1 Garbage Collector Performance for Different Amounts of Live Data Blocks.
5.2 Garbage Collector Performance versus Virtual Disk Image Count.
5.3 Garbage Collector Performance versus Snapshot Count.
5.4 Garbage Collector Performance versus Level of Sharing among Virtual Disk Images.
5.5 Disk Object Layer Overhead.
5.6 Copying and Commit Phases I/O Overhead.

List of Algorithms

1 Garbage Collector
2 Offline Compactor
3 Online Compactor

Acknowledgements

I would like to thank my professors Mike Feeley, Andrew Warfield and Norman Hutchinson for their positivity and patience. They have been among the most supportive people I have ever met. I would also like to thank Dutch Meyer for easing me into Parallax and enduring long discussions about it and storage systems in general.

Chapter 1  Introduction

A storage system provides persistent storage services to its clients. Different storage systems offer different storage abstractions and sets of primitive operations to manipulate these abstractions.
For example, POSIX-compliant file systems offer a file abstraction which can be manipulated using the POSIX standard file interface. Virtual disk storage systems offer a raw disk abstraction which can be manipulated using block reads and writes. Key/value stores offer an immutable object abstraction which can be manipulated using a dictionary-like set/get interface. In the rest of this work the term storage object is used to refer to a storage abstraction instance of some storage system.

Some storage systems make use of a technique called Copy-on-Write (CoW) to provide useful features to their clients. These systems suffer from problems which have tractable solutions in single-host systems. However, these solutions do not work well in distributed CoW storage systems. The following sections discuss these issues in more depth, introduce our proposal to solve them and delineate this work's contribution. Section 1.1 introduces CoW storage systems, their problems and the existing solutions to these problems. Section 1.2 introduces Parallax and explains why the aforementioned solutions fail to apply in such a distributed CoW storage system. Finally, in Section 1.3 we introduce our solution to mitigate CoW effects in Parallax and summarize our results.

1.1  Copy-on-Write Storage Systems

Storage systems that use Copy-on-Write (CoW) techniques enable sharing disk blocks among multiple storage objects. These shared blocks are marked as read-only, which makes reading them safe from the context of any of the sharing storage objects. However, mutations to storage objects affecting any of these shared blocks will fault. Such a fault is handled by allocating a new block, reading the contents of the faulting shared block, and finally writing the modified contents to the newly allocated block.
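A minimal in-memory sketch may make this fault-handling path concrete. The class and method names here are illustrative only, not part of Parallax or any real storage system:

```python
class CowStore:
    """A toy block store: physical address -> block contents."""
    def __init__(self):
        self.blocks = {}
        self.next_addr = 0

    def allocate(self, data):
        addr = self.next_addr
        self.next_addr += 1
        self.blocks[addr] = data
        return addr


class StorageObject:
    """A storage object indexed as: virtual block -> (phys addr, read_only)."""
    def __init__(self, store):
        self.store = store
        self.index = {}

    def write(self, vblock, data):
        entry = self.index.get(vblock)
        if entry is None or entry[1]:
            # Sparse or shared/read-only reference: fault. Allocate a new
            # block and repoint this object's index at it. (A sub-block
            # write would first copy the old contents into the new block.)
            self.index[vblock] = (self.store.allocate(data), False)
        else:
            # Exclusively owned and writable: mutate in place.
            self.store.blocks[entry[0]] = data

    def snapshot(self):
        # Mark every reference read-only and share the index with the clone.
        self.index = {v: (a, True) for v, (a, _) in self.index.items()}
        clone = StorageObject(self.store)
        clone.index = dict(self.index)
        return clone
```

After a snapshot, the first write to a shared block lands at a fresh physical address while the snapshot keeps reading the old one.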
The mutated storage object's index is updated to refer to the new block, which is safe to modify thereafter as long as it is exclusively owned by the storage object.

One benefit of using CoW in storage systems is saving storage space, as it allows for eliminating block duplicates. Another is that it simplifies the implementation of certain file system operations such as consistent cloning and check-pointing of storage objects. Check-pointing a storage object means keeping track of the object's state at a given point in time. Cloning, on the other hand, is creating an exact duplicate of a storage object's state at a given point in time. The two operations are similar except that the outcome of the check-pointing process is immutable. Both operations can be implemented in a non-CoW storage system using a combination of buffering write requests and lazy copying of storage objects' data blocks. In CoW storage systems, however, all that needs to be done is to copy the storage object's index and mark all its data block references as read-only. Implementing these operations in a CoW storage system therefore reduces the amount of copying that needs to be done and is relatively simple.

One problem storage systems deploying CoW techniques face is data fragmentation. A core property of these systems is that the storage location of a shared block is changed before it can be modified. Consequently, CoW breaks the linearity of storage objects and fragments their data over the underlying storage. A key property of magnetic disks is that they achieve high throughput on sequential accesses, while they perform abysmally on random ones. Thus a computer system's performance depends on the order of data accesses and their layout on disk. Many storage system clients are tuned to perform well on the assumption that the storage objects provided for them are stored sequentially on disk.
CoW storage systems break these assumptions, causing significant performance degradation.

One possible solution to this problem is defragmentation. Defragmentation is a process that reverses the effects of fragmentation. It coalesces and reorders the data blocks of storage objects so that they are contiguous and laid out sequentially on the underlying storage. Defragmentation is often achieved in two steps. The first is creating contiguous free space on the underlying storage big enough to fit the object being defragmented. The second is moving the blocks of the defragmented object to the free space and modifying the storage object's index references consistently.

If an object is shared, however, it may not be possible to defragment every distinct CoW version of the object without duplicating some blocks. Defragmenting one storage object will fragment every other storage object that it shares blocks with. Nevertheless, it is still possible to defragment one storage object at the expense of others. Defragmentation in CoW storage systems is further complicated by the fact that individual blocks are shared by multiple storage objects. Thus in the second step of defragmentation, moving blocks may require updating multiple storage objects' index structures. In addition, one needs to identify all storage objects that share data blocks with the storage object being defragmented.

One last problem with CoW storage systems is that CoW complicates free space reclamation. In non-CoW storage systems data blocks belong exclusively to a single storage object. When deleting a storage object, all data blocks belonging to it can be safely reclaimed back to free space. In CoW storage systems, however, blocks can be shared across multiple storage objects. Thus deleting a storage object does not imply the reclamation of all data blocks it refers to.
One way to overcome this obstacle is to keep track of the count of storage objects referring to each data block. Whenever this reference count reaches zero, the block is reclaimed back to free space.

1.2  Copy-on-Write and Parallax

Parallax is a high-performance and highly available distributed storage system that serves virtual disk abstractions to its clients. Indices in the form of radix trees are used to translate virtual disk block addresses to their corresponding physical addresses on the underlying storage. Sharing among virtual disks is fine grained at the block level and applies both to data and metadata (i.e., including the aforementioned radix tree indices). The fact that metadata radix tree nodes can be shared enables Parallax to deliver fast cloning and consistent check-pointing operations on virtual disks.

Parallax's cheap and fast cloning and check-pointing operations can be beneficial for many use cases. For example, frequent check-pointing of virtual disks enables reading the disk's state at arbitrary points in the past. This could be useful for many applications, such as virtual machine history analysis and recovering a disk's state back to known consistent points in the past. In addition, cheap cloning of virtual system disks speeds up virtual machine provisioning, which matches the level of nimbleness and agility expected from virtualized environments nowadays. This capability has been shown to be valuable for intrusion detection and analysis systems [15]. For these reasons and others, we expect heavy use of the CoW-based cloning and check-pointing operations and consequently a high degree of block sharing in Parallax.

Like any other CoW storage system, Parallax is susceptible to data fragmentation. However, defragmentation in Parallax is more problematic than in regular CoW systems for the following reasons. First, Parallax is a distributed storage system.
Maintaining consistency of distributed data structures is nontrivial and often involves some level of coordination that limits scalability and availability. For example, lock coordination decreases service availability as it makes the system vulnerable to potential deadlocks on failures. Second, block sharing is heavy and fine grained. Moving one block means multiple reference updates, which in turn require multiple seek operations to different parts of the disk. Furthermore, it is necessary to track and be able to efficiently locate all of the index references to a block in order to update those references when moving the block. This can be achieved either by maintaining a database of back references or by scavenging all virtual disks in the system every time we need to move a block. Both alternatives are sufficiently expensive that their utility is questionable.

1.3  Mitigating Fragmentation Effects

For the reasons mentioned in Section 1.2 it is clear that defragmentation in Parallax may be expensive. So one question this thesis tries to answer is: "Is there a less expensive technique to alleviate the fragmentation problem in Parallax?"

One hypothesis is that coalescing free space decreases fragmentation of data. Having large contiguous regions of free blocks will cause subsequent block allocations in Parallax to be contiguous on the underlying storage. Hence, when bursts of sequential block writes hit virtual disks, they will end up sequential on the underlying storage. In contrast, if free space is fragmented, the sequentially written blocks will end up scattered on the underlying storage. It is expected that this provision for avoiding data fragmentation can benefit some workloads, such as sequential writes followed by sequential or random reads. Free space coalescing is a problem that is easier to solve than data defragmentation in Parallax.
Unlike data defragmentation, a free space coalescing utility is agnostic to all meta-information about blocks except their allocation status. This relieves the utility from having to understand the layout of data blocks on disk. An extra level of indirection can decouple block addresses from their physical locations, which eliminates the need to look up and update references to blocks being moved while coalescing free space. This decoupling could in principle also benefit data defragmentation. However, address translation efficiency requires keeping address translation maps close to their corresponding blocks on disk. This in turn limits the movement of blocks in the blockstore and the level of data defragmentation that can be achieved. Free space coalescing does not suffer from this problem: free blocks are uniform, and the choice of which free blocks to coalesce is irrelevant. In a nutshell, free space coalescing is an easier problem to solve than data defragmentation.

Parallax in its current implementation is an allocate/write-only storage system, as it lacks any implementation of free space reclamation. Without a mechanism to recycle storage blocks, continuous use of Parallax will eventually lead to storage exhaustion and render the storage system inoperable. It is most likely that a garbage collection implementation was deferred to expedite the implementation of a Parallax prototype. However, implementing garbage collection for a storage system such as Parallax is quite tricky due to Parallax's fine-grained sharing of blocks. This could be another reason behind deferring the implementation of a garbage collector.

Free space reclamation using reference counts is harder to implement in Parallax than in other CoW storage systems. Fine-grained sharing of blocks in Parallax means many reference count updates on CoW operations.
Moreover, given that Parallax's operation is distributed, reference count updates must be coordinated to ensure their consistency. On one hand, a simple implementation would use locks to coordinate updates to reference counts and can incur two I/O operations per reference count update. This would slow down the I/O path significantly and introduce the possibility of stalling I/O operations. A more complicated alternative is to log reference count updates locally and collect and merge them into some centralized reference count database later. In this work we opted to avoid reference-count-based garbage collection in favor of a simpler garbage collector that does not add overhead to the I/O path.

In this thesis we describe a garbage collector and a free space compaction utility designed for the Parallax storage system. The garbage collector is a variant of the mark-and-sweep algorithm. The free space compaction utility coalesces free space and is designed to work in a distributed fashion and be fault tolerant. In addition, we ensure that both of these utilities have a minimal impact on the I/O path.

Chapter 2  Background - Parallax

Parallax is a distributed storage system specially designed for virtualized environments. It provides virtual disk abstractions which are consumed by virtual machines. Virtualization offers a new level of flexibility by abstracting complex pieces of software, namely operating systems and the applications running in them. While hardware has evolved to support virtualization, storage remains relatively inflexible and less agile. Parallax is designed to provide storage that matches the flexibility of the virtual machines that consume it. Novel uses of virtual machines dictated some unorthodox design directions in Parallax. For example, the need for fine-grained rewinding of virtual disks dictated the ability to snapshot disks frequently.
Similarly, the low overhead of VM creation dictated the ability to provision disks quickly and cheaply.

The following subsections introduce Parallax in greater detail. First, the principles that guided the design of Parallax are introduced in Section 2.1. The structure and high-level architecture of Parallax are presented in Section 2.2. The virtual disk image (i.e., VDI), which is the main storage abstraction delivered by Parallax, is introduced in Section 2.3. Finally, the Parallax backend blockstore is described in Section 2.4.

2.1  Design Principles

The following paragraphs briefly introduce the design principles that guided Parallax's development.

Agnosticism and Isolation

Parallax is implemented as a tapdisk driver [12] that intercepts the I/O requests of client VMs. The Parallax storage appliance can live in its own VM; thus Parallax isolates storage management from client virtual machines. This isolation results in Parallax being agnostic to client virtual machines' operating systems as well as to the backend storage system. It also unifies the storage management role, as Parallax instances can be managed by a single cluster-wide administrator.

Block Level Operations

Parallax provides a narrow block-level I/O interface. This further reinforces Parallax's agnosticism to client operating systems. The narrow interface at the block level simplifies the implementation; however, it can potentially result in some complications due to missing file system semantics at this lower level.

Minimize Lock Management

Distributed systems often rely on distributed lock managers to coordinate access to shared resources. However, such systems are vulnerable to network failures and single-host failures. Parallax seeks to minimize the use of locks for coordinating distributed storage operations.
It uses coarse locks that are acquired infrequently, and the regular operations of Parallax instances (serving I/O operations) do not depend on the availability of the lock manager. This results in Parallax instances being able to continue operation in the face of network disconnections† and host failures.

Snapshots as a Primitive Operation

Parallax provides a disk snapshot primitive by design, unlike other storage systems that provide it as an afterthought. The snapshot primitive has low overhead in time and space. This is achieved using radix tree block maps with Copy-on-Write (CoW) sharing semantics.

† This does not include disconnections from the shared blockstore.

2.2  System Structure

Figure 2.1 shows the structure of a Parallax storage system. A storage appliance virtual machine in each physical host mediates access to the shared blockstore. Parallax runs as a user-level daemon in the storage appliance. The storage appliance VM intercepts block requests efficiently using Xen's block tap driver. The driver provides a user-space asynchronous block interface. When virtual disks are connected, the block tap starts a user-space tapdisk process in the appliance VM to service all client VMs running on the physical host. All Parallax instances share access to a single shared block device. There are no restrictions on what this block device is as long as it is sharable and accessible as a block target.
[Figure 2.1: The Parallax Storage System Structure.]

2.3  Virtual Disk Images

Virtual Disk Images (VDIs) are the core abstractions provided by Parallax to virtual machines. Virtualized environments simplify distributed access to VDIs as they are always single-writer. This simplifies coordination among Parallax instances, and significantly minimizes the use of distributed locking in Parallax. A VDI can be accessed through the storage appliance running on any physical host. This location independence of VDIs from the physical hosts allows virtual machines to migrate from one physical machine to another seamlessly. Create, delete and snapshot are the main operations that can be performed on VDIs. Deletes were not fully implemented prior to the garbage collection work described in this thesis. Deleting VDIs is discussed in detail in Chapter 4.

2.3.1  VDIs as Block Address Spaces

The mechanism used to represent VDI address spaces borrows heavily from virtual memory techniques.
A Parallax VDI can be seen as a single block address space, represented by a radix tree that maps virtual block addresses to physical block addresses, where virtual block addresses are block addresses within a VDI and physical block addresses are addresses of blocks in the Parallax shared blockstore. The current Parallax implementation uses 4KB blocks (i.e., radix tree nodes) to map virtual to physical addresses. These mappings are stored in a 3-level radix tree. Each internal radix tree node stores 512 64-bit global block address pointers, where the high-order bit is used to indicate that a link is read-only (necessary for implementing snapshots). This results in a maximum VDI size of 512 GB, and a maximum shared blockstore size of 32 zetta (10^21) bytes. VDI address spaces are sparse; a zeroed block reference in a radix tree node indicates that the range beyond this node is non-existent. Therefore creating a VDI is as simple as allocating a single zeroed root block for the VDI. References can be shared across descendant radix trees in order to implement snapshots.

2.3.2  Snapshots

A snapshot is a read-only image of an entire disk at a particular point in time. Snapshots are crash consistent in that their consistency is similar to that of the disk of a crashed system. This consistency is good enough to allow modern file systems (e.g., ext3 and NTFS) built on top of Parallax to recover from a system crash. Snapshots of VDIs can be obtained offline or online. In the online case, snapshot requests are asynchronous and are pushed into the I/O pipeline like other I/O operations, and their implementation is similar to write barrier operations. A Parallax snapshot operation simply copies the root radix tree node and marks all its references as read-only. The operation requires three writes (two of them can be issued concurrently), and can be completed in times as low as 10 milliseconds.
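The address-space geometry described in Section 2.3.1 can be sanity-checked with a few lines of arithmetic. The constants come from the text; the helper names are ours, not Parallax's:

```python
BLOCK_SIZE = 4096                    # 4KB blocks, data and radix nodes alike
ENTRIES = BLOCK_SIZE // 8            # 512 eight-byte pointers per radix node
LEVELS = 3                           # 3-level radix tree

# 512^3 addressable data blocks of 4KB each = 512 GB per VDI.
max_vdi_bytes = ENTRIES**LEVELS * BLOCK_SIZE
assert max_vdi_bytes == 512 * 2**30

# The high-order bit of each 64-bit pointer flags a read-only link,
# leaving 63 bits of block address: 2^63 blocks * 4KB = 2^75 bytes
# (32 * 2^70 bytes, the "32 zettabytes" figure quoted above).
READONLY_BIT = 1 << 63

def is_readonly(ptr):
    return bool(ptr & READONLY_BIT)

def block_address(ptr):
    return ptr & (READONLY_BIT - 1)

assert 2**63 * BLOCK_SIZE == 32 * 2**70
```

Masking the read-only bit out of a pointer recovers the plain block address, which is how a traversal can both follow a link and know whether the subtree beneath it is writable.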
To implement snapshots, the high-order bit in block references is used to indicate that a block is read-only. On every access, the VDI radix tree is traversed from the root node down the tree. If a reference is marked read-only, this indicates that the whole subtree the reference points to is read-only. After a VDI is snapshotted, writes might not be applied in place if there is no direct path of writable links from the root to the data block. In the case of such a fault, the target block is copied and written into a newly allocated block (i.e., Copy-on-Write). The old radix root node's reference is appended, along with a time-stamp, to a snapshot log. Snapshot log radix roots are implicitly read-only; however, they can be used as a reference to create new VDIs. This cloning is achieved by making a copy of the snapshot's radix root and marking all its references as read-only (similar to the snapshot operation). VDI logs form a system-wide tree that shows VDI historical dependencies.

[Figure 2.2: Virtual Disk Snapshot.]

Figure 2.2 shows an example of a snapshot operation on a simplified VDI radix tree. The 3-level radix tree maps 6-bit virtual addresses, where at each level 2 bits of the virtual address are resolved. A snapshot is taken by copying the radix root node and marking all its references as read-only (dashed links). The VDI record's radix root link is redirected to point to the new radix root node. A write to address 111111 cannot be handled in place as the corresponding data block is marked as read-only.
Instead the VDI creates private writable copies of these nodes and links them to the new radix tree node. Note that this process does not require copying data blocks as write operations are carried out at block granularity.

2.4  Blockstore

Parallax instances share a common blockstore. In distributed storage systems sharing is often coordinated using distributed locks. By design Parallax avoids the use of such locks so that Parallax instances can continue functioning in the face of failure and disconnection. This guided Parallax's approach to blockstore layout and lock requirements.

2.4.1  Extent Based Access

The blockstore is carved into fixed-size extents (2GB in the current implementation). These extents represent lockable single-allocator regions. Locking in Parallax ensures that writes cannot conflict and keeps block allocation from becoming a bottleneck on the data path.

Figure 2.3: Parallax Blockstore Extents. (The figure shows the extent catalogue and VDI registry in the first extent, per-extent allocation bitmaps, and the extent and VDI locks held by individual hosts.)

Extents are typed as either metadata or data extents.
Metadata extents hold snapshot log blocks and radix tree nodes, while data extents hold VDI data blocks. When a Parallax instance attaches itself to the blockstore, it allocates an extent of each type for its use. From there on it is free to modify the unallocated regions of the extents it has claimed. The first extent of the blockstore is a system extent and holds special information about the system such as the size of the blockstore, the extent size and the extent count. It also contains an extent registry and a VDI registry. The extent registry holds information about each extent's type and current lock holder. The VDI registry, in the form of a radix tree, maps VDI IDs to VDI structures that contain information about the VDI in addition to a persistent VDI lock.

2.4.2  Lock Management

Extent locks and VDI locks are orthogonal; however, both work in tandem to satisfy I/O operations in Parallax. An extent lock allows its holder to allocate unused blocks from an extent. On the other hand, a VDI lock allows its holder to safely write to any of its blocks regardless of which extent the blocks reside in. A Parallax instance is free to write in-place to any writable block of a VDI it holds a lock on. Writes to read-only or sparse regions will fault and require block allocations from locked (i.e., claimed) extents. Locking is required only for infrequent operations such as claiming an extent (for block allocation), and creating and deleting VDIs. Unless an allocated extent's free space is exhausted, regular I/O (reads and writes) and snapshot operations do not require lock acquisition. This locking policy increases scalability because no lock manager acts as a bottleneck that limits I/O throughput. It also enhances reliability, as Parallax instances can continue running in the face of failure (e.g., lock manager failure or network disconnection) as long as they do not run out of free space in the extents they have claimed.
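The division of labour between the two lock types can be summarized in a small Python model (all structure and field names below are ours, purely for illustration): the common-case write touches no lock machinery at all, while a faulting write only consumes blocks from an extent this host already holds.

```python
def write_block(vdi, addr, data, extent):
    """In-place write when the link is writable; otherwise fault and
    allocate a fresh block from an extent this instance has locked."""
    assert vdi['lock_holder'] == 'me'         # VDI lock held locally
    link = vdi['links'][addr]
    if link['writable']:
        link['block']['data'] = data          # no lock acquisition needed
    else:
        assert extent['lock_holder'] == 'me'  # extent lock permits allocation
        new_block = {'data': data}
        extent['blocks'].append(new_block)
        vdi['links'][addr] = {'writable': True, 'block': new_block}
```

In this model, neither path contacts a lock manager on the I/O path; the only precondition is that the instance already holds the VDI lock and at least one non-full extent.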
Chapter 3

Design and Architecture

This chapter presents Parallax's garbage collector and free space compaction mechanisms. The chapter starts by listing Parallax's shortcomings, thereby motivating the need for free space reclamation and compaction utilities. Later on, a set of design objectives and principles are presented. Finally, the chapter concludes by presenting architectural changes in Parallax.

3.1  Parallax Shortcomings

3.1.1  Allocate-only Storage System

Parallax prior to the work presented in this thesis was an allocate-only storage system. In other words, Parallax consumes storage space but has no provisions for releasing storage space that is no longer needed. A Parallax deployment starts with a fresh blockstore. Parallax instances consume unused space in the blockstore until no free storage space is available anymore. If Parallax needs to allocate new blocks at this point to satisfy I/O operations, it will block indefinitely, rendering its services unavailable. The need for a garbage collector in Parallax is therefore imperative. The lack of one in the prototype is most likely due to expediting the implementation of a Parallax prototype; it might also be explained by the fact that implementing a garbage collector for Parallax remains tricky due to Parallax's fine-grained sharing of disk blocks.

3.1.2  Data Fragmentation

Data fragmentation is bad for I/O performance, and subsequently is bad for virtual machines' performance. File systems are built with the assumption that the underlying address space is linearly laid out on disk. They lay out data in a way that minimizes the number of disk seeks required to satisfy I/O requests. Data fragmentation is an inherent problem of Copy-on-Write (CoW) storage systems. Like any other CoW system, Parallax is also vulnerable to data fragmentation.
The problem is further exacerbated in Parallax due to the fine-grained sharing of blocks and the expected heavy use of the inexpensive snapshot and cloning primitives it provides. Because these primitives are implemented using Copy-on-Write mechanisms, we expect a high level of block sharing and consequently a high level of data fragmentation of VDIs. Another way Parallax exacerbates data fragmentation is through VM migration. A virtual machine's I/O requests to some VDI are satisfied by the Parallax instance of the physical machine hosting that virtual machine. That Parallax instance has to allocate blocks for the VDI from an extent it owns to satisfy some I/O requests, which leads to some blocks of the VM's VDI being located in the extent belonging to that Parallax instance. When the VM is migrated to another physical host, it is serviced by another Parallax instance that satisfies block allocations from another extent that it owns. The consequence is that the data blocks of a VDI end up fragmented over different extents of the blockstore.

3.1.3  Free Space Fragmentation

A Parallax free space reclamation utility would allow users to delete VDIs and snapshots. Assuming Parallax had such a utility, free space reclaimed through it would be fragmented in the blockstore. An example of deleting a VDI's snapshots is given in Figure 3.1. A simplified virtual disk consisting of four blocks is snapshotted twice. Blocks 3 and 1 are rewritten after taking the first snapshot, while blocks 0 and 1 are rewritten after taking the second snapshot. Deleting the two snapshots later on will cause the freed blocks to be dispersed in the underlying disk's address space.
Figure 3.1: Simplified Demonstration of Free Space Resulting from Deleting a VDI's Snapshots.

The fragmentation of free space reclaimed due to garbage collection will depend on several factors, among them the block allocation policy, the frequency of CoW operations, write operation patterns and the data retention policy. This thesis does not try to analyze free space fragmentation in a CoW storage system such as Parallax. However, it is clear that such a system, given its current allocation policies, will cause reclaimed blocks to be dispersed over the blockstore. If free space fragmentation is light, a temporary remedy would be avoiding block allocation from heavily fragmented chunks of free blocks. In the long term, however, free space fragments will grow in quantity and shrink in size. Eventually, this will force allocation of blocks from these heavily fragmented free block runs, consequently causing VDI data fragmentation. As discussed in Section 3.1.2, data fragmentation is bad for performance. In a Copy-on-Write storage system some sources of data fragmentation are unavoidable; however, other sources, such as free space fragmentation, can be avoided. In this thesis we design a free space compaction utility for Parallax to coalesce free space in the hope of alleviating VDI data fragmentation.

3.2  Design Objectives

The following paragraphs briefly describe the objectives set for the garbage collector and free space compactor.
Efficient and Scalable Free Space Reclamation  Any implementation of a free space reclamation utility for the Parallax storage system has to be efficient and scalable. It has to be efficient in that it should minimize the I/O operations and memory space required. It also has to be scalable in that it should be possible to reclaim free space from large Parallax deployments (e.g., exabyte-scale systems).

Minimize Data Fragmentation  While some sources of data fragmentation cannot be controlled, free space fragmentation is one source that can be. Free space coalescing would alleviate the VDI data fragmentation problem to a certain degree. A free space compaction utility that coalesces free space in Parallax deployments needs to be implemented in order to address this problem.

Minimal Impact on the I/O Path  Any free space reclamation and management solution should minimize its performance impact on the I/O path. Slowing down the I/O path could adversely affect virtual machine performance; this is especially true for virtual machines exhibiting intensive I/O workloads. It is therefore critical to minimize the performance impact on the I/O path when incorporating free space management mechanisms in Parallax.

Live Operation  It should be possible to run free space administrative utilities on a Parallax deployment without disrupting its services to client VMs. Besides not disrupting client operation, these administrative utilities should minimize interference with the I/O path of client requests. Ideally, failures in free space management processes should not cause the failure of client VMs' I/O requests. Parallax provides storage services to clients with high availability guarantees, and such utilities should keep downtime as small as possible.

Fault Tolerance  Data centers consist of hundreds of machines that are connected through a single network.
Failure of any of the hosts or network links should be treated as the norm rather than the exception, and the free space utilities should be designed with this consideration in mind. To be more specific, the free space utilities should not leave the storage system in an inconsistent state in the case of network disconnection or host failure.

3.3  Design Highlights

The following sections discuss some design decisions that made implementing garbage collection and free space compaction possible in Parallax.

3.3.1  Physical Address and Location Decoupling

Free space coalescing requires moving data blocks on disk. Consistent data block movement requires updating all references to these blocks across the system. Parallax's cheap snapshot and cloning facilities allow an unprecedented level of block sharing. As a consequence, a block might be referenced by hundreds or thousands of VDIs and snapshot images. Therefore moving a block requires many reference updates, each of which potentially incurs at least one I/O operation. Nor is this the only cost involved in moving blocks, as the references to moving blocks need to be identified in the first place. A back-reference tracking system could potentially solve the problem of efficiently identifying references. However, we are not aware of the existence of any efficient distributed back-reference tracking system, and it is our belief that implementing such a system is quite challenging. Fine-grained sharing of blocks along with heavy use of CoW operations would yield a high volume of back-reference updates. Therefore, a simple and naive implementation of back-reference tracking would significantly degrade I/O path performance. On the other hand, tracking reference updates offline and out of the I/O path would make providing a consistent and up-to-date view of the back-reference database challenging for a distributed storage system such as Parallax.
Instead we chose to avoid the reference update problem altogether by introducing a block address indirection layer. The newly introduced address space is called the machine address space, and it decouples Parallax physical block addresses from block locations on disk. Moving a block's location on disk requires updating a single physical-to-machine address map entry, no matter how many references its physical address has in the blockstore. In other words, references in the Parallax blockstore, which use physical addresses, are no longer affected by disk block movement.

3.3.2  Minimal Interference with the I/O Path

In Parallax, operations that require coordination between Parallax instances are kept out of the way of the I/O path as much as possible. This allows Parallax instances to continue serving I/O requests in the face of network partitioning or failure. Coalescing free space in Parallax requires moving live blocks in the blockstore. A free space compaction utility has to coordinate the movement of blocks with the Parallax instances sharing the blockstore. Parallax instances need to notify the compaction utility of any writes they make to moving disk blocks. If compaction is implemented naively, it can cause I/O operations to depend on network connectivity, therefore contradicting Parallax's aforementioned design principle. Unlike I/O operations, free space compaction is not critical to the operation of the system. Therefore the compaction protocol should be designed in such a way that failures in the compaction process do not degrade the availability of I/O services.

3.3.3  Fault-tolerant Free Space Compaction

The previous design choice of giving I/O operations priority over free space compaction requires that free space compaction can be aborted safely. Free space compaction implements a special commit protocol that allows the system to survive a diverse set of failure scenarios.
This protocol allows for the safe abortion of free space compaction at any point during the compaction process while maintaining the consistency and integrity of data in the Parallax shared blockstore.

3.4  Architecture

There are no significant changes to the architecture of Parallax. Figure 3.2 is a modified version of Figure 2.1 that represents the Parallax architecture. The figure shows how the components implemented as part of this work fit into the Parallax architecture. The figure also omits any details that are unnecessary to the discussion of the work presented in this thesis. Two of the newly introduced components (i.e., the RPC server and the block address indirection layer) are part of the Parallax instance that serves as a driver of the tapdisk process. The other three components (i.e., the free space compactor, the garbage collector and the Parallax registry) are stand-alone utilities that are envisioned to run in their own storage management appliance VM. Most of the new components are self-descriptive in terms of purpose, except for the Parallax instance registry and the RPC server. The first of these is responsible for keeping track of all active Parallax instances in the system. The second allows other tools, more specifically the free space compaction utility, to communicate with Parallax instances. The purpose and the implementation details of all the newly introduced components are described further in Chapter 4.
Figure 3.2: Parallax Architecture with Free Space Reclamation and Management Utilities.

Chapter 4

Implementation

This chapter describes the details of the garbage collector and free space compactor of the Parallax storage system. Section 4.1 describes the indirection layer used to simplify free space compaction. Sections 4.2 and 4.3 describe the implementation of two services (i.e., the Parallax RPC server and the instance registry) which are used to coordinate the free space compaction process. Section 4.4 describes the implementation of Parallax's garbage collector. Finally, Section 4.5 describes the implementation of the free space compaction utility.

4.1  Disk Object Layer

The disk object layer adds a new level of indirection to block addressing. This indirection layer decouples block addresses from their locations on disk, therefore eliminating the need to update references on block movement. Figure 4.1 shows the address translation layers in Parallax before and after adding the disk object layer. The disk object layer manages allocations and I/O accesses to the underlying storage. It provides the disk object abstraction that serves as a persistent storage container. Disk objects consist of a group of blocks that are contiguous in the disk address space.
The terminology used to describe this level of indirection is similar to that used to describe the extra level of indirection in the memory systems of virtual machines. A physical address is used to refer to a disk object, whereas a machine address is used to refer to a block address in the underlying storage. Once allocated, a disk object's physical address remains constant throughout its lifetime, even though its location on disk can change.

Figure 4.1: Address Translation Layers in Parallax before and after Adding the Disk Object Layer.

This simplifies disk object movement, as moving a disk object does not involve updating all references to it. A disk object map is added to the beginning of each Parallax extent. The map contains the information necessary for translating physical addresses of disk objects into machine addresses. The map consists of a table of disk object descriptors. Figure 4.2 shows the disk object descriptor fields. The valid bit indicates whether the entry is free or refers to a valid disk object. The offset field determines the offset of the disk object's first block within the extent. The size field indicates how many blocks a disk object consists of. Disk object sizes are a power-of-two multiple of Parallax blocks, where object size = 2^(size field) * Parallax block size (the Parallax block size in the current implementation is 4 KB). Currently, all disk objects used in Parallax have a size equal to one Parallax block (4 KB). However, larger disk objects are provisioned to give Parallax more control over the layout of data on disk. Multi-block disk objects guarantee that the blocks they consist of are contiguous on disk.
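The descriptor encoding can be sketched as follows, using the field widths shown in Figure 4.2 (1-bit valid flag, 8-bit size field, 23-bit offset); the exact bit order within the 4-byte word is our assumption, chosen only for illustration.

```python
BLOCK_SIZE = 4096   # Parallax block size in the current implementation

def pack_descriptor(valid, size_field, offset):
    """Pack a 4-byte disk object descriptor: 1-bit valid flag,
    8-bit size field, 23-bit block offset within the extent."""
    assert valid in (0, 1) and size_field < (1 << 8) and offset < (1 << 23)
    return (valid << 31) | (size_field << 23) | offset

def unpack_descriptor(d):
    return (d >> 31) & 1, (d >> 23) & 0xFF, d & ((1 << 23) - 1)

def object_size(size_field):
    # object size = 2^(size field) * Parallax block size
    return (1 << size_field) * BLOCK_SIZE

assert object_size(0) == 4096   # the common case: one 4 KB block
```

A size field of 0 therefore denotes the common single-block object, while larger values describe power-of-two runs of contiguous blocks.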
Super paging is a feature implemented in previous work [2] that can make use of variable sizes of disk objects.

Figure 4.2: Extent Anatomy Showing the Disk Object Map. (A 2 MB map of disk object descriptors precedes the extent's data blocks; each descriptor packs a 1-bit valid flag, an 8-bit size field and a 23-bit offset field.)

Each disk object descriptor requires 4 bytes of disk space. The smallest disk object is a single Parallax block, which currently has a size of 4 KB. In the worst case where all disk objects are 4 KB blocks (the common case in Parallax), each block needs a disk object descriptor to address it. Thus the disk space overhead of disk object descriptors is 4 bytes per 4 KB, or about 0.1%. The current implementation uses 2 GB extents, which require 2 MB disk object maps. I/O operations on disk objects specify the physical address of the disk object of interest. The disk object layer is responsible for translating the physical address into a machine address in order to carry out these I/O operations. Figure 4.3 demonstrates the steps required to carry out this address translation. First the physical address is split into two parts: an extent index and a disk object descriptor index. The extent index determines the extent where the block of interest resides. It also determines the extent map to use for the disk object descriptor's lookup. The descriptor index is then used to select the corresponding disk object descriptor from the formerly selected extent map. Combining the offset field of the selected disk object descriptor with the extent index yields the machine address of the disk object.
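The translation steps above amount to a few shifts and masks. In the following sketch (ours; the in-memory map representation is hypothetical), the 19-bit descriptor index reflects the 2^19 four-kilobyte blocks of a 2 GB extent, matching the bit ranges shown in Figure 4.3:

```python
DESC_BITS = 19    # 2 GB extent / 4 KB blocks = 2^19 descriptors per extent

def translate(phys, extent_maps):
    """Translate a disk object's physical address to its machine address.
    `extent_maps[extent][index]` holds that descriptor's block offset."""
    extent = phys >> DESC_BITS                 # which extent, hence which map
    index = phys & ((1 << DESC_BITS) - 1)      # which descriptor in that map
    offset = extent_maps[extent][index]
    return (extent << DESC_BITS) | offset      # recombine with the offset
```

Note that a compactor that moves a disk object only rewrites the single offset stored in its descriptor; the physical address, and therefore every radix tree reference to it, is unchanged.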
Figure 4.3: Physical to Machine Block Address Translation.

Each disk object I/O access needs to carry out this translation of a disk object's physical address to a machine address. A naive way of implementing the I/O path is to read the disk object descriptor of interest from disk on every I/O access. However, incurring a synchronous disk access on every I/O access is prohibitively expensive. For this reason Parallax clients cache recently read disk object descriptors in memory. The current implementation caches chunks of disk object descriptors at the granularity of Parallax blocks (4 KB). The cache size is configurable at build time and uses LRU as its eviction policy. Besides serving as a translation table, the disk object map implicitly holds information about disk block allocation (block liveness). When a Parallax instance locks an extent it constructs an in-memory block allocation bitmap using the disk object map. Parallax instances use this bitmap to identify free blocks when allocating disk objects. A disk object allocation is carried out by allocating a free disk object descriptor and a chunk of free contiguous blocks on disk. The disk object descriptor's size and offset are set to reflect the allocated chunk's size and offset within the extent. The allocation is finalized by writing back the modified descriptor to disk. Incurring a disk access for each disk object allocation is prohibitively expensive. Instead, the cost of accessing the disk is amortized by pre-allocating batches of disk objects. The physical addresses of these disk objects are pooled in memory to serve subsequent allocation requests.

4.2  Parallax RPC Server

An RPC server has been added to each Parallax instance and is used to serve requests from other Parallax instances and utilities.
Parallax remote server procedures can be divided into two categories according to their purpose: action delegation and distributed coordination. In the first category, a client carrying out an operation that requires access to a shared resource can delegate the operation to the Parallax RPC server holding the exclusive lock over that resource. In the second category, the server listens to notifications about changes in the state of the system, so that it can carry out the actions necessary to achieve consistency in the distributed system. Table 4.1 shows the procedures provided by the Parallax RPC server. The purpose of these procedures will be discussed in Sections 4.4.1 (first two procedures) and 4.5.2 (the last procedure).

- SnapIncRefCount (Delegation): Increments the reference count of a given snapshot record.
- SnapDecRefCount (Delegation): Decrements the reference count of a given snapshot record.
- NotifyRemapChange (Coordination): Notifies the Parallax client of a remap state change.

Table 4.1: Parallax RPC Server Procedures.

4.3  Parallax Instance Registry

The Parallax instance registry holds information about the active Parallax instances in a Parallax deployment. This information is stored in a dedicated region in the blockstore. The registry is represented as a list of tuples, each of which corresponds to a Parallax instance. Each tuple has the form <Parallax ID, Hostname, Port>, where Parallax ID is the unique identifier of a Parallax instance, Hostname is the name of the physical machine where the Parallax instance is running and Port is the port number used by the corresponding Parallax RPC server. The ability to identify all active Parallax instances and their server addresses is essential for coordinating operation in the distributed system. As will be explained in Section 4.5.2, all Parallax instances need to be notified of changes to the compaction state of extents.
This list of all active Parallax instances can be retrieved through the Parallax instance registry. When a Parallax instance starts, it registers itself in the Parallax instance registry. To guarantee the consistency of the Parallax registry, all updates and queries are coordinated using the global lock master. This does not conflict with Parallax's minimal-use-of-locks design principle, as these updates are infrequent and do not interfere with the critical operation of the system (i.e., the I/O path).

4.4  Free Space Reclamation

Free space reclamation in Parallax goes through two stages. The first is marking storage abstractions (i.e., VDIs and snapshots) for deletion, and the second is reclaiming the disk space that was exclusively referenced by these abstractions. The next two sections discuss these two stages more thoroughly, revealing the challenges they pose and their implementation details.

4.4.1  Deleting VDIs and Snapshots

Parallax users can delete two types of abstractions: virtual disks and snapshots. Table 4.2 lists the two tools that have been implemented to mark abstractions as deleted. The first tool deletes a virtual disk and all its exclusive snapshots. The second tool is used to delete individual snapshot records. These tools merely set a delete field in either the VDI structure or the snapshot record on disk to indicate their deletion.

- vdi delete: Delete a virtual disk and all its exclusive snapshots.
- snap delete: Delete an arbitrary snapshot.

Table 4.2: Parallax Delete Tools.

In order to be able to determine which snapshots are exclusive to a certain VDI, a reference count field has been added to each snapshot record. The reference count reflects how many VDIs and snapshots refer to the corresponding snapshot. A non-zero reference count means that the snapshot is still alive; once the reference count drops to zero it is considered safe to delete.
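This delete rule, decrementing a snapshot's reference count and cascading along its chain of parents until a still-referenced snapshot is found, can be sketched as follows (the dictionary shape of a snapshot record here is hypothetical):

```python
def delete_snapshot(snap):
    """Decrement reference counts along a snapshot chain. Deletion cascades
    from the given record through its parents and stops at the first
    snapshot that is still referenced by another VDI or snapshot."""
    while snap is not None:
        snap['refcount'] -= 1
        if snap['refcount'] > 0:
            return                    # first shared snapshot: stop here
        snap['deleted'] = True
        snap = snap['parent']
```

For a chain in which one intermediate snapshot was also used to clone a second VDI, deleting the original VDI removes only the snapshots exclusive to it and leaves the shared snapshot, now with a reference count of one, alive.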
Figure 4.4 shows an example of cloning and deleting two virtual disks to help illustrate how reference counts work. Initially, at time t0, VDI 1 has four snapshot images in its history. Note that all the reference counts at this point are equal to 1, as each snapshot represents a direct chronological dependency for either a VDI or another snapshot image. At time t1 a new VDI is cloned from the second snapshot of VDI 1, and the reference count of that snapshot image is therefore incremented. This indicates that there are two storage abstractions (i.e., VDI 2 and the third snapshot of VDI 1) that are directly dependent on that second snapshot. At time t2 VDI 1 is deleted; as a consequence, the reference count of the snapshot it is directly dependent on is decremented. Once a snapshot's reference count reaches zero it is considered to be deleted. This in turn causes a reference count decrement of the snapshot that represents a direct dependency of the deleted snapshot. This snapshot reference count decrement recurs until a snapshot that has a reference count greater than one is encountered. In other words, deleting snapshots stops when the first shared snapshot is reached.

Figure 4.4: Creating and Deleting Virtual Disk Images Sharing Common History.

Marking VDI structures as deleted cannot result in any race conditions, as access to these structures is coordinated through VDI locks. On the other hand, snapshot records can be accessed by multiple Parallax instances (or tools) concurrently. For example, one process might be cloning a VDI from a snapshot while another process is deleting it. Before implementing space
reclamation in Parallax, no coordination of access to snapshot records was needed, as there were no possible race conditions. Now that snapshot records can be deleted, however, this race condition can arise. Consequently, access to these structures requires some sort of coordination. It was decided to coordinate accesses to snapshot records using the existing VDI locks instead of introducing a new set of locks. The rationale behind this decision is to keep the implementation simple by sticking to Parallax's original locking design, which has been shown to be scalable. One potential drawback to using locks for coordinating snapshot record accesses is that VDI/snapshot delete and clone operations might stall indefinitely on lock failures. However, it can be argued that in most use cases clone/delete operations are not critical to the operation of virtual machines, and therefore these faults can be tolerated until further administrative intervention. The choice of which VDI lock to obtain in order to access a snapshot record is simple. Snapshot records reside in snapshot log blocks that are initially created to serve some specific VDI. Records residing in snapshot logs created by some VDI are considered to be owned by that VDI. Subsequently, any access to a snapshot record should be coordinated through its owner's VDI lock. Parallax clients will not always be able to obtain VDI locks. Parallax instances often hold VDI locks for long periods: often as long as the lifetime of the virtual machines consuming these VDIs. Forcing a Parallax instance to release a VDI lock so that another client can access a snapshot record would prevent the virtual machine monitor using that VDI from writing blocks to it or taking snapshots of it. Moreover, if the client briefly taking over the VDI lock fails to relinquish it for any reason (e.g., network failure or software bug), the virtual machine's execution will be blocked indefinitely.
Instead of relinquishing a VDI lock to access snapshot records, clients that need to access snapshot records coordinated by an already-held VDI lock delegate these access operations to the lock holder. This delegation of access is implemented in the form of a remote procedure call interface associated with each Parallax instance (see Section 4.2).

When a client needs to access a snapshot record, it first checks which VDI owns the snapshot log block that contains the record of interest. A VDI owner field has been added to the snapshot log block for this purpose. Next, the client tries to acquire the corresponding VDI lock. If it succeeds, it accesses the snapshot record itself. However, if the lock is already held by another process, the client queries the lock master for the identity of the lock holder. The client uses the resolved identity to query the Parallax instance registry for the corresponding Parallax RPC server address. Finally, the client delegates the desired operation to the lock holder by issuing the proper remote procedure call to the resolved Parallax RPC server.

Delegating access to snapshot records over the network comes at a price. For example, network failures will prevent clients from accessing snapshot records. However, the operations that need access to snapshot records are those that provision and delete virtual disks. Neither of these operations is critical to the operation of virtual machines in the system, and both can be deferred to manual administrative intervention.

4.4.2  Garbage Collection

The previous section (Section 4.4.1) discussed how storage abstractions (i.e., VDIs and snapshots) in Parallax are marked as deleted. This marking of storage abstractions does not automatically relinquish the disk space exclusively referenced by these deleted abstractions.
Disk blocks that are no longer referenced by any storage abstraction need to be identified and reclaimed back to free space. This free space reclamation process is called garbage collection. In general, garbage collection operates on a set of objects such as memory allocations or disk blocks. These objects can reference each other, forming graphs. An object is reachable through another object if there exists a path in the object graph from the latter object to the former. The objective of a garbage collector is to identify all objects that are not reachable through a set of root objects and reclaim them back to free space. The set of root objects is assumed to be always reachable.

Algorithm 1 Garbage Collector
 1: for every extent where extent ∈ extents to gc do
 2:   lock extent
 3: end for
 4: for every extent where extent ∈ (extents to gc ∪ metadata extents) do
 5:   allocate disk object reachability bitmap for extent
 6: end for
 7: for every VDI in Parallax deployment do
 8:   if VDI is not deleted then
 9:     Mark VDI's radix tree root's corresponding bit in the reachability bitmaps.
10:   end if
11:   current snapshot ⇐ VDI's most recent snapshot.
12:   while current snapshot ≠ NULL do
13:     if current snapshot is not deleted then
14:       Mark current snapshot's radix tree root's corresponding bit in the reachability bitmaps.
15:     end if
16:     current snapshot ⇐ current snapshot's parent.
17:   end while
18: end for
19: for i = 1 to radix tree depth do
20:   for every extent where extent ∈ metadata extents do
21:     for every bit where bit ∈ extent's reachability bitmap AND bit is set do
22:       radix tree node ⇐ read bit's corresponding block from disk.
23:       for every valid entry ∈ radix tree node do
24:         Mark entry's corresponding bit in the reachability bitmaps.
25:       end for
26:     end for
27:   end for
28: end for
29: for every extent where extent ∈ extents to gc do
30:   for every bit where bit ∈ extent's reachability bitmap AND bit is unset do
31:     reclaim corresponding disk object as free.
32:   end for
33:   unlock extent
34: end for

Garbage collection techniques can be divided into two broad categories: reference counting and mark-and-sweep algorithms. As implied by its name, reference counting garbage collection is accomplished by maintaining a reference count for each disk block. Whenever a reference to some block is created or destroyed, the corresponding reference count is incremented or decremented respectively. A simple reference counting implementation, however, suffers from some drawbacks. For one, maintaining reference counts would slow down the I/O path on faulting write operations. Another drawback is the complexity of coordinating reference count updates from different Parallax instances in the distributed system.

The alternative approach, and the one used to implement garbage collection for Parallax in this work, is mark-and-sweep garbage collection. The generic mark-and-sweep algorithm searches for all reachable objects by traversing the graph and keeping track of reachable objects. It starts with a set of reachable objects containing the root objects. Objects are added to the reachable set iteratively, where in each iteration objects referenced by any reachable object are added to the set. When no more objects can be added to the set of reachable objects, all unmarked objects are assumed to be free to reclaim. In Parallax, the objects of interest are disk objects (see Section 4.1). In general, the number of iterations the mark-and-sweep algorithm makes is indefinite and depends on the size and structure of the object graph. However, in Parallax, disk objects are arranged in directed acyclic graphs of a predetermined maximum depth.
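For illustration only, the generic marking loop can be sketched as follows; the object identifiers, reference map and root set here are hypothetical stand-ins for Parallax disk objects, not the actual on-disk structures.

```python
# Minimal mark-and-sweep sketch over an object graph. Starting from the
# root set, each iteration marks every object referenced by an already
# reachable object; whatever remains unmarked is garbage.

def mark_and_sweep(objects, references, roots):
    """objects: set of object ids; references: id -> ids it points to;
    roots: ids assumed always reachable. Returns the unreachable ids."""
    reachable = set(roots)
    frontier = set(roots)
    while frontier:  # iterate until no new objects become reachable
        next_frontier = set()
        for obj in frontier:
            for ref in references.get(obj, ()):
                if ref not in reachable:
                    reachable.add(ref)
                    next_frontier.add(ref)
        frontier = next_frontier
    return objects - reachable  # unmarked objects are free to reclaim

# A toy graph: objects 5 and 6 are unreachable from the root object 1.
objects = {1, 2, 3, 4, 5, 6}
references = {1: [2, 3], 2: [4], 5: [6]}
garbage = mark_and_sweep(objects, references, roots={1})
```

Because Parallax disk objects form graphs of bounded depth, a loop like the one above needs only as many marking iterations as the maximum graph depth.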
Thus a mark-and-sweep algorithm for Parallax requires a finite number of iterations equal to the maximum depth of VDI radix trees (currently set to three). The garbage collector's workings are given in Algorithm 1. The algorithm takes as input a set of extents to garbage collect. The following paragraphs briefly explain the garbage collector's algorithm. Note that in this context, marking a disk object or checking whether it is marked reflects operations on its corresponding bit in an in-memory reachability bitmap.

• Extent Locking (lines 1-3) The garbage collector locks the extents that will be garbage collected. There are two reasons for locking these extents. First, the garbage collector needs exclusive access to the disk object map, as it might need to be updated to reflect disk object reclamations. Secondly, locking ensures that no Parallax instance will allocate disk objects from the extent during garbage collection, which in turn ensures that no references to unmarked disk objects are created during the process. This makes all unmarked disk objects in the extent safe to reclaim back to free space. One potential drawback of locking extents for garbage collection is that I/O operations that require allocating disk blocks might stall because all extents are locked. However, a lack of unlocked extents is usually a result of a lack of free space in the blockstore rather than an anomaly in the garbage collector's implementation. This is especially true if the garbage collection policy is set to reclaim free space from extents that have no free space (i.e., are fully allocated). In other words, even if a lockless approach to garbage collection were adopted, this stall in I/O operations would still happen as a result of the exhaustion of free space in the blockstore.
• Initialize Reachability Bitmaps (lines 4-6) The garbage collector creates an in-memory reachability bitmap (initialized to all zeros) for each extent that is either a metadata extent or being garbage collected. Any metadata extent might contain references to a data extent under garbage collection. To ensure that no reachable disk object in data extents is missed during the marking process, reachability is tracked through all metadata extents, and each of these is allocated a reachability bitmap.

• Mark Live Radix Tree Root Blocks (lines 7-18) The garbage collector starts by marking the root disk objects, which consist of all radix tree roots belonging to live VDIs or snapshots. This requires the garbage collector to traverse all the VDIs and snapshot logs of the Parallax deployment.

• Marking Rounds (lines 19-28) The garbage collector continues searching for and marking reachable disk objects. It does so by iterating over marked disk objects and marking the disk objects they refer to. As discussed earlier, it is sufficient to run these marking iterations a number of times equal to the radix tree maximum depth to guarantee that all reachable disk objects are marked.

• Space Reclamation and Extent Unlocking (lines 29-34) The unmarked disk objects are reclaimed back to free space. This is done by resetting the corresponding valid bit in each object's disk object descriptor in the disk object map. Finally, the extent locks held throughout the GC process are released.

4.5  Free Space Compaction

The free space compactor coalesces free data blocks in an extent to minimize the occurrence of data fragmentation. The following sections describe the implementation of this tool in two steps. The first presents an offline version of the free space compactor, while the second presents an online version.
The first version of the tool is assumed to have exclusive access to the Parallax blockstore, while the second can operate concurrently alongside Parallax instances.

4.5.1  Offline Compactor

The workings of the offline compactor are given in Algorithm 2. The compactor takes the ID of the extent to be compacted as its sole argument. The offline compaction process consists of three phases: planning, copying and committing.

Algorithm 2 Offline Compactor
 1: Read extent's disk object map.
 2: Calculate the block allocation bitmap from the disk object map.
 3: new disk object map ⇐ disk object map
 4: for i = 1 to disk object descriptors per extent do
 5:   descriptor ⇐ disk object map[i]
 6:   descriptor.new offset ⇐ CalculateNewLocation(block allocation bitmap, descriptor.size)
 7:   new disk object map[i] ⇐ descriptor
 8: end for
 9: Read extent contents from the blockstore into memory buffer extent buffer.
10: for i = 1 to disk object descriptors per extent do
11:   descriptor ⇐ disk object map[i]
12:   new descriptor ⇐ new disk object map[i]
13:   if descriptor.offset ≠ new descriptor.offset then
14:     Copy descriptor.size blocks from descriptor.offset to new descriptor.offset in extent buffer
15:   end if
16: end for
17: Write modified blocks in extent buffer to the blockstore.
18: Write the new disk object map to disk.

• Planning Phase (lines 1-8) In this phase the compactor computes the new value that the disk object map should be assigned to achieve free space coalescing. First the compactor reads the extent's disk object map (line 1) and computes the block allocation bitmap (line 2). Next, the new locations of disk objects are calculated. The function "CalculateNewLocation" is used to determine the new locations of disk objects. This function can be implemented to achieve different compaction policies, which are explained further later in this section.
• Copying Phase (lines 9-17) During this phase the compactor moves disk objects to their new locations based on the differences between the new disk object map and the original one. It starts by reading the extent's content from the blockstore (line 9). It then copies the disk objects in memory from their old to their newly assigned locations (lines 10-16). Finally, it writes the modified extent blocks back to the blockstore (line 17).

• Commit Phase (line 18) The final phase writes the new disk object map to disk, thereby committing disk objects to their new locations on disk.

A special property of this algorithm is that disk objects are only ever moved to free extent space. The implication is that a failure in the compaction process during any phase will not corrupt the extent. If compaction fails during the copying phase, the disk object map still points to valid disk objects, as old disk object locations are never overwritten. Updating the disk object map is accomplished using atomic sector writes, so a failure during the commit phase results in a partial update of the disk object map. In the worst case this yields a partial movement of disk objects, while all disk object descriptors still point to valid disk objects.

Two compaction policies have been implemented in this work: "quick" and "intact" compaction. Both policies coalesce allocated blocks at the beginning of the blockstore extent, and consequently coalesce free space at its end. The "quick" policy simply moves disk objects from the end of the extent to fill free runs of blocks at the beginning of the extent. It is named so because it can fully coalesce free space in one run of Algorithm 2. The "intact" policy, on the other hand, maintains the order of disk objects in the extent.
It fills the first free run of blocks in the extent with the first set of allocated blocks that follow that run. One property of this policy is that it maintains the contiguity of existing data in the extent under compaction. This policy, however, incurs more I/O workload than its "quick" peer. Also, the "intact" policy requires multiple runs of the compaction algorithm to fully coalesce free space.

4.5.2  Online Compactor

Running the offline compactor alongside active Parallax instances could corrupt the extent being compacted. More specifically, writes to moving disk objects issued after they are copied and before the disk object map is committed would be lost. Furthermore, given the distributed operation of Parallax, the commitment of multiple Parallax instances to using the new locations of disk objects to serve I/O operations needs to be coordinated. This is necessary to ensure a consistent view of the extent under free space compaction from different Parallax instances.

The online compactor is an extension of the offline version that is safe to operate concurrently alongside active Parallax instances. To achieve this goal the compactor needs to coordinate Parallax instances' accesses to disk objects of the extent under compaction. The online compactor's algorithm is an extension of that of the offline compactor and is given in Algorithm 3. Due to space constraints, duplicated parts of the algorithm are substituted with their corresponding offline-compaction phase name (see Section 4.5.1). The following sections describe the details of the changes related to online compaction.

Blockstore Format Changes

Two changes were made to the blockstore format. The first is the addition of a compaction state field for each extent in the extent registry. The second is the addition of a compaction plan registry. The compaction state field's value indicates whether an extent is being compacted and which phase of compaction it is at.
The values that this field can take on are: NORMAL, COPYING, COMMIT-1 and COMMIT-2. The assignment of two distinct states to the commit phase is necessary to implement a special commit protocol that ensures a consistent switch to using the new locations of disk objects on disk. The I/O path behaviour of Parallax instances depends on the accessed disk object's extent compaction state. This protocol is described gradually throughout the rest of this section.

The compaction plan registry holds information about ongoing compaction operations. Each registry entry contains a pair of fields: the compactor's server address and the new disk object map. The role of the compactor's server is discussed later along with the algorithm. The registry is the means by which the compactor communicates the details of its compaction plans to Parallax instances.

Compaction Algorithm

The online compaction algorithm adds two major changes compared to the offline algorithm. The following paragraphs discuss these changes.

First, the online compactor locks the extent under compaction to ensure that no other Parallax instance or client allocates disk blocks from that extent. The compactor thereby assumes full control over the extent's free space, and can safely copy disk blocks into it.

Secondly, the online compactor notifies Parallax instances when moving from one compaction state to another. This coordination enables Parallax instances to modify their I/O paths to ensure the consistency of the extent under compaction. This I/O path switch is discussed in further detail later in this section. Moving from one compaction state to another requires the consent of all active Parallax instances in the system; otherwise the compaction process is aborted.

Finally, the compactor creates an RPC server to respond to clients' notifications of write operations to moving blocks in the extent under compaction during the COPYING phase.
This server is required to ensure that the blocks being copied by the compactor are up-to-date.

Algorithm 3 Online Compactor
 1: Lock extent.
 2: Planning Phase.
 3: Start the compactor server.
 4: Write <compactor server address, new disk object map> pair to compaction registry.
 5: Mark extent's compaction state as COPYING.
 6: Broadcast extent's compaction state COPYING to all active Parallax instances.
 7: Wait for acknowledgements from all active Parallax instances.
 8: Copying Phase.
 9: Mark extent's compaction state as COMMIT-1.
10: Broadcast extent's compaction state COMMIT-1 to all active Parallax instances.
11: Wait for acknowledgements from all active Parallax instances.
12: Mark extent's compaction state as COMMIT-2.
13: Broadcast extent's compaction state COMMIT-2 to all active Parallax instances.
14: Wait for acknowledgements from all active Parallax instances.
15: Commit Phase.
16: Mark extent's compaction state as NORMAL.
17: Broadcast extent's compaction state NORMAL to all active Parallax instances.
18: Wait for acknowledgements from all active Parallax instances.
19: Unlock extent.

Parallax Instance I/O Path

The I/O path of a Parallax instance changes based on the compaction state it has been notified of by the compactor process. The goal is to maintain a consistent view of the extent under compaction from different Parallax instances at all times. More specifically, this consistency should be maintained even in the case of a failure while transitioning Parallax instances from one compaction state to another. To achieve that, the I/O paths of consecutive compaction states must be compatible with each other. This last invariant could not have been achieved without the introduction of several compaction states in the compaction commit protocol. Table 4.3 summarizes these I/O path changes. Each of these changes is discussed briefly in the following paragraphs.
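As a preview of those per-state rules, the write path for a moving disk object can be sketched as follows. This is a minimal illustration: the function and helper names are hypothetical placeholders, and reads (old location in COPYING/COMMIT-1, new location in COMMIT-2) are omitted.

```python
# Sketch of a Parallax instance's write handling for a moving disk
# object, keyed on the extent's compaction state. "write" and
# "notify_compactor" are hypothetical callbacks, not Parallax APIs.

def write_moving_object(state, data, old_loc, new_loc, write, notify_compactor):
    """Dispatch a write per compaction state; returns locations written."""
    if state == "NORMAL":
        write(old_loc, data)          # regular path: single on-disk location
        return [old_loc]
    if state == "COPYING":
        write(old_loc, data)          # old location stays authoritative
        notify_compactor(old_loc)     # compactor will recopy this object
        return [old_loc]
    if state in ("COMMIT-1", "COMMIT-2"):
        write(old_loc, data)          # duplicate writes keep both old and
        write(new_loc, data)          # new locations up to date
        return [old_loc, new_loc]
    raise ValueError("unknown compaction state: " + state)

# Example: a COMMIT-1 write lands in both locations, with no notification.
written = []
notified = []
write_moving_object("COMMIT-1", b"x", 10, 20,
                    write=lambda loc, data: written.append(loc),
                    notify_compactor=notified.append)
```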
The I/O path change in the COPYING state ensures that no writes are lost during compaction. When a Parallax instance receives a request to write to a moving disk object in this state, it writes the disk object to its old location. It then notifies the compactor's RPC server of this write operation so that the compactor will recopy the newly written content of the disk object to the new location. The write notification can be done asynchronously, off the I/O path, as long as the Parallax instance does not acknowledge a compaction phase switch until all the write notifications it has sent out are acknowledged by the compactor.

The I/O path changes in the COMMIT-1 and COMMIT-2 states ensure a smooth shift from using old disk object locations to using new ones. Splitting the COMMIT phase into two distinct states serves the purpose of switching the serving of read requests for moving disk objects from their old to their new locations. The duplication of writes to both old and new locations in both the COMMIT-1 and COMMIT-2 states enables this gradual shift of reads, as it guarantees that both the old and new locations contain up-to-date values.

Compaction State | Parallax Instance I/O Path for Moving Objects
NORMAL           | Regular I/O path (i.e., reads and writes are satisfied from a single on-disk location).
COPYING          | Read from the old location. Write to the old location and notify the compactor of the write operation.
COMMIT-1         | Read from the old location. Duplicate writes to both the old and new locations.
COMMIT-2         | Read from the new location. Duplicate writes to both the old and new locations.

Table 4.3: Parallax Instance I/O Paths for Different Compaction States.

Fault Tolerance

The compaction algorithm operates in a distributed environment that is prone to network failures. Moreover, the compaction process and Parallax instances are prone to all kinds of software and hardware failures.
The commit protocol presented in this work is designed to deal with all of these kinds of failures. Designing a protocol that is resilient to this broad range of failures has its drawbacks, however. Automating the fault recovery mechanism is hard given the diverse failure scenarios, so fault recovery involves manual administrative intervention. This intervention takes the form of resolving the underlying fault (e.g., restoring network connectivity, replacing failed hardware), followed by restarting the compaction process from the appropriate compaction state. Although administrative intervention is much slower than automated failure recovery, it is acceptable for a free space compaction utility because of its non-critical nature to the overall operation of the system. The worst outcome of a compaction failure is a degraded level of I/O service, as performance suffers from duplicating the writes destined for moving disk objects.

Chapter 5

Evaluation

This chapter evaluates the modifications made to Parallax that are implemented as part of this work. First the experimental setup is introduced in Section 5.1. Sections 5.2 and 5.3 evaluate the performance of the garbage collector and the free space compactor. Finally, Section 5.4 evaluates the impact of this work on Parallax's I/O performance.

5.1  Experimental Setup

One machine was used to carry out the experiments that evaluate the work presented in this thesis. The machine is a 4-way SMP with dual-core Intel Xeon E5420 processors and 16 GB of random access memory. It runs the open source Xen hypervisor and uses Parallax as the backend storage of the experimental virtual machines. A local hard disk serves as the Parallax blockstore.

5.2  Garbage Collector Performance

The performance of the garbage collector is determined by several factors.
Looking closely at the garbage collection algorithm (Algorithm 1), it is apparent that its execution time will depend on the execution time of its two major phases, namely the marking of radix tree roots (lines 7-18) and the remaining live block marking rounds (lines 19-28). The execution time of the first phase depends on the live VDI and snapshot count, while that of the second depends on the amount of live data blocks in the system. The following sections evaluate the impact of each of these factors on the garbage collector's performance.

5.2.1  Live Data

To evaluate the impact of live data size on the performance of the garbage collector, the latter is run on blockstores containing different amounts of live data. More specifically, the blockstore is populated with a varying number of virtual disks, each 6 GB in size. The time it takes the garbage collector to execute over blockstores containing a varying number of independent virtual disks is measured. Figure 5.1 plots the time it takes to garbage collect different amounts of live data in the Parallax blockstore. It is apparent that the time required to run the garbage collector is linearly proportional to the amount of live data.

[Figure 5.1: Garbage Collector Performance for Different Amounts of Live Data Blocks. Axes: Live Data Size (GB) versus Garbage Collection Time (Seconds).]

5.2.2  VDI and Snapshot Count

Next, the impact of increasing the VDI and snapshot count on the garbage collector's performance is evaluated. The impact of each of these factors is evaluated in an independent experiment, where either the VDI count or the snapshot count is varied while keeping the amount of metadata in the system constant.

The experiment to measure the impact of VDI count starts by populating the blockstore with 10 independent 6 GB VDIs. Next, each of these VDIs is cloned an equal number of times, varying from zero to 19 clones† . Finally, the garbage collector is run on the blockstore and its execution time is measured. Figure 5.2 plots the amount of time required to garbage collect different numbers of VDIs in this experiment.

† Note that disk cloning keeps the amount of live data blocks in the blockstore unchanged.

[Figure 5.2: Garbage Collector Performance versus Virtual Disk Image Count. Axes: VDI count versus Garbage Collection Time (Seconds).]

Similar to the VDI count experiment, the experiment to measure the impact of the snapshot count on the garbage collector's performance starts by populating the blockstore with 10 independent 6 GB VDIs. Next, each of these VDIs is snapshotted an equal number of times, varying from 0 to 1000. Finally, the time it takes the garbage collector to finish execution is measured for different snapshot counts. Figure 5.3 plots the amount of time required to garbage collect different numbers of snapshots in this experiment. Note that in this figure, zero snapshots refers to the case where no snapshots are taken of the ten VDIs in the blockstore.

[Figure 5.3: Garbage Collector Performance versus Snapshot Count. Axes: Snapshot count versus Garbage Collection Time (Seconds).]

It is apparent from both of the aforementioned experiments that the VDI and snapshot counts have no significant impact on the garbage collector's performance. Therefore the radix tree root marking phase does not dominate the execution time of the garbage collector's mark-and-sweep algorithm.

5.2.3  Impact of Block Sharing

One last interesting factor in the garbage collector's performance is the amount of sharing among VDIs in the system.
To evaluate the impact of sharing, an experiment was carried out in which twenty VDIs are cloned from the same 6 GB base image. In each run of the experiment a portion of each VDI is rewritten to break Copy-on-Write sharing across VDIs. The rewritten portion is varied from zero to one hundred percent of the base image size, in increments of twenty percent of the base image size.

Figure 5.4 plots the garbage collection execution time versus the level of divergence of the twenty VDIs from the base image. The execution time grows linearly in proportion to the level of divergence between VDIs. This result is not surprising, given that Section 5.2.1 showed that garbage collection execution time is linearly proportional to the size of live metadata in the blockstore. Parallax VDIs share metadata blocks in addition to data blocks, and VDI data divergence implicitly breaks the sharing of metadata blocks, thereby increasing the live metadata size and consequently the time required for garbage collection.

[Figure 5.4: Garbage Collector Performance versus Level of Sharing among Virtual Disk Images. Axes: VDI divergence (0 to 1) versus Garbage Collection Time (Seconds).]

5.2.4  Discussion

The following sections discuss several issues related to the garbage collector's evaluation. First the scalability of the current implementation is discussed. Next, the garbage collector's memory requirements are evaluated. Finally, the garbage collector's use of, and potential contention for, extent locks is considered.

Scalability

The numbers in Figure 5.1 indicate that the garbage collector can scan the blockstore's data at a rate of roughly 1.2 GB per second. This suggests that the garbage collection process is limited to processing just over 4 TB of live data for every hour it is run.

Due to the lack of realistic Parallax workloads, it is hard to evaluate the rate of storage consumption in such a system. There are two scenarios for running the garbage collector in a realistic setup. The first is running it in the background, concurrently with Parallax instances servicing I/O requests. This scenario is hard to evaluate, as it is hard to predict how the garbage collector's workload and the clients' workload will impact each other's performance. The second scenario is running the garbage collector during downtime or low-usage periods of Parallax. In this case the amount of work the garbage collector will be able to do is more predictable: approximately 4 TB of live blockstore data processed per hour. Assuming four hours of downtime or low usage per day, the amount of work the garbage collector can do is bounded by those hours and is limited to processing 16 TB of live blockstore data per day.

Before carrying on with the evaluation of the garbage collector, the notion of the block churn rate of a storage system needs to be defined. A churned block is a block that has been allocated (i.e., brought to life) and then left with zero references (i.e., freed). The churn rate of data blocks is the number of blocks churned within a period of time. In a storage system, this rate is proportional to the number of churned blocks and inversely proportional to the average lifetime of data blocks. A high churn rate occurs when many blocks are allocated and freed within short periods of time. Having defined the churn rate of blocks in a storage system, it is observed that this rate represents the consumption rate of storage blocks in Parallax. Therefore this rate is the determining factor in how often the garbage collector needs to be run.
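The daily scan budget implied by these numbers can be restated as simple arithmetic; the measured rate comes from the scalability discussion above, while the four-hour low-usage window is an assumption.

```python
# Back-of-the-envelope garbage collection budget. The 1.2 GB/s scan
# rate is the measured figure from Figure 5.1; the four-hour nightly
# window is an assumed maintenance period.

scan_rate_gb_per_s = 1.2
tb_per_hour = scan_rate_gb_per_s * 3600 / 1024  # about 4.2 TB per hour
window_hours = 4

# Rounding down to 4 TB/hour, as the text does, gives the 16 TB/day bound.
budget_tb_per_day = 4 * window_hours
```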
Assuming a worst case scenario where the whole blockstore is churned on a daily basis†, the maximum the Parallax garbage collector can do to keep up with this churn rate is to process 16 TB of blockstore storage per day. Conservatively speaking, this worst-case churn rate, combined with the garbage collector implementation developed in this work, limits the size of the Parallax blockstore to a few tens of terabytes (∼16 TB given the assumptions above).

† The worst case scenario of high blockstore churn rates is not unrealistic given the nimbleness and volatility of the virtual machines that are assumed to use Parallax.

Garbage Collection and Memory

The mark-and-sweep algorithm uses in-memory bitmaps to track the reachability of disk objects. This requires one bit of memory for each potential disk object in an extent. Given that the minimum disk object size is 4 kilobytes, 1 gigabyte of memory can track the reachability of approximately 32 terabytes of blockstore space. Modern servers usually have several tens of gigabytes of main memory, so memory is not expected to be the bottleneck for the current implementation of the garbage collector.

Garbage Collection and Extent Locks

The garbage collector presented in this work locks the extents being garbage collected for the whole duration of the garbage collection process. Given that garbage collection might run for long durations (e.g., hours for blockstores containing several terabytes of live data), it is worth addressing the lock contention that might arise. A Parallax instance takes hold of an extent lock so that it can allocate free blocks from that extent. As discussed earlier in Chapter 2, Parallax instances need to allocate blocks to serve writes to read-only shared blocks.
This allocation of blocks can be accomplished from any unlocked extent in the blockstore; the availability of unlocked extents with enough free space is therefore critical to the continuous operation of the system. It is necessary to ensure that Parallax instances will be able to operate during the execution of the garbage collector. This can be guaranteed by carrying out a check prior to running the garbage collector, ensuring that the extents not being garbage collected hold enough free space for the expected duration of the garbage collection process.

5.3  Free Space Compactor Performance

The time it takes for the free space compactor to coalesce free space in an extent depends on the distribution of free space in that extent. The time also depends on the compaction policy. As mentioned in Section 4.5.1, the "intact" compaction policy requires extra I/O operations to complete, which makes it slower than its peer, the "quick" compaction policy. Table 5.1 summarizes the average time the free space compactor has taken to coalesce the free space of synthetically fragmented 2 GB extents. The synthetically fragmented extents were obtained by running an I/O intensive benchmark on an ext3 file system installed on a Parallax VDI. A snapshot of the VDI was taken every second during the benchmark execution; at the end, every other snapshot was deleted and free blocks were reclaimed by running the garbage collector. The benchmarks used to create these fragmented extents are Bonnie, Postmark and a Linux kernel build.

  Fragmentation Type   Quick Compaction (seconds)   Intact Compaction (seconds)
  Bonnie               48.5                         50.6
  Postmark             43.2                         78.6
  Linux Build          34.1                         102.9

Table 5.1: Average free space compaction execution times for different synthetically fragmented blockstore extents.
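The per-extent times in Table 5.1 can be extrapolated to whole-disk compaction cost with a back-of-the-envelope calculation. The sketch below assumes that total compaction time scales linearly with the number of 2 GB extents and uses the worst measured intact-compaction time as the per-extent cost; both are simplifying assumptions, not measured behaviour.

```python
# Back-of-the-envelope extrapolation from Table 5.1. Assumes (a) one
# extent is 2 GB and (b) each extent takes the worst measured intact
# compaction time (102.9 s); real per-extent times vary with the
# fragmentation pattern, so this is only an order-of-magnitude bound.

EXTENT_SIZE_GB = 2
PER_EXTENT_SECONDS = 102.9   # worst intact-compaction time in Table 5.1

def compaction_hours(disk_size_gb):
    """Estimated hours to compact every extent of a disk of this size."""
    extents = disk_size_gb / EXTENT_SIZE_GB
    return extents * PER_EXTENT_SECONDS / 3600.0
```

For a 1.5 TB (1536 GB) disk this yields roughly 22 hours, consistent with the estimate discussed in the text.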
The current implementation of the compactor is not optimal, and there is still room for enhancing its performance. However, the fact remains that the compaction process is I/O intensive and would still be costly in terms of execution time. Given the numbers in Table 5.1, it is possible that compacting a 1.5 terabyte disk would take twenty hours to complete. This time will only get worse as disk capacities grow. On the positive side, it is possible that extent compaction could be carried out as part of processes such as disk scrubbing and backup operations, to eliminate redundant reads from the disk.

5.4  I/O Path Overhead

In this section we investigate the impact of the work presented in this thesis on the Parallax I/O path. The two sources of overhead introduced to the Parallax I/O path are the disk object indirection layer (see Section 4.1) and the modifications related to the free space compactor (see Section 4.5.2). Both of these overheads are analyzed in Sections 5.4.1 and 5.4.2. A third source of disturbance to the I/O path of Parallax clients is the workload generated by the maintenance utilities (i.e., the garbage collector and the free space compactor). The client and maintenance workloads contend for the limited bandwidth of the Parallax blockstore's storage resources. If these workloads exceed the blockstore's capabilities, the performance perceived by the storage system's clients will degrade. This last source of impact is ignored in this evaluation for the following reasons: it is assumed that I/O schedulers can prioritize client requests over those of maintenance utilities, and these utilities are often run in periods of low storage system activity, which minimizes the effects of contention over storage resources.
5.4.1  Disk Object Layer I/O Path Overhead

The disk object layer (see Section 4.1) is used to decouple data block addresses from their physical locations, thereby enabling data block movement without reference updates. The overhead takes the form of translating block addresses from one address space to another (physical to machine addresses in this work's vocabulary). This translation is required to service every I/O in the system. Figure 5.5 shows the performance of multiple I/O intensive benchmarks with the disk object layer in place. These benchmarks are run within a VM that uses a Parallax VDI as its storage backend. The performance measures are normalized to the same benchmark results without the disk object layer in place. The maximum time overhead that the indirection layer introduces is less than 5%; it is therefore fairly reasonable to consider this overhead minimal.

[Figure 5.5: Disk Object Layer Overhead. Normalized execution times (0.95–1.06) for the Sequential Write, Sequential Read, Postmark and Linux Build benchmarks.]

5.4.2  Compaction I/O Path Overhead

The overhead incurred on the I/O path during online free space compaction takes the form of replicating writes destined for moving blocks to both their source and destination locations on disk. This is done as part of a special commit protocol that ensures the consistency of extents as viewed from different Parallax instances. This overhead is transient, as it is only required during two of the four phases of extent free space compaction. A worst case analysis is carried out to bound the I/O path performance degradation due to online compaction. The use of a worst case analysis is due to the lack of any realistic I/O workloads that contain logs of clone, snapshot and delete operations on virtual disks in addition to read and write operations.
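The replicated-write behaviour described above can be sketched as follows. This is a minimal illustration, not the Parallax commit protocol: the class, field names and remap table are hypothetical, and the real protocol involves several phases and on-disk state.

```python
# Hypothetical sketch of write replication during online compaction.
# While an extent is in a copying/commit phase, a write aimed at a
# block that is being moved is applied to both its source and its
# destination location, so both copies stay consistent regardless of
# whether the compaction run commits or aborts.

class Extent:
    def __init__(self, size_blocks):
        self.blocks = [b""] * size_blocks
        self.remap = {}           # source block address -> destination
        self.replicating = False  # True during copying/commit phases

    def write(self, addr, data):
        self.blocks[addr] = data
        if self.replicating and addr in self.remap:
            # Replicate the write to the block's destination as well.
            self.blocks[self.remap[addr]] = data
```

In the worst case every block is being moved, so every write is doubled, which is the scenario the experiment below constructs.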
The experimental setup is designed to incur the worst possible degradation of performance on the I/O path. In our case, this worst case scenario is having to replicate every write request to two different locations on disk, which is achieved if every disk block is being moved by the compaction process. The worst case scenario is enforced in the experimental setting by carrying out the following steps. First, the VDI of the benchmarking machine is pre-filled in the blockstore in such a way that its blocks occupy only the second half of each extent, leaving all blocks in the first half of each extent free. The free space compaction utility is then run for each extent in the blockstore. The simplistic policy of the free space compaction utility will move every block in the second half of each extent to some location in the first half of the same extent. For the purposes of this experiment, these free space compaction runs are aborted right after they have marked an extent as being in the COMMIT phase. This sequence of actions ensures that all writes to the benchmarking VM's VDI will be replicated, and therefore achieves the intended worst case scenario.

[Figure 5.6: Copying and Commit Phases I/O Overhead. Normalized execution times (0–4) for the Sequential Write, Sequential Read, Postmark and Linux Build benchmarks.]

Four I/O intensive benchmarks are run inside the benchmarking VM, which uses a Parallax VDI as its backend storage. The performance of each benchmark is measured both in the normal scenario (i.e., prior to compaction) and in the worst case scenario (i.e., where each write operation is replicated to two disk locations). The levels of degradation in the performance of the benchmarks are shown in Figure 5.6. As expected, sequential writes are heavily degraded as write requests are replicated.
This increases the request count and breaks the linearity of write requests, which incurs extra seeks. On the other hand, sequential reads are barely impacted by online compaction, as the read I/O path is mostly left unchanged. As for the more realistic benchmarks, Postmark is slowed down to half of its execution speed while the Linux build is barely impacted. In summary, even though the impact on the I/O path can be heavy, the slowdown is only transient. Moreover, the experiments in this work are carried out on a single disk, which represents a worst case for handling replica writes in terms of performance. The performance of different RAID configurations, which are not uncommon in virtualized storage environments, might be less adversely affected by write request replication.

Chapter 6

Comparison with Earlier Work

A previous work on Parallax [2] investigated the implementation of garbage collection, defragmentation and deduplication services for Parallax. The work presented in this thesis attacks the first two of the aforementioned problems. The following sections contrast this earlier work with the one presented in this thesis.

6.1  Garbage Collector

The garbage collector implementation presented in this thesis is similar to the one presented in the earlier work. Both garbage collection implementations are variants of the mark and sweep algorithm with a finite pass count. In addition, both implementations use a per-extent approach to garbage collection, which helps scale the process past memory limitations. As an optimization over the garbage collector implemented in the previous work, a journal is appended to each extent that identifies the VDIs that have been allocated blocks from that extent. This helps limit the number of VDIs that need to be scavenged when searching for live blocks in the mark and sweep process.
However, as VDI blocks get dispersed over the extents of the blockstore, the benefits of such a scheme degenerate over time. One area the previous work did not explore is safely deleting storage abstractions (i.e., VDIs and snapshots). This work tackles the problem of coordinating the cloning and deletion of disks in a distributed environment; in the previous work such a mechanism was assumed to be in place.

6.2  Data Defragmentation

In the previous work blocks can be remapped across the store; this is required to defragment disks that might have blocks scattered over different extents. In this work remapping occurs at the extent level, as it is only concerned with defragmenting free space. Defragmenting virtual disks in the previous work was carried out in response to the need for contiguous free space for super pages. In contrast, in the work presented in this thesis free space defragmentation is a precautionary measure to minimize the effects of data fragmentation for some classes of workloads.

6.3  Fault Tolerance

The previous work introduces three locks per extent to accomplish garbage collection, remapping and free space defragmentation. It is my belief that this proliferation of locks would cause a lock management problem; in this work no additional locks have been introduced to Parallax. Furthermore, the introduction of locks that both the data management services (GC, remapper and defragmenter) and the I/O servers (Parallax instances) depend on introduces the possibility of deadlocking client VMs. In this work free space management utilities are considered lower priority processes than I/O servers: Parallax instances can continue operating in the case of faults that abort free space management processes.

Chapter 7

Related Work

The following sections discuss work related to that of this thesis. Section 7.1 presents earlier work that has introduced an additional block-level layer of indirection in storage systems.
Section 7.2 presents earlier work on garbage collection and free space management in storage systems.

7.1  Block-level Indirection Layer

Several systems use an indirection layer that translates disk block addresses from one namespace to another. Each of these systems had a different purpose for introducing this level of indirection; the following paragraphs discuss a selected set of them. The Logical Disk [5] separates disk and file management. It uses a location-independent naming scheme in which blocks are referred to by constant logical addresses. These logical addresses are translated into physical ones using block mappings. Data blocks can thus be moved on disk without the need to update the block references in file system structures, which allows disk managers to change the layout of data blocks on disk independently of the file systems sitting on top of them. The per-extent address indirection layer in Parallax serves a similar purpose to that of the Logical Disk: it relieves the system from searching for and updating references to a moving block. In the Logical Disk, however, the motive behind this indirection is more general. Clotho [7] implements block-level disk versioning. It maintains a list of block device versions that are created upon the user's request. These versions share physical blocks on the block storage using a Copy-on-Write technique. A map is used to translate disk logical addresses into physical blocks on the underlying storage. Parallax makes use of the logical-to-physical technique for a different purpose. Log-structured file systems [14] try to enhance the performance of writing to disks. They do so by batching up write requests in memory and then writing them out sequentially in large segments of disk. Blocks therefore end up being updated out of place, and the need arises for a translation layer to keep track of logical-to-physical block mappings.
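The common mechanism shared by the systems in this section can be sketched in a few lines. This is an illustrative minimum, with hypothetical names; each real system adds its own policies and on-disk representation on top of such a map.

```python
# Minimal sketch of a block-level indirection layer: a map from
# logical to physical block addresses that lets a block move on disk
# without invalidating the references held by the layers above it.

class BlockMap:
    def __init__(self):
        self._map = {}            # logical address -> physical address

    def install(self, logical, physical):
        self._map[logical] = physical

    def resolve(self, logical):
        # Consulted on every I/O, as in Parallax's disk object layer.
        return self._map[logical]

    def relocate(self, logical, new_physical):
        # Moving a block updates only this one entry; every reference
        # that names the logical address remains valid.
        self._map[logical] = new_physical
```

The cost of this flexibility is the per-I/O translation measured in Section 5.4.1; the benefit is that compaction and defragmentation never have to chase references.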
Similar in spirit to log-structured file systems, Loge [6], Mime [4] and the virtual log based file system for programmable disks [16] try to optimize the performance of small synchronous writes. They do so by writing data blocks to the free space closest to the read/write head of the hard disk. Consequently, the block addresses exposed to users of the disk do not reflect location and differ from the physical addresses that do; the need arises for a map that resolves logical addresses into physical ones. The Logical Volume Manager [8] makes the management and aggregation of block storage devices easier. It aggregates physical volumes into volume group abstractions, which are sliced up into logical volumes exposed to storage consumers as raw block devices. Device mapper [1] is used to map logical volume regions to physical volume disk space. Redundant Arrays of Inexpensive Disks [13] aim at parallel I/O execution to enhance performance, redundancy in the face of disk failures, or a combination of both. This functionality can be achieved using specialized hard disk controllers or software drivers, including a device mapper [1]. Conceptually, a 1-to-N map is used to map each block to multiple physical locations on different disks. However, the mappings in RAID systems are so simple that they can be represented as functions computed on the fly instead of being saved explicitly on disk. Petal [10] aggregates storage across network-connected machines. Like LVM, it aims to ease the management of storage. It implements redundancy using chained declustering and serves virtual disks to storage consumers. Three layers of translation are used to resolve the physical location of a virtual disk block, the last of which maps logical block addresses to physical block addresses. However, the reason for having this map is simply
for resolving virtual block addresses rather than providing the flexibility to move them on disk.

7.2  Garbage Collection and Free Space Management

WAFL [9] is a file system used by NetApp storage appliances. The file system supports a limited number of snapshots. WAFL maintains a fixed liveness bitmap for each disk block to track the liveness of data blocks. Each bit in this bitmap corresponds to a snapshot of the file system and determines whether that snapshot has a reference to the block. When a snapshot is deleted, all blocks referred to by the snapshot have their corresponding bit reset in the liveness bitmap. When all bits corresponding to a certain block are masked out, the block is considered free; the liveness bitmap therefore doubles as an allocation bitmap. Parallax free space reclamation is more complicated than that of WAFL; however, Parallax does not limit the number of disk snapshots. ZFS [3] provides CoW-based snapshot and cloning primitives similar to those of Parallax. It timestamps block references so that it can determine the snapshot that originally created the block being referred to. This makes it easier to identify blocks that can be reclaimed on a snapshot deletion. The cost of collecting garbage is proportional to the amount of change to the storage object after the snapshot was taken. Recent work has shown promise in efficiently keeping track of back references in write-anywhere file systems [11]. However, this work targets single-host systems as opposed to distributed ones. Having a similar ability in distributed systems could benefit Parallax, as it would obviate the need to scavenge the blockstore searching for free blocks to reclaim. Log-structured file systems [14] divide the disk space into relatively large segments. Updates to blocks are done out of place; thus blocks that have been rewritten can be reclaimed as free space. A free space compaction
process copies the live data out of these fragmented segments and writes it out to free ones. This is similar to Parallax's free space compaction, except that the compaction process in Parallax does not copy data from one extent (i.e., compaction unit) to another.

Chapter 8

Conclusion

This work presents a garbage collector and a free space compaction utility for the Parallax storage system. These utilities are shown to be efficient, in that they do not take considerable time to run, and scalable, in that they can manage blockstores of a few tens of terabytes in size. The impact of the modifications made to the Parallax storage I/O path is shown to be minimal. The only exception is during one phase of the online compaction process; however, even this I/O overhead is bearable given its transient nature. The garbage collector and free space compaction utilities are implemented with high availability and reliability guarantees in mind. The high availability guarantee is addressed by implementing the utilities to work in parallel with the servicing of I/O requests, while fault tolerance is addressed by implementing the free space compaction process using a special commit protocol. This ensures the robustness of the Parallax storage system in the face of the network and node failures that are common in distributed systems.

Chapter 9

Future Work

The work presented in this thesis represents a step in the direction of making Parallax a realistic storage system choice. It adds free space reclamation and management facilities. However, like any other piece of research work, it opens the door to new problems. The rest of this chapter presents some of the problems and research areas this work brought to light. The garbage collector's evaluation (see Section 5.2) suggests that the current garbage collector implementation does not scale well under worst case scenarios of storage consumption.
Given these assumptions, the Parallax blockstore size is limited to a few tens of terabytes at most. Given today's ever-growing virtualized deployments in the cloud, this might render Parallax unusable for larger deployments. Further research needs to be carried out to determine whether such pessimistic scenarios exist and to what extent they would hinder the adoption of Parallax. An alternative to mark and sweep algorithms for garbage collection would be maintaining reference counts. Maintaining reference counts in a distributed storage system is often challenging, as high I/O performance stands in conflict with ensuring reference count consistency. For garbage collection purposes, however, reference counts need not be strictly consistent. Reference count updates can be logged offline and later aggregated at a centralized garbage collection service, which in turn frees blocks whose reference counts have dropped to zero. Because the aggregated reference counts are not up to date, there is a danger that such a reference count based garbage collector might free valid blocks. This situation can be avoided by tagging blocks with time stamps on allocation, so as to avoid freeing recently allocated valid blocks whose corresponding reference count updates have not yet reached the garbage collection service. Extending the idea of tracking reference counts, it would be useful to track back reference information in full. This information would help implement smarter data defragmentation utilities that are better informed about blocks' virtual and physical locations. Tracking back references efficiently for a single-host storage system is challenging, but recent work [11] has shown that it is possible. It remains an open question whether a similar technique can be applied to distributed storage systems in general and Parallax in particular.
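The lazily aggregated reference-count collector proposed above can be sketched as follows. This is a speculative sketch of the idea, not an implementation: the class, the grace-period mechanism, and all names are hypothetical, and a real distributed version would need durable logs and failure handling.

```python
# Hypothetical sketch of a centralized, lazily consistent refcount GC.
# Instances ship (block, +1/-1) deltas asynchronously; the service
# aggregates them and frees blocks that reach zero, but only once a
# grace period has passed since allocation, so a block whose increment
# has not yet arrived is never freed prematurely.

import time

class RefCountGC:
    def __init__(self, grace_seconds=60.0):
        self.grace = grace_seconds
        self.counts = {}     # block -> aggregated reference count
        self.allocated = {}  # block -> allocation timestamp

    def on_allocate(self, block, now=None):
        now = time.time() if now is None else now
        self.allocated[block] = now
        self.counts[block] = self.counts.get(block, 0) + 1

    def apply_log(self, deltas):
        """Apply a batch of (block, delta) updates shipped offline."""
        for block, delta in deltas:
            self.counts[block] = self.counts.get(block, 0) + delta

    def collect(self, now=None):
        """Free blocks at zero references that are past the grace period."""
        now = time.time() if now is None else now
        freed = [b for b, c in self.counts.items()
                 if c <= 0 and now - self.allocated.get(b, now) > self.grace]
        for b in freed:
            del self.counts[b]
            self.allocated.pop(b, None)
        return freed
```

Note that a block with an unknown allocation time is deliberately spared, which errs on the side of leaking space rather than freeing live data.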
The free space defragmentation utility prototype developed in this work is simplistic and suboptimal in terms of performance. Many optimizations are possible to enhance the performance of such a utility. For one, the block recopy protocol can be optimized while still maintaining its correctness. Another optimization would take the form of heuristics that increase the algorithm's utility, such as restricting defragmentation to the extents that contain more free space than others. Another heuristic is based on the observation that contiguity of data is only useful up to a certain level (e.g., 2 MB of contiguous blocks on modern magnetic disks); coalescing free space into buckets of at most 2 MB would minimize the amount of I/O that needs to be done and consequently speed up defragmentation. The work presented in this thesis suffers from the lack of realistic benchmarks to both drive the design and evaluate the implementation. One future direction would be hosting an experimental virtualized environment that relies on Parallax as its backend. An important goal of such an endeavour is to observe the performance of Parallax in a realistic setting. This would help create a better sense of the real problems and challenges that face a system similar to Parallax. One question that is particularly interesting to address is: what is the impact of disk fragmentation in a system such as Parallax on virtual machines' performance? One last related future venture is building a reliable, huge distributed blockstore for Parallax. In its current state, Parallax assumes a reliable and huge blockstore backend that can be accessed over the network. Enterprise grade storage can be used to serve this purpose; however, this solution is both expensive and not as elastic as Parallax in terms of scalability.
We believe that a Petal [10] inspired storage service aggregating inexpensive disks across the network would be more suitable for Parallax. This disk aggregation service is under development, as of the time of writing, in the Networks System and Security lab. We hope this service will be available for Parallax deployments in the near future.

Bibliography

[1] Device Mapper. Website, 2011. http://sourceware.org/dm/.

[2] G. Aggarwal. Metadata services for the Parallax storage system. Master's thesis, The University of British Columbia, 2008.

[3] J. Bonwick and B. Moore. ZFS: The last word in file systems. 2011. http://www.sun.com/software/solaris/zfs_lc_preso.pdf.

[4] C. Chao, R. English, D. Jacobson, A. Stepanov, and J. Wilkes. Mime: A high performance parallel storage device with strong recovery guarantees. Technical report, 1992.

[5] W. De Jonge, M.F. Kaashoek, and W.C. Hsieh. The logical disk: A new approach to improving file systems. In Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles, pages 15–28. ACM, 1994.

[6] R.M. English and A.A. Stepanov. Loge: A self-organizing disk controller. In Proceedings of the Winter 1992 USENIX Conference, pages 237–251, 1992.

[7] M.D. Flouris and A. Bilas. Clotho: Transparent data versioning at the block I/O level. In Proceedings of the 12th NASA Goddard / 21st IEEE Conference on Mass Storage Systems and Technologies (MSST 2004), pages 315–328, 2004.

[8] M. Hasenstein. The Logical Volume Manager (LVM). SuSE Inc., 2001.

[9] D. Hitz, J. Lau, and M. Malcolm. File system design for an NFS file server appliance. In Proceedings of the USENIX Winter 1994 Technical Conference, page 19. USENIX Association, 1994.

[10] E.K. Lee and C.A. Thekkath. Petal: Distributed virtual disks. ACM SIGOPS Operating Systems Review, 30(5):84–92, 1996.

[11] P. Macko, M. Seltzer, and K.A. Smith. Tracking back references in a write-anywhere file system.
In Proceedings of the 8th USENIX Conference on File and Storage Technologies, page 2. USENIX Association, 2010.

[12] D.T. Meyer, G. Aggarwal, B. Cully, G. Lefebvre, M.J. Feeley, N.C. Hutchinson, and A. Warfield. Parallax: Virtual disks for virtual machines. In Proceedings of the 3rd ACM SIGOPS/EuroSys European Conference on Computer Systems 2008, pages 41–54. ACM, 2008.

[13] D.A. Patterson, P. Chen, G. Gibson, and R.H. Katz. Introduction to redundant arrays of inexpensive disks (RAID). In Spring COMPCON 89, pages 112–117, 1989.

[14] M. Rosenblum and J.K. Ousterhout. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems (TOCS), 10(1):26–52, 1992.

[15] M. Vrable, J. Ma, J. Chen, D. Moore, E. Vandekieft, A.C. Snoeren, G.M. Voelker, and S. Savage. Scalability, fidelity, and containment in the Potemkin virtual honeyfarm. ACM SIGOPS Operating Systems Review, 39(5):148–162, 2005.

[16] R.Y. Wang, T.E. Anderson, and D.A. Patterson. Virtual log based file systems for a programmable disk. Operating Systems Review, 33:29–44, 1998.

[17] A. Warfield, R. Ross, K. Fraser, C. Limpach, and S. Hand. Parallax: Managing storage for a million machines. In Proceedings of the 10th Conference on Hot Topics in Operating Systems, Volume 10, page 4. USENIX Association, 2005.