Copy-on-write in Mammoth

by

Shihao Gong

M.E., Tsinghua University, China, 1999

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in The Faculty of Graduate Studies (Department of Computer Science)

We accept this thesis as conforming to the required standard

The University of British Columbia
January 2003
© Shihao Gong, 2003

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Computer Science
The University of British Columbia
Vancouver, Canada

Abstract

Mammoth is a versioned peer-to-peer storage system with a traditional file system interface. Files are replicated selectively on several nodes based on per-file policies to protect against site failure. A set of interested nodes may cache the file data to read and write. To increase availability during network partitions, Mammoth allows branches to occur. Consistency is achieved by separating file history logs and file data. All file data is kept as immutable versions. File metadata is propagated eagerly among interested nodes.

However, keeping every file version in its entirety may hurt performance when versioning and replicating large files with minor modifications. Thus a copy-on-write (COW) mechanism is integrated with the Mammoth file system. It preserves all the features of Mammoth's original design. The basic idea of COW is to use an in-core bitmap to record which blocks are modified during an update session. When versioning the file, only the new data of the changed blocks (the delta) is copied into a dfile and the changed block list is written to an ifile. When updating a copy on a remote node, the system scans the ifiles created during this interval and gets a complete list of changed blocks. Then only the delta is sent through the network. File restoration is achieved by applying deltas in reverse.

We did a comparison test between Mammoth and Mammoth+. It shows that Mammoth+ reduces versioning overhead and replication overhead dramatically for large files with a small update size. For example, when only 4KB changed in a 64MB file, the versioning overhead of Mammoth is 4565 times that of Mammoth+, and the replication overhead of Mammoth is 17812 times that of Mammoth+.

Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgements
1 Introduction
  1.1 Motivation
  1.2 Challenges for COW in Mammoth
  1.3 Methodology
  1.4 Thesis Organization
2 Mammoth File System
  2.1 File Metadata
  2.2 File Versioning
  2.3 File Replication
  2.4 System Architecture
3 Background of Delta Compression
  3.1 Delta Chains
  3.2 Delta Generation
  3.3 Delta Transmission
4 Design
  4.1 Overview
  4.2 File Versioning
    4.2.1 Granularity of Delta Compression
    4.2.2 Delta Chaining Direction
    4.2.3 Bitmap vs. Block List
    4.2.4 Creating a New Version
  4.3 File Replication
    4.3.1 Reading File History
    4.3.2 Calculating Changed Blocks
    4.3.3 Discussion
  4.4 File Restoration
  4.5 Communication Protocol
  4.6 Summary
5 Evaluation
  5.1 Performance of File Versioning
  5.2 File Replication Performance
  5.3 File Restoration Performance
  5.4 PostMark Benchmark Performance
  5.5 Summary
6 Conclusions and Future Work
  6.1 Conclusions
  6.2 Future Work
Bibliography

List of Tables

2.1 Common file metadata
2.2 File version log entry
4.1 Comparison between forward and backward delta
4.2 Operation cost of bitmap and block list
4.3 Added fields to the common file metadata
4.4 Message header for transferring data and block list
5.1 The interleave depth
5.2 File versioning overhead
5.3 Replication overhead on the sending node with a warm cache
5.4 Replication overhead on the sending node with a cold cache
5.5 File restoration time

List of Figures

2.1 The layout of the file system
2.2 The system architecture of Mammoth
4.1 The new layout of the file system
4.2 Delta file format
4.3 Sorting file history
5.1 File versioning overhead vs. update size or file size
5.2 File versioning overhead for 64KB files
5.3 File versioning gain for COW over non-COW
5.4 Overhead of calculating the changed blocks with a warm cache
5.5 Replication overhead with a warm cache
5.6 Overhead of calculating the changed blocks with a cold cache
5.7 Replication overhead with a cold cache
5.8 File restoration overhead

Acknowledgements

First of all, I would like to thank Dima Brodsky, who has worked on the Mammoth project from its beginning and continues to work on it. He helped me understand the project and inspired this thesis. The previous work of Jody Pomkoski on this project also built the foundation of this work. Thanks definitely go to my supervisor, Dr. Michael Feeley. Without his supervision it would have been impossible to finish this work; I especially appreciate his patience and encouragement throughout. I am also grateful to Dr. Norm Hutchinson for his valuable comments.

SHIHAO GONG

The University of British Columbia
January 2003

Chapter 1

Introduction

1.1 Motivation

With the increase of disk capacity and network bandwidth, two trends appear in file system design. One trend is to retain multiple versions of the same file on disk to protect it from user mistakes. The other is to replicate the same object, or an entire file system, on several network nodes to protect against site failure, balance load, or increase availability.

However, disk operations and network bandwidth are still the bottleneck for a file system compared to CPU and memory speed. Traditional backup technologies (tertiary storage backup) have been found useful for reducing disk storage and network bandwidth usage. Checkpointing file systems [17, 7] only back up those files or blocks changed between two checkpoints. Later systems [20] assign finer-grained backup policies to individual files or volumes.

Most systems exploit copy-on-write (COW) techniques to generate checkpoints, thus never overwriting the original copy. Multiple versions can share the same blocks if these blocks are not modified in the creation of these versions.
When a block is modified, the system makes a copy of the original block and applies the write to the copy, thus allowing all previous versions to continue to share the old block. Recently, a delta compression algorithm [2] has been used for backup and distribution. It encodes a file version as the difference (delta) from an existing version at the byte level and is effective for both binary files and text files. Software revision control systems [22, 19] provide a way to manage multiple versions and reduce their storage requirement.

Previous work suggests that reducing the amount of data stored on disk and transmitted over the network in a distributed versioned file system is important. A recent peer-to-peer file system, Mammoth [1, 16], developed at the University of British Columbia, calls for efficient version compression and distribution. Though file-grain retention and replication policies are supported in Mammoth, storing and transmitting file versions in their entirety is undesirable, especially for large files. Studies [26] indicate that file size is increasing and that most accesses to large files only involve a small amount of data [26, 14]. So there is a good opportunity to improve large file performance through version compression.

In this thesis, we present a copy-on-write (COW) mechanism that is fully integrated with Mammoth. We evaluate how version creation and distribution can benefit from this approach.

1.2 Challenges for COW in Mammoth

Although previous work on Mammoth built the foundation for version compression and distribution, integrating COW in a distributed file system like Mammoth introduces special challenges.

To protect data from user mistakes, every version is kept on the creation node (owner node) in Mammoth. Without replication, COW could be implemented in the same way as in Elephant [20]. However, Mammoth replicates a file's data based on a user-specified replication policy to increase availability and sustainability. These policies usually place an upper bound on the interval between two replication points. Furthermore, file data can also be cached on a set of nodes (interested nodes) to exploit access locality. This implies that a replica node or an interested node may not contain all of the versions. To keep a file's data up-to-date, the owner node must be able to compute the delta between arbitrary versions. For example, suppose the owner node contains five consecutive versions of a file starting from the first version, a replica node stores the first version of the same file, and an interested node caches the second version of the same file. To bring all of them up-to-date, the owner node should send the delta between the first and the fifth version to the replica node, and the delta between the second and the fifth version to the interested node.

If all versions are available on the owner node, the delta can be computed locally. However, this is not always the case. In Mammoth, a file's owner is responsible for ensuring the consistency of file data, using a multiple-reader/single-writer lock. To increase write performance, file ownership is transferred with the write lock. When transferring ownership, the system only guarantees that the new owner has the most up-to-date version. If the new owner does not have the version that is already replicated on a replica node, a collaborative effort may be needed to compute the delta. For example, initially node A creates a file "foo" and replicates it to node B. Then it creates the second and third version of "foo".
Later, node C requests ownership to update "foo". Node A transfers the data of the third version to node C. Node C creates the fourth version based on the third version. Now, based on the replication policy for "foo", node C decides to replicate the fourth version to node B. Since the most up-to-date version of "foo" on node B is the first version, node C needs to compute the delta between the first and fourth version. However, node C can only compute the changed blocks between the third and fourth version locally. Thus, node C sends a request to node A to compute the changed blocks between the first and third version. Then node C computes the delta between the first and fourth version.

1.3 Methodology

This thesis fully integrates COW with Mammoth. The new system is called Mammoth+. It exploits a per-file, block-level delta compression mechanism to support fast versioning and reduce the bandwidth consumed when transmitting deltas through the network. The main goal of this work is to satisfy the following conditions:

• Fully integrate copy-on-write with Mammoth.

• Reduce storage space when storing a set of file versions.

• Reduce versioning overhead, thus improving client response time.

• Support arbitrary replication policies. Only changes made between replication points are transmitted to the replica nodes, thus improving network usage and reducing disk operations.

• Support caching on arbitrary nodes. Nodes can request any version and only the changes are transferred.

• Support arbitrary file formats. The mechanism is not bound to any specific application.

• Improve large file performance while having little impact on small files.

• Support reconstruction of arbitrary versions.

1.4 Thesis Organization

The rest of this thesis is structured as follows. Chapter 2 describes the background of the Mammoth project. In Chapter 3 we discuss related work in delta compression. We present the design of Mammoth+ in Chapter 4. The performance of Mammoth+ is presented in Chapter 5. In Chapter 6, we present our conclusions and possible future work.

Chapter 2

Mammoth File System

Before proceeding to discuss COW in Mammoth, we first review Mammoth without COW. Mammoth is a peer-to-peer, hierarchical file system that supports file-grain replication. Each directory and file is stored on at least one node and may be replicated on other nodes based on per-object policies. Consistency is ensured using locks during normal connectivity and switches to an optimistic approach when failures occur. Nodes are allowed to access whatever data is currently available. Eventual consistency is simplified by storing directory and file metadata as logs and by storing file data as a collection of immutable versions. Write conflicts are represented in the metadata as version history branches, and must be resolved by higher-level software or the user.

2.1 File Metadata

File data is stored as a journal of immutable file versions in Mammoth to protect against human mistakes, as in Elephant [20]. A file's metadata tracks these versions so that they can be located when needed. Both file data and metadata are referenced symbolically so that they can be replicated separately. Therefore consistency issues are confined to the file's metadata. The layout of file data and metadata is shown in Figure 2.1.
shgong
  RE
    Paper
      thesis.tex
      thesis.bib
  SH
    root.dir.md
    root.dir.pol
    Paper
      paper.dir.md
      paper.dir.pol
      thesis.tex.file.md
      thesis.tex.file.pol
      thesis.tex.ver
        thesis.tex.nd1-t1
        thesis.tex.nd2-t2
        thesis.tex.nd1-t3
      thesis.bib.file.md
      thesis.bib.file.pol
      thesis.bib.ver
        thesis.bib.nd1-t1'
        thesis.bib.nd1-t2'

Figure 2.1: The layout of the file system.

For each file in the file system, there are two metadata files and a directory to store file versions in the shadow directory (SH). Files with the suffix .md store the common metadata and the file version log. Files ending with .pol store the interest set, the reader locks, the replica set, and the policies for replication (described in Section 2.3). Directories ending with .ver store the versions of the file. The most recent version is stored in the RE directory.

A file's metadata comprises two parts. The first part is called the Common File Metadata, which stores general file state information rather than information about a specific version of a file. Mammoth relies on the underlying file system to manage file attributes like protection information, UID, GID, and access time. Only Mammoth-specific information, such as who is the current owner of the file, the last known version of the file, and whether the file is up to date, is stored in the Common File Metadata. Table 2.1 shows the fields and their purpose. This structure is persistent and is stored as the header of the file version log.

Name         Purpose
cur_owner    Current owner of the file
deleted_fl   Has the file been deleted
owner_fl     Is this node the owner
rlock_fl     Has a read lock been acquired
current_fl   Is the file data up to date
verdir_fl    Has the version directory been created
symlink_fl   Is it a symlink
last_node    Creation node of last version
last_time    Creation time of last version
last_seq     Sequence number of last version
branch_node  Creation node of last branch
branch_time  Creation time of last branch
filehist_c   Number of file history entries

Table 2.1: Common file metadata.

The second part of the metadata is called the File Version Log. The file version log operates as an append-only log; each time a file is modified, an entry with the information in Table 2.2 is appended to the file's version log. Each entry is particular to a version of a file. An entry describes the created version, when and where it was created, the operation that created it, which history branch it belongs to, the generation number in its branch, and any additional information that may be necessary. For example, with the replication operation, the additional data contains the ID of the replica node.

Name       Purpose
operation  The operation that created this version (e.g., create, update, remove, replication)
cr_node    Creating node of this version
cr_time    Creating time of this version
cr_num     Sequence number of this version
br_node    Creating node of this branch
br_time    Creating time of this branch
br_num     Sequence number of this branch
datalen    Length of additional data
data       Additional data

Table 2.2: File version log entry.

A file's interest set is a list of nodes that store the metadata of the file. One node in the interest set is the file's owner and is the only node allowed to modify the file. To exploit update locality, the system allows ownership transfer. The concept of ownership synchronizes the updates to a file. When the owner is unavailable due to failure, a new owner is elected. In this case, if update conflicts occur, they are stored as a branch in the version history of the file. Mammoth leaves the resolution of conflicts to higher-level software or the user. To ensure metadata consistency, Mammoth propagates metadata updates eagerly to the interest set. It uses a two-phase-commit-like process (2PC) to ensure eventual consistency.
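The thesis does not give concrete declarations for these records. As a rough illustration only, the following C sketch shows how the common file metadata and a version log entry might be laid out, using the field names of Tables 2.1 and 2.2; the fixed-width types, flag widths, and the node-ID and timestamp representations are assumptions, not Mammoth's actual definitions.

```c
#include <stdint.h>

/* Hypothetical node identifier and timestamp types; Mammoth's real
 * definitions are not given in the text. */
typedef uint32_t node_id_t;
typedef uint64_t mtime_t;

/* Common file metadata (Table 2.1), stored as the header of the
 * file version log. */
struct common_md {
    node_id_t cur_owner;    /* current owner of the file              */
    uint8_t   deleted_fl;   /* has the file been deleted              */
    uint8_t   owner_fl;     /* is this node the owner                 */
    uint8_t   rlock_fl;     /* has a read lock been acquired          */
    uint8_t   current_fl;   /* is the file data up to date            */
    uint8_t   verdir_fl;    /* has the version directory been created */
    uint8_t   symlink_fl;   /* is it a symlink                        */
    node_id_t last_node;    /* creation node of last version          */
    mtime_t   last_time;    /* creation time of last version          */
    uint32_t  last_seq;     /* sequence number of last version        */
    node_id_t branch_node;  /* creation node of last branch           */
    mtime_t   branch_time;  /* creation time of last branch           */
    uint32_t  filehist_c;   /* number of file history entries         */
};

/* One entry of the append-only file version log (Table 2.2). */
struct version_entry {
    uint32_t  operation;    /* eFOCreate, eFOUpdate, eFORemove, ...   */
    node_id_t cr_node;      /* creating node of this version          */
    mtime_t   cr_time;      /* creating time of this version          */
    uint32_t  cr_num;       /* sequence number of this version        */
    node_id_t br_node;      /* creating node of this branch           */
    mtime_t   br_time;      /* creating time of this branch           */
    uint32_t  br_num;       /* sequence number of this branch         */
    uint32_t  datalen;      /* length of additional data              */
    /* datalen bytes of additional data follow the entry, e.g. the ID
     * of the replica node for a replication entry. */
};
```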
2.2 File Versioning

In Mammoth, a file version is defined by an open and close call. When a file f_i is first created, the two metadata files, SH/.../f_i.md and SH/.../f_i.pol, are instantiated. An entry is added to the file version history. Its operation field is set to eFOCreate and its seq_num is set to 1. The create_node and branch_node are set to the node address of the current node; create_time and branch_time are set to the physical clock at this time. The directory SH/.../f_i.ver/ is not created at this time.

When a file is modified after creation, the system first checks whether it has been versioned before. If not, the directory SH/.../f_i.ver/ is created first. Then Mammoth copies the file f_i to SH/.../f_i.ver/f_i.n_j.t_m, where n_j is the node that currently owns f_i and t_m is the creation time of the last version. If the open is called with O_TRUNC, then f_i is moved and an empty file is created in its place. Subsequent writes are applied directly to f_i. When Mammoth receives a close call, it creates and appends an entry to the file version log. The entry's operation field is set to eFOUpdate, meaning that the file was simply changed. The field cr_num is incremented by one and the other fields are set as for file creation, except for the two branch fields. If no branches occur, these two fields are exactly the same as in the parent version. Otherwise, they are set to the parent version's create_node and create_time.

Removing a file is similar to modifying and versioning a file, except that a new file is not created and the file is moved to the shadow directory. The operation field is set to eFORemove. The rename operation is the most complex operation. The system breaks it up into a source and a destination operation. The source operation is similar to the remove operation; its operation field is set to eFOMove_src. The destination operation covers three possible cases. If a file exists at the destination, it is versioned as if it were removed. If a file has never existed at the destination, the metadata files are instantiated as in the create case. If a file existed but had been deleted, the scenario is treated in the same way as a create. In all cases the operation field is set to eFOMove_dest.

2.3 File Replication

An important goal of Mammoth is to ensure high availability and to protect data from all forms of failure. It uses selective replication to protect users from site failures. Since not all files are created equal, not all require the same level of protection against failure. A per-file replication policy is designed to address this problem. The policies have the general form of placing an upper bound on the hours of work that could be lost in a given failure. This avoids replicating all versions and performing full file system backups. The system then implements an object's replication policy using a background process that tracks recently modified objects to determine when they should be replicated and to which nodes.

In a non-copy-on-write implementation, the file replication operation is very simple. The sending node sends a full copy of a version to the replica node. On success it updates the last replication time for the file and adds an entry to the file version log. The entry's operation field is set to eFOReplicate and the replica's node ID is put into the data field. On the replica node, the file versioning operation is almost the same as on the owner node, with one exception.
Since the file is transferred in its entirety, the replica node simply moves the old version to the shadow directory and creates a new one in the current file directory.

2.4 System Architecture

The system is composed of six main components, as shown in Figure 2.2. The Front End accepts client requests. It interacts with the Name Service, Version Control, and Metadata Manager components to fulfill client requests. The Name Service performs name resolution for Mammoth: given a fully qualified path name, it returns the server name where the owner of the requested file or directory resides. Replica Control is responsible for file replication. Version Control performs the functions for versioning. The Metadata Manager is responsible for propagating metadata updates, thereby maintaining consistency in Mammoth. The Liveliness Monitor actively monitors the status of other nodes, providing failure detection for the system. There is also a Cleaner thread (not shown in Figure 2.2), which periodically removes unwanted versions and compacts version histories.

[Figure 2.2: The system architecture of Mammoth, showing the Front End, Name Service, Liveliness Monitor, Metadata Manager, Version Control, and Replica Control components.]

The current prototype of Mammoth is a user-level NFS [3] server. The client is a modified version of NFS version 2 that adds explicit open and close calls to the server. A version is created each time the server receives a close call after a file has been opened for writes.

Chapter 3

Background of Delta Compression

As described in the previous chapter, the original design of Mammoth stores and transmits all file versions in their entirety. This has several drawbacks for large files. First, it hurts performance when storing the full version on disk. Second, it wastes network bandwidth when transmitting a version. Third, it wastes storage space. To enhance its performance and storage utilization, a copy-on-write (COW) mechanism is implemented in Mammoth. This design is largely influenced by previous work on delta compression, which includes storing a list of versions as a delta chain and transmitting deltas over the network. This chapter provides an overview of previous work in this area. It first discusses the directions of delta chains. Then it reviews different approaches to generating a delta. Finally, several systems that support delta transmission are presented.

3.1 Delta Chains

One of the primary goals of version control systems is to store multiple versions of a file efficiently. Source code common to more than one version is not duplicated. The Source Code Control System (SCCS) [19] stores the first version as a full copy and subsequent versions as deltas between two adjacent versions. For example, suppose we have a series of file versions

    V_1, V_2, ..., V_{i-1}, V_i, V_{i+1}, ...

where V_i denotes a full copy of version i. SCCS will produce the chain

    V_1, Δ_{(1,2)}, ..., Δ_{(i-1,i)}, Δ_{(i,i+1)}, ...

where Δ_{(i-1,i)} denotes the difference between V_{i-1} and V_i. This is a forward delta chain. To reconstruct an arbitrary version V_i, the algorithm must copy the first version V_1 and then apply all deltas from 2 to i. The restoration time grows linearly with the number of intermediate versions. So reconstructing the most recent version is costly.
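To make the linear reconstruction cost concrete, the following sketch rebuilds version i from a forward chain by starting with the full copy V_1 and applying every intermediate delta in order. The block-oriented delta structure is invented for this illustration (SCCS actually works on lines of text), and error handling and partial trailing blocks are ignored.

```c
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE 4096

/* A forward delta: the blocks whose contents changed between two
 * adjacent versions.  This layout is illustrative only. */
struct fdelta {
    size_t   nblocks;               /* number of changed blocks        */
    size_t  *blkno;                 /* their block numbers             */
    char   (*data)[BLOCK_SIZE];     /* their new contents              */
    size_t   new_size;              /* file size of the newer version  */
};

/* Reconstruct version `target` (1-based).  `buf`/`size` initially hold
 * the full copy V_1; deltas[0] is D(1,2), deltas[1] is D(2,3), and so
 * on.  Every delta up to the target must be applied, so restoration
 * time grows linearly with the number of intermediate versions. */
static char *apply_forward_chain(char *buf, size_t *size,
                                 const struct fdelta *deltas,
                                 size_t target)
{
    for (size_t v = 1; v < target; v++) {
        const struct fdelta *d = &deltas[v - 1];   /* D(v, v+1) */
        if (d->new_size > *size)                   /* file grew */
            buf = realloc(buf, d->new_size);
        for (size_t i = 0; i < d->nblocks; i++)
            memcpy(buf + d->blkno[i] * BLOCK_SIZE, d->data[i], BLOCK_SIZE);
        *size = d->new_size;                       /* file may also shrink */
    }
    return buf;
}
```

With a reverse chain, discussed next, the analogous loop would start from the full copy of the newest version and walk backwards, which favours the recently created versions that are accessed most often.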
Other version control systems [22] use a reverse delta chain to solve this problem. Their decision is based on the observation that most applications tend to access the most recent versions more frequently than older versions. The reverse delta chain is organized as follows:

    Δ_{(2,1)}, Δ_{(3,2)}, ..., Δ_{(i+1,i)}, ..., V_n

Each time a new version is created, the system computes the delta and also stores a full copy of the new version. To reconstruct an older version, it starts from the last version and applies the deltas in reverse.

However, traversing several deltas to restore a version is time-consuming, especially when the versions are stored on a backup medium. A version jumping delta chain [2] was developed for file system backup. It occasionally inserts a whole file into the repository, and each delta is computed between the reference file and the version it represents. The organization of this chain is:

    V_1, Δ_{(1,2)}, Δ_{(1,3)}, ..., Δ_{(1,i-1)}, V_i, Δ_{(i,i+1)}, ...

In some systems, all reference files and deltas are stored in one file with special file structures [22, 19]. Others store each reference file and delta as separate files [12, 2].

3.2 Delta Generation

A delta is usually generated at one of three different granularities. Some database systems [25] and file backup systems [20] operate at the block level. The block size is usually the block size of the underlying file system. A bitmap is used to represent which blocks are changed during the modification of the file. Both forward delta chains and reverse delta chains can be used. For a forward delta, a full copy of the original file is made and subsequent updates are applied in place. When versioning the file, only the modified blocks are copied into the new delta. The idea of a reverse delta is that when an old block is overwritten, the system actually allocates a new block, copies the old value, and then overwrites the newly allocated block. It then modifies an inode-like structure to point to the new block. To preserve the information necessary to construct the full history, additional metadata is stored elsewhere. A potential drawback of this approach is that it reduces the effectiveness of buffer-cache write absorption, thus increasing the number of disk writes [20, 17]. A similar idea is used in virtual memory management [5, 18]. Different processes can share the same physical memory in a COW fashion; only when a write occurs does the system make a copy of the original page.

A few systems [23, 13] use hashing to identify blocks. If the hash values of two blocks are equal, then the two blocks contain the same data. Rsync [23] breaks a file into fixed-size blocks. LBFS [13] breaks a file into variable-size blocks by setting block boundaries based on file contents. Both of them avoid sensitivity to shifting file offsets.

To improve the granularity, some systems generate deltas at the granularity of records [21, 22, 6]. This is natural for database systems. Text files can be treated as records of lines. However, this approach does not work well with binary files. Some systems attempt to break their input into line-delimited records, but perform poorly [8].

Some recent work [2, 11] attempts to achieve better compression at the granularity of bytes. These systems make no assumptions about the structure of the file being versioned, but they need to scan both the reference file and the target file (the new version of the file) to compute the delta. They work well on binary files, except for large files, and are mainly used for backup to tertiary storage and for reducing network traffic.

3.3 Delta Transmission

Several attempts have been made to transmit a delta instead of a full copy of a file or file system.
CVSup [15] is designed to support CVS over a network. When updates occur, CVSup extracts new deltas directly from the RCS files on the server and edits them into the client's RCS files. A client can also request a specific version, and the server only sends the difference between the existing version and the desired version.

The HTTP Distribution and Replication Protocol (DRP) [24] is designed to efficiently replicate a hierarchical set of files to a large number of clients. No assumptions are made about the content or type of the files. It uses content identifiers to represent different files and their versions. A data structure called an index is used to describe the state of a set of data files, e.g., the version, size, and type of each file. The client can determine which files have changed on the server through the index. When a file changes on the server, the client can issue a differential GET request for the file, and the server can compute the difference between the two versions and return the delta rather than the entire file. This protocol does not specify how different versions are stored on the underlying file system.

LBFS [13] is a network file system designed for low-bandwidth networks. It exploits cross-file similarities. The LBFS server divides files into chunks and indexes the chunks by hash value. The LBFS client similarly indexes a large persistent file cache. When transferring a file between the server and client, LBFS only sends chunks of data that are not stored on the recipient and identifies those chunks that can be found in other files on the recipient.

The ADSM backup system [2] computes deltas at the byte level on the client side. It uses version jumping to achieve constant restoration time. The client keeps copies of previously backed-up files in a file store. It computes the delta between the reference file and the current version, and only the delta is sent to the server. The server simply stores the delta, but it may need to keep several reference files and can only delete them when no clients have a reference to them.

XDFS [12] integrates delta compression with a file system. It is oriented towards efficient secondary storage. It does not integrate with a delta transport mechanism, but includes the ability to extract a delta between an arbitrary pair of versions from storage. A prototype system named xProxy has been implemented on top of XDFS to provide delta-compressed HTTP transport.

Some distributed shared memory (DSM) systems also exploit delta transmission to reduce network traffic. To reduce the impact of false sharing, a multiple-writer protocol is used in Munin [4] and TreadMarks [10]. Initially a shared page is write-protected. At the first write, the DSM system makes a copy of the page (a twin) and removes the write protection so that further writes apply directly to the twin. The twin and the original copy can later be compared to create a delta, which is propagated to all other copies of the page.

Chapter 4

Design

4.1 Overview

This chapter presents the design of Mammoth+, which is based on Mammoth. The extension conforms to the system architecture of Mammoth as described in Chapter 2. Most of the modifications fall within the Version Control and Replica Control components. The former is responsible for generating deltas and managing version storage. The latter is responsible for extracting deltas between an arbitrary pair of versions and transmitting these deltas through the network.
A forward delta chain is used to archive the list of a file's versions. Mammoth+ makes a full copy of the current version of a file upon the first write. Subsequent writes are applied to the copy and versioned as a delta against the previous version. To capture the modifications to a file, Mammoth+ uses an in-core bitmap to record changed blocks during an update session. When the update finishes, the system copies the new values of the changed blocks into a delta file (dfile) and writes a sorted list of modified block numbers to another file (ifile). As in standard Mammoth, a version entry is appended to the file history log indicating the new version's creation node, creation time, branching information, etc. Mammoth+ adds a new field to the entry to indicate the file length of this version. A few new fields are also added to the common file metadata to indicate the latest version stored on this node, the number of versions on this node, etc. These are used in file replication and restoration.

When replicating a file, the system must compute a delta between two replication points and send the delta over the network. The system scans the file history log to generate a list of all intermediate versions between the two versions in question. It then reads the ifiles corresponding to the intermediate versions to generate a changed block list between the replication points. If ownership changed during this period, a request is sent to the previous owner node to calculate the changed blocks for the interval before the ownership transfer. Finally, the system reads the changed blocks' data from the current version and sends it to the replica node.

To restore an old version, Mammoth+ reads both the ifiles and the dfiles in reverse order of the delta chain, that is, from the target version back towards the first version. The data is read from the dfiles and copied into the restored file.

Mammoth+ follows the communication model of the original design of Mammoth. The major modification is that file data is not transmitted in its entirety during replication; only the delta is transmitted, block by block. Normal file operations are still allowed during replication.

4.2 File Versioning

4.2.1 Granularity of Delta Compression

Files are divided into blocks in Mammoth+. If a portion of a block is modified, the whole block is copied to the delta. Since Mammoth is a general file system, it makes no assumptions about the format of the files stored on it, so record-level granularity is not a viable choice. Byte-level compression has good compression results, but it either needs to operate on two full versions of a file (e.g., xdelta [11]) or must break the file into variable-length segments (recording the changed byte ranges during updates). Both approaches increase the complexity of delta generation, especially when the file is large. In Mammoth, every version is kept on the owner node for recovery from user mistakes, so the compression algorithm must be fast enough to minimize the impact on client response time. Block-level operation introduces only a small amount of overhead for versioning. It also simplifies the replication and restoration process.

4.2.2 Delta Chaining Direction

The choice of forward delta chaining for COW has several benefits. During an update session, the system only needs to record which blocks are changed and update the file in place. Upon finishing an update, the system copies all the changed blocks' values into the delta, thus removing the costly copy from the critical path of client access. It may also exploit the caching effect in this case. The file truncation operation is very fast since there is no need to do a copy.

The main drawback of forward chaining is that restoring an old version may take longer. Since a delta only stores the changed blocks between two versions, restoring an old version could potentially visit all deltas from the first version to the target version.
It may also exploit the caching effect in this case. The file truncation operation is very fast since there is no need to do a copy. The main drawback of forward chaining is that restoring an old version may require longer time. Since a delta only stores the changed blocks between two versions, restoring an old version could potentially visit all deltas from the first 21 version to the target version. Reverse delta chaining solves this problem by storing old blocks in the delta. Restoring an old version only needs to read the deltas from the target version to the current version. Considering the fact that the most recent versions are accessed far more frequently than older versions, this scheme works better than the forward scheme for file restoration. Reverse delta chaining also eliminates the need to make a full copy of the first version that slows down the creation of the second version in the forward delta chaining. The file appending operation is fast in this case since there is no need to copy old values. However, several drawbacks of reverse delta chaining make it undesirable. Firstly, copying the old block is in the critical path of file update. It is likely that the old value is not in the cache. Copying old blocks from disk significantly slows down client response. Secondly, the copy of old block data must be made persistent before overwriting this block. Otherwise all previous versions may be lost when failures occur. This happens when the system writes the new value to disk but fails to make the delta persistent before the failure. In this case, there is no way to restore any of the previous versions even though most of the deltas are on disk. To prevent an occasional failure, the system needs to call fsync each time when an overwrite happens. This synchronous operation increases versioning overhead and slows down client response. Finally, when a file is truncated, the system needs to copy all of the truncated blocks unless it is truncated to 0. The comparison of these two schemes is shown in Table 4.1. 4.2.3 Bitmap vs. Block List Two alternative data structures were considered to keep track of changed blocks during an update session. One is the bitmap, the other is the block list. Table 4.2 22 Number of fsync calls Cache hit rate File truncate less very high File overwrite File append Generate second version Client response fast fast slow more moderate or low slow or fast slow fast — . J l c « t Restore recent version 1 slow slow fast fast slow fast Table 4.1: Comparison between forward and backward delta presents the cost of some common operations between these two data structures. Bitmap is superior in most of the operations, but has a fixed cost proportional to the file size for scan and I/O. This overhead is significant when the number of blocks changed between two versions is small, in which case the block list works better. I search I I ^ . - '-insert delete sort scan 0(1) 0(1) 0(1) 0(N) O(M) __0(1)_ O(M) tflock list (sorted) O(logM) ~OjM)~ ~0(M\ogM)' OjM) O(logM) + O(M) 0(M) Table 4.2: Operation cost of bitmap and block list. N is the number of blocks in the file, M is the number of changed blocks, "-" means not applicable. A compromise solution is adopted in Mammoth-)-. It stores and transmits the block index as a list of block numbers, exploiting the fact that most of the time only a small fraction of a large file will change [14]. On the other hand, it uses an in-core bitmap to do the computation. 
4.2.4 Creating a New Version

A file version is created when a file is closed after modification. When a file is modified for the first time, the server makes a full copy of the original version in the shadow directory. It then overwrites the file and records the changed blocks in the in-core bitmap. When the file is closed, the server saves the bitmap as a sorted block number list in an index file (ifile) and copies the changed blocks' data into a delta file (dfile) in the order of block numbers. After all changed blocks are processed, the server flushes the ifile and dfile to disk. As in standard Mammoth, a history entry is also added to the file's history log, indicating its creation time, creation node, file length, sequence number, etc. Subsequent versions repeat the process of updating the bitmap, generating the ifile and dfile, and logging the file history.

Special care is necessary when recording the changed blocks if the file's size changes. If a file grows, all the blocks beyond the original end of the file are treated as changed. If a file is truncated, no blocks need to be recorded as changed: it is easy to find the truncated blocks by comparing the file sizes of the two versions, so the system only needs to keep the file size of each version in this case.
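The following sketch shows the close-time work just described: the sorted block list is written to an ifile and the new contents of each changed block are copied from the current file into a dfile, after which both are flushed to disk. The raw uint32_t encoding of the ifile, the function name, and the parameters are assumptions made for the example (the block list could come from the hypothetical bitmap_to_list above), and the history-log update is omitted.

```c
#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>

#define BLOCK_SIZE 4096

/* Create the delta for one update session (Section 4.2.4).
 * `cur_fd`   - the current file, already containing the new data
 * `ifile`    - path of the index file to create
 * `dfile`    - path of the delta file to create
 * `list`/`m` - sorted changed-block numbers
 * Returns 0 on success, -1 on error. */
static int write_version(int cur_fd, const char *ifile, const char *dfile,
                         const uint32_t *list, size_t m)
{
    char buf[BLOCK_SIZE];
    int ifd = open(ifile, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    int dfd = open(dfile, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (ifd < 0 || dfd < 0)
        goto fail;

    /* ifile: the sorted list of changed block numbers. */
    if (write(ifd, list, m * sizeof list[0]) !=
        (ssize_t)(m * sizeof list[0]))
        goto fail;

    /* dfile: the new data of each changed block, in block-number order
     * (the compact layout of Figure 4.2(b)). */
    for (size_t i = 0; i < m; i++) {
        ssize_t n = pread(cur_fd, buf, BLOCK_SIZE,
                          (off_t)list[i] * BLOCK_SIZE);
        if (n < 0 || write(dfd, buf, (size_t)n) != n)
            goto fail;        /* the last block may be shorter than 4 KB */
    }

    /* Make the delta durable before the version entry is logged. */
    if (fsync(ifd) < 0 || fsync(dfd) < 0)
        goto fail;
    close(ifd);
    close(dfd);
    return 0;

fail:
    if (ifd >= 0) close(ifd);
    if (dfd >= 0) close(dfd);
    return -1;
}
```

Because this copy happens after the update session finishes, none of the work sits on the critical path of the client's writes, which is the main argument for forward chaining in Section 4.2.2.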
The block numbers in the ifile serve the mapping between the delta and the actual file of this version. Since the size of the dfile does not reflect the actual size of this file version, the file history log records the file size of each version for reconstruction. The new field added to the version log entry is fileJength. It is possible to preserve the file length in the dfile and copy the changed blocks in the same place as the actual version. Figure 4.2(c) shows the organization of this alternative. Though the dfile matches closely with the actual file, leaving the dfile with holes may waste the inode space and increase disk seeking time when the number of blocks changed between two versions is small. fa) (b) (c) unmodified block modified block 'block with no data (a) the file being versioned (b) compact delta (c) delta with holes Figure 4.2: Delta file format. To support copy-on-write, a few new fields are added to the common file metadata as shown in Table 4.3. The last real version refers to the version of the 26 current file. It is important for an interested node requesting a specific version. Only when both the target version and the real version are present can the server compute the delta. Pos-prever is used to support the reverse reading of the file version log described in Section 4.3.1. Name Purpose blkdirJ l Has the block list directory been created last_realnd Creation node of last real version last_realtm Creation time of last real version last_realseq Sequence number of last real version realver_c Number of real versions on this node pos_prever Position of previous entry Table 4.3: Added fields to the common file metadata. 4.3 F i l e Rep l i ca t ion In Mammoth, a file is replicated by the owner node to a set of replica nodes based on its replication policy. In standard Mammoth, the system simply sends the most recent version (current version) to the replica nodes. With C O W , Mammoth-}- can only send the difference between the current version and the last real version on the replica node (last replication point). However, the system must keep track of the last replication point for each replica node and compute the delta during the replication interval. The following are the major steps to replicate a current version with C O W : • Step 1. Read file history. Get a list of file history entries starting from the last replication point to the current version (new replication point). • Step 2. Calculate the changed blocks bitmap. Read all the block indices 27 corresponding to the intermediate versions. Set the corresponding bits in the bitmap. • Step 3. Record the current version number and then read all the changed blocks' data from the current file and send it to the replica node. • Step 4. Check if the file is opened for write. If yes, make a copy of the current bitmap of this version. Then check the file metadata to see if new versions are created during replication. If yes, calculate the changed blocks using the same method as step 1 and 2. • Step 5. Calculate the union of the bitmaps of step 2 and step 4. If the file is not modified during replication, then go to step 6. Otherwise, read and send the old blocks from the new replication point until the last replication point. • Step 6. Update the replication point and add an entry to the history log that a copy of this version is on the replica node. On the replica node, the processing is similar with file versioning as discussed in the previous section. 
For a non-replica node that requests a new version, it must specify the last real version on it in the request. The processing is quite similar except there is no need to deal with the replication point. 4.3.1 Reading File History The replicating node (replicator) needs a complete list of the ifiles between two replication points to compute the changed blocks. As mentioned in Section 4.2.4, an ifile is named by the creation node and creation time, and the file history logs 28 this information for each version. T h e replicator needs to find a l l the version entries between the two replicat ion points. A version log entry indicates the creation node, creation t ime and sequence number for this version. It also contains the same information of the version (branch version) from which a branch occurs. If no branches occur, a l l version entries be-tween the creation t ime of the repl icat ion points constitute a version chain. W h e n branches exist, the replicator needs to find a l l the entries belong to the desired chain. T h e replicator first creates an array of l inked list indexed by the sequence number. T h e l inked list is composed of the version entries w i t h the same sequence number. T h e n the replicator scans the array i n reverse order to find the branching chain. A version's parent is determined by searching the l inked list w i t h sequence number one less than this version. If an entry has the same branch version or its creation node is the same as the branch version of the version i n question, then it is the parent. A branching chain example is shown i n Figure 4.3. Note that more than one entry could correspond to one version due to duplicates or repl icat ion entries (recording a replica of this version on other nodes). It is possible to sort the chain by the combination of creation node and creation time. B u t this approach adds more overhead than the one implemented. The history log is scanned i n reverse order due to the fact that the log is append-only and usually a few recent version entries are needed. T h e replicator uses the pos-prever field i n the log entry to locate the previous record i n the log file. Since metadata updates may arrive out-of-order to an interested node, the history entries i n the log are not ordered by the creation t ime or sequence number. U p o n reading the entry corresponding to the last replicat ion point, the replicator tries to find the complete branching chain as described i n the previous paragraph. If the chain is 29 ———^i {(nd3,4), (ndl,2)} {(nd2, 4), (ndl, 1)} Note: {(nd3, 4), (ndl, 2)} means version 4 created by node nd3 is branched from version 2 created by node ndl. Figure 4.3: Sorting file history. incomplete, the replicator reads a few additional entries and checks again. This process iterates until the chain is complete. Since metadata updates are applied in order most of the time, the missing entries should be found within a few blocks. Two alternatives were considered to handle this situation. One simple way is to read all of the history entries and then order them. This may add signifi-cant overhead when the history log is very long and only the last few versions are needed. The other is to sort the history entries and write them back to disk when necessary. So the file history log is composed of two parts. The first part is sorted history entries. The second part is unsorted history entries. 
To read the last few versions, the system locates the start position of history entry in the first part and 30 reads sequentially until reaching the desired version. Then it scans the second part linearly. Although this method avoids the linear scan of the whole history log, it adds significant overhead when writing back the sorted history log. 4.3.2 Calculating Changed Blocks If the replicator contains all the versions between two replication points, there is an ifile corresponding to each version on this node. The replicator simply reads all the ifiles and sets the corresponding bit in a bitmap. Then all changed blocks during this interval are labeled in the bitmap. However, it is possible that the replicator has only a subset of those versions, and may not have the version of the last replication point, e.g., when ownership changed or when the replicator is not the owner. The ifile in this case stores the changed block list between two closest versions on a specific node. The replicator must find all the ifiles in a delta chain. The replicator divides the branching chain (described in Section 4.3.1) into several chunks as follows. Vi, —j Vini ^Vn^ Vk,...,Vj node N\ node N2 . . . node A Each chunk consists of a small delta chain on the same node. Node A is the replicating node and it contains an intermediate version Vfc and the new replication point version Vj. Node A/2 contains some other intermediate versions Vn and Vm. Node N\ contains the last replication point Vi and the intermediate version Vm. For a node containing a specific version, it is either the creation node of this version, or another node in the interest-set or replica-set that has a copy of this version. As described in Section 2.1, the log entry always records the creation node of a version. For nodes other than the creation node that have a copy of a specific version, a 31 special kind of log entry records the node ID in the entry's data field and sets the entry's operation field as eFOReplicate (see Section 2.3). The branching chain is scanned in reverse order. The system tries to find the maximum number of versions on one node. If the smallest version (a version with smallest sequence number) found so far is not the last replication point, the replicator picks another node randomly among the nodes that have a copy of the smallest version and finds all the versions on that node. This process repeats until it reaches the last replication point. Once the branching chain is divided, the ifiles in each chunk are visited and a changed blocks bitmap is generated for each chunk. The union of all the bitmaps reflects all the changed blocks between two replication points. Since each chunk resides on a different node, only the last chunk can be operated on the replicating node. The replicator sends a request to the hosting nodes of other chunks with the start version and end version of each chunk. Some of the hosting nodes may fail to respond the request. A timeout value is set when a request is sent out. If timeout occurs, the replicator simply supposes all blocks are changed during the replication interval and makes a full version replication. Since requesting another node to compute block changes will delay the repli-cation process, two optimizations are made to save the remote request. One opti-mization is to check how many blocks are changed for the versions on the replicating node. If the number of changed blocks are higher than a threshold, it is handled as a full file copy. 
A second optimization is to check the block list of the smallest version in the last chunk (Vk in the above example). If it only adds a few number of changed blocks to the bitmap of last chunk, the replicator can add those blocks into the bitmap. This is because the ifile of the smallest version represents the changed blocks between the smallest version and a version generated before the last repli-32 cation point. The number of blocks in this ifile is the upper bound of the changed blocks between the smallest version and the last replication point. 4.3.3 Discussion Our design shows that to compute the changed blocks during a replication interval, the system first needs to read and sort the file version entries, divide the delta chain into several chunks, compute changed blocks in each chunk, and make a union of all these blocks. In this process, several nodes may be involved and several ifiles visited. , This process may add significant overhead to replication when a large number of intermediate versions exist between two replication points. One possible solution is to keep an accumulated changed block list for each replica on the owner node from the last replication point. Whenever a new version is created, the owner updates the accumulated block list. In this case, all changed blocks are immediately available when replicating. However, updating the accumulated block list involves two disk operations (reading and writing), the combined overhead to accumulate the changed blocks is even higher than the overhead to compute the changed blocks in our current implementation. Another drawback of the accumulating approach is that the owner needs to keep an accumulated block list for each replica (recall that replicas are not equal). It is a heavy burden when a file has a large replica set. To minimize the overall overhead, we decided that the system only computes the changed blocks during replication. In our current implementation, requesting other nodes to compute changed blocks delays the replication process. As an alternative, the old owner can transfer the changed block list to the new owner when transferring ownership. This approach 33 eliminates the need to request the changed block list, but makes transferring own-ership heavy weight. If the new owner has a copy of the version corresponding to the last replication point, it can compute the changed block list without the help from the previous owner. To simplify the design and reduce the network traffic, we chose the design that the system only computes changed blocks on demand. 4.4 F i l e Res tora t ion Although no user tools are available to access an old version at the current stage, file restoration is supported in Mammoth+. The system applies deltas in reverse order to the current file. Since the current file is a full copy of the last real version (the version corresponding to the current file), the system exploits the fact that only part of the blocks changed between the last real version and the target version (the version to be restored). The system only applies the changed blocks from dfiles. Following are the steps for restoring an old version locally, that is, when the target version is replicated on the requesting node. • Step 1. Create a bitmap for the changed blocks. If the last real version is older than the target version, set all the bits in the bitmap. Otherwise, calculate the changed blocks from the target version to the last real version and set the corresponding bits in the bitmap as described in Section 4.3. 
• Step 2. Get the list of file version entries that constitutes the delta chain on this node. It involves reading and sorting the file version entries with sequence numbers less than or equal to the target version (see Section 4.3.1), and exam-ining which versions reside on the requesting node along the branching chain. If the last real version is older than the target version, version entries with 34 sequence number less than or equal to the last real version are excluded. • Step 3. Open the ifile and dfile of the target version. • Step 4. For each block number in the ifile, check whether the corresponding bit is set in the bitmap. If it is set, copy the old value of this block from the dfile into the current file and clear the corresponding bit in the bitmap. • Step 5. If no bits are set in the bitmap or no version entries are left unpro-cessed, go to step 7. • Step 6. Get the next version entry from the delta chain (in reverse order) and open the ifile and dfile of this version. Go to step 4. • Step 7. Set the last real version as the target version and stop. It's possible that the target version is not present on the requesting node. In this case a request indicating the last real version and the target version is sent to a node that has a replica of the target version. If that node can not compute the changed blocks between these two versions (in step 1), it may request block list from other nodes as discussed in Section 4.3.2. The old block data are sent to the requesting node. 4 . 5 C o m m u n i c a t i o n P r o t o c o l In Mammoth, all inter-node communications are based on an event queue [16]. The message header contains a field indicating the type of event. The underlying protocol is a mixture of T C P and UDP. The Liveliness monitor and 2PC are on top of UDP. All other requests and file replication are implemented on top of TCP. Mammoth+ 35 follows the same framework. It adds a new request to transfer the changed block list between two versions. To support online replication, the replicator transmits each block number followed by the data of this block. Thus , when an update occurs dur ing replication, the replicator can easily resend the changed blocks from the dfiles. Note that because deltas are processed i n chronological order, block numbers may be sent out of order. It is infeasible to send the changed block list and the changed blocks ' data separately. T h i s approach also applies to requesting an old version from other nodes. F i l e replicat ion involves pushing data from the owner node to the replica node. T h e owner node knows the last real version on the repl ica node and which version w i l l be replicated. T h i s information is sent to the repl ica node upon repli-cation. W h e n an interested node is requesting file data or the changed block list, it must present this information. T h e major fields of a message header is shown i n Table 4.4. Name Purpose file_path The pa th of the requested file op-type T y p e of operation (e.g., request data, replication, restoration) reaLver Last real version on the requesting node or repl ica node request _ver The target version da taJen A d d i t i o n a l data length data A d d i t i o n a l data (e.g., block list) Table 4.4: Message header for transferring data and block list. 4.6 Summary T h i s chapter describes the design of C O W i n the context of the M a m m o t h file system. Four aspects are presented: file versioning, file replication, file restoration 36 and the communicat ion protocol . 
4.5 Communication Protocol

In Mammoth, all inter-node communications are based on an event queue [16]. The message header contains a field indicating the type of event. The underlying protocol is a mixture of TCP and UDP. The liveliness monitor and 2PC run on top of UDP; all other requests and file replication are implemented on top of TCP. Mammoth+ follows the same framework. It adds a new request to transfer the changed block list between two versions.

To support online replication, the replicator transmits each block number followed by the data of that block. Thus, when an update occurs during replication, the replicator can easily resend the changed blocks from the dfiles. Note that because deltas are processed in chronological order, block numbers may be sent out of order, so it is infeasible to send the changed block list and the changed blocks' data separately. This approach also applies to requesting an old version from other nodes.

File replication involves pushing data from the owner node to the replica node. The owner node knows the last real version on the replica node and which version will be replicated. This information is sent to the replica node upon replication. When an interested node requests file data or the changed block list, it must present this information. The major fields of a message header are shown in Table 4.4.

    Name         Purpose
    file_path    The path of the requested file
    op_type      Type of operation (e.g., request data, replication, restoration)
    real_ver     Last real version on the requesting node or replica node
    request_ver  The target version
    data_len     Additional data length
    data         Additional data (e.g., block list)

Table 4.4: Message header for transferring data and block list.

4.6 Summary

This chapter describes the design of COW in the context of the Mammoth file system. Four aspects are presented: file versioning, file replication, file restoration, and the communication protocol.

The basic idea is to store each version as a delta. Forward delta chaining is selected to minimize client response delay. File deltas are block-aligned. An in-core bitmap is used to record changed blocks. When an update is finished, the changed blocks' data is copied to a dfile and the bitmap is saved in an ifile as a list of block numbers.

When replicating a file, only the delta between two replication points is transmitted over the network. The system computes the changed blocks from the ifiles of all the intermediate versions, and reads the actual data from the current file. If some of the ifiles reside on another node, a request is sent to calculate the changed blocks during that interval.

File restoration is also supported with COW. The system applies deltas in reverse order. If a file cannot be restored locally, a request is sent to a node that has the version in question.

This design follows the event queue model of standard Mammoth for inter-server communication. It makes minor modifications to the message format and adds two new request types (restoration and block list). When transmitting file data over the network, it first sends the block number and then the data of that block. The underlying communication protocol for COW is TCP.
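To make the versioning path concrete, the following sketch shows one possible shape of the close-time step summarized above: the in-core bitmap is converted into an ifile block list while the new contents of the changed blocks are copied from the current file into a dfile. The names and on-disk formats are hypothetical, not Mammoth+'s actual code:

    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_SIZE 4096

    /* Called when an update session ends.  For every bit set in the in-core
     * bitmap, append the block number to the ifile and the block's new data
     * (read back from the current file) to the dfile. */
    int version_on_close(const uint8_t *bitmap, uint32_t nblocks, FILE *current,
                         const char *ifile_path, const char *dfile_path)
    {
        FILE *ifile = fopen(ifile_path, "wb");
        FILE *dfile = fopen(dfile_path, "wb");
        uint8_t buf[BLOCK_SIZE];
        uint32_t blkno;

        if (ifile == NULL || dfile == NULL)
            return -1;
        for (blkno = 0; blkno < nblocks; blkno++) {
            if (bitmap[blkno / 8] & (1u << (blkno % 8))) {
                fwrite(&blkno, sizeof(blkno), 1, ifile);      /* block list */
                fseek(current, (long)blkno * BLOCK_SIZE, SEEK_SET);
                fread(buf, 1, BLOCK_SIZE, current);
                fwrite(buf, 1, BLOCK_SIZE, dfile);            /* changed data */
            }
        }
        fclose(ifile);
        fclose(dfile);
        /* A file version entry would then be appended to the metadata log. */
        return 0;
    }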
Chapter 5 Evaluation

This chapter evaluates the performance of copy-on-write in Mammoth+. The experiments are designed to show whether COW achieves better performance than non-COW and to what extent. By "non-COW" we mean the standard, unmodified Mammoth.

The experimental environment consists of 266MHz Pentium II PCs as clients and 1.7GHz Pentium IV PCs as Mammoth server nodes, both running Linux. The client machines have 128MB of RAM and the servers have 256MB. The underlying file system is EXT3 on the clients and EXT2 on the servers. All these machines are connected by a 100Mb switched Ethernet network. The hard disk on the servers is a Maxtor 5T040H4. The experiments are conducted with no other applications running on those nodes and no other traffic on the local network. The measurements of COW are based on a 4096-byte block size. Unless specified explicitly, all modifications to a file overwrite a contiguous sequence of blocks. All numbers except the benchmark performance reported here are the median of a dozen runs.

The remainder of this chapter is organized as follows: Section 5.1 measures the performance of creating file versions. Section 5.2 describes the performance of transmitting a version to a remote node. Section 5.3 shows the performance of restoring a previous version. Section 5.4 presents a benchmark result. Section 5.5 gives a brief summary.

5.1 Performance of File Versioning

As described in the previous chapter, file versioning is performed on three kinds of nodes: the owner node, interested nodes, and replica nodes. In copy-on-write mode, all these nodes make a copy of the changed part of a file after an update. In non-COW mode, only the owner node makes a copy of the whole file before modification, whereas the interested nodes and replica nodes rename the file before modification. As a result, we only compare file versioning performance on the owner node. To minimize the impact of other components in Mammoth, only one instance of the Mammoth server is running. There are no nodes in a file's interest set and replica set except for the owner.

Since a file is versioned immediately after the modification session in COW, the file data is in the cache. For non-COW, file versioning occurs before the modification session, so the file data may not reside in main memory. Thus both warm-cache and cold-cache cases are tested. Here, "warm cache" means both the file's data and metadata reside in memory; "cold cache" means that the system needs to read the file's data and metadata from disk. In this sense, COW can be viewed as a warm-cache case.

When reading from disk, we cannot overlook the disk seek time. Our experiment tries to keep the seek time consistent by interleaving files on disk. For example, the system first creates 500 files of the same size (4KB) sequentially so that the files are spread evenly, and then creates the second versions of all these files sequentially (in the creation order). Thus the disk seek time for each file is approximately the same. The system creates enough files to make the seek time significant. The number of files, the interleave depth, used for each file size is shown in Table 5.1. When writing to disk, the fsync call returns immediately once data is transferred to the disk. Thus we only measure the data transfer time to a lightly loaded disk, which is achieved by allowing a sufficient time interval between modifications to two different files. In the warm-cache cases, we only created a dozen files and updated them sequentially.

    File size         4KB  16KB  64KB  256KB  1MB  4MB  16MB  64MB
    Interleave depth  500  500   500   200    100  20   8     4

Table 5.1: The interleave depth.

As described in Section 4.2, in COW mode the owner node sets the corresponding bits during file modification. After it receives a close call from the client, it converts the bitmap into a list of block numbers and writes them to an ifile. It also copies the new values of the changed blocks to a dfile. Finally, a file version entry is added to the metadata file. In non-COW mode, the owner node makes a copy of the original file on disk before the update session. After it receives the close call, it adds an entry into the log (metadata file).

Since the bitmap in COW mode is an in-core data structure, the overhead of setting bits during an update is negligible compared to disk I/O, so we did not include this overhead in our test results. File version logging is almost the same in both COW and non-COW; this overhead is also excluded from our test. For COW, we only report the overhead of creating the ifile and dfile. For non-COW, we report the overhead of creating the copy of a file.

Table 5.2 shows the versioning overhead on the owner node. The first column is the update size for COW and the file size for non-COW. We only measured the performance of update sizes that are multiples of the block size because Mammoth+ has block-level granularity.

    Update size /    COW    Non-COW           Non-COW
    File size (KB)   (ms)   warm cache (ms)   cold cache (ms)
    4                1.24   0.67              21
    8                1.30   -                 -
    16               1.39   0.78              22
    32               1.61   -                 -
    48               1.90   -                 -
    64               2.66   1.83              30
    192              4.77   -                 -
    256              6.11   5.42              56
    768              16     -                 -
    1024             21     20                101
    3072             103    -                 -
    4096             140    116               339
    15360            607    -                 -
    16384            652    601               1430
    61440            5070   -                 -
    64512            5750   -                 -
    65536            6070   5660              6300

Table 5.2: File versioning overhead. "-" means not available.

As expected, the versioning overhead increases linearly with the update size for COW and with the file size for non-COW (shown in Figure 5.1). In warm-cache cases, COW is always slower than non-COW when the update size equals the file size. This is due to the extra write of the ifile.
The performance degradation is significant for small files, e.g., 85% for 4KB files. As the file size increases, it becomes insignificant, e.g., 7% for 64MB files. However, when compared with the cold-cache case of non-COW, COW performs very well since it saves disk reading time due to the caching effect. Recall that we used a large interleave depth to maximize disk seek time, so the fourth column in Table 5.2 nearly represents the worst-case performance for non-COW. In this case, COW outperforms non-COW whenever the update size is less than or equal to the file size.

Figure 5.1: File versioning overhead vs. update size or file size.

Since column 3 in Table 5.2 shows the best-case performance of non-COW, we do a further comparison with COW (column 2 in Table 5.2) to show the benefit of COW. For files no larger than 16KB, even the best-case COW versioning overhead (4KB update size, 1.24ms) exceeds the worst-case non-COW overhead (16KB file, 0.78ms). When the file size reaches 64KB, there is no big difference between COW and non-COW (see Figure 5.2). When the file size is greater than 256KB, COW begins to show its merit. Figure 5.3 shows the performance gain of COW over non-COW, defined by the following formula:

    gain = (versioning overhead for non-COW) / (versioning overhead for COW)

The smaller the update size, the larger the performance gain. The largest gain for COW (4KB update size) is 4.4 for a 256KB file and 4565 for a 64MB file (e.g., 5660ms / 1.24ms ≈ 4565 using the warm-cache numbers in Table 5.2). Only when the update size is greater than 75% of the file size does COW performance begin to degrade (gain < 1). When the update size equals the file size, COW has the smallest versioning performance gain, e.g., 0.89 and 0.93 for 256KB and 64MB respectively.

Figure 5.2: File versioning overhead for 64KB files.

Figure 5.3: File versioning gain for COW over non-COW.

From the above analysis, we can draw the following conclusions:

• COW versioning overhead increases linearly with the update size.

• Non-COW versioning overhead increases linearly with the file size.

• For small files (<64KB), non-COW has a lower update cost than COW.

• For large files (>=256KB), COW has a lower update cost than non-COW when the update size is less than 75% of the file size. The smaller the update size, the larger the savings of COW. When the update size equals the file size, COW degrades by less than 20%.

5.2 File Replication Performance

Replication is tested between two Mammoth nodes with varying update sizes and file sizes in both warm-cache and cold-cache conditions. When a file is replicated, there is no reading or writing from the client nodes. When testing with a cold cache, we used the same interleave depths as in Table 5.1. For copy-on-write, several configurations are tested:

• COW1 - Replicate each version.

• COW4a - Replicate every fourth version. Each version has the same update size and changed block list.

• COW4b - Replicate every fourth version. Each version has the same update size but a different changed block list.

• COW16a - Replicate every sixteenth version. Each version has the same update size and changed block list.

• COW16b - Replicate every sixteenth version. Each version has the same update size but a different changed block list.
All these versions are created on one node and replicated to another node. There are no branches in a file's history. Since file replication on the replica node is simply receiving the file data and versioning the file upon completion, we only report the performance on the sending node.

Recall that in COW mode file replication includes reading the file history, calculating the changed block list, and sending the changed blocks to the replica. If a file is not updated during its replication, all changed blocks are read from the current file; in our experiments, we did not update a file during its replication. In non-COW mode, the sender just reads the file data from the current file and transmits it to the replica.

Table 5.3 shows the replication overhead with a warm cache (all file data are available in main memory). It shows that when the update size equals the file size, COW performance degrades greatly for small update sizes compared to non-COW. For example, COW1 degrades by 20% and COW16a by 80% with a 4KB update size. As the update size and file size increase, COW matches the performance of non-COW; it degrades by less than one percent with a 64MB update size (COW16a).

Two main contributors to the extra overhead of COW are reading the file history and calculating the changed block list. The overhead of reading the file history is 38µs, 63µs, and 155µs for COW1, COW4a (or COW4b), and COW16a (or COW16b) respectively, regardless of the update size. The overhead of calculating the changed block list is shown in Figure 5.4. It increases with both the update size and the number of intermediate versions. The update size in Figure 5.4 is the total update size between two replication points; thus, COW4b (or COW16b) has less overhead than COW4a (or COW16a).

Figure 5.4: Overhead of calculating the changed blocks with a warm cache.

    Update size /    COW1   COW4a  COW4b  COW16a  COW16b  Non-COW
    File size (KB)   (ms)   (ms)   (ms)   (ms)    (ms)    (ms)
    4                0.48   0.55   -      0.72    -       0.40
    12               1.51   -      -      -       -       -
    16               1.63   1.68   1.68   1.88    -       1.56
    32               2.52   -      -      -       -       -
    48               4.33   -      -      -       -       -
    64               5.32   5.38   5.37   5.57    5.57    5.05
    192              16     -      -      -       -       -
    256              22     22     22     22      22      21
    768              89     -      -      -       -       -
    1024             123    124    123    124     123     122
    3072             392    -      -      -       -       -
    4096             520    520    520    521     521     511
    15360            1980   -      -      -       -       -
    16384            2120   2120   2120   2120    2120    2080
    64512            8500   -      -      -       -       -
    65536            8620   8620   8620   8630    8620    8550

Table 5.3: Replication overhead on the sending node with a warm cache.

Another trend is that replication overhead increases with the update size for COW and with the file size for non-COW. Except for very small update sizes or file sizes (<=16KB), the replication overhead is linear in the update size (COW) or file size (non-COW) (Figure 5.5). When only part of a file is changed, COW incurs less replication cost than non-COW, and the improvement is almost linear in the update size. The smaller the update size, the faster the replication process for COW. The largest gain for a 64MB file is 17812 (4KB update size, i.e., 8550ms / 0.48ms).

Figure 5.5: Replication overhead with a warm cache.

Table 5.4 shows the performance with a cold cache. For COW, the sending node needs to read the file history, the ifiles, and the file data from disk. For non-COW, the sending node needs to read only the file data from disk. We interleaved files in the same way as described in Section 5.1. That is, we first created a certain number of files of the same size (the interleave depth in Table 5.1).
Then we overwrote them sequentially with the same update size to create the second version of each file; the third version was created in the same way as the second. This process was repeated until we had the desired number of versions. For COW4b and COW16b, the interleave depth was based on the total update size between two replication points.

As Table 5.4 shows, file replication overhead increases with the update size in COW and with the file size in non-COW. When the update size equals the file size, COW incurs more overhead than non-COW. With more intermediate versions, the sending node needs to read more ifiles to calculate the changed block list, thus adding more overhead for replication. Figure 5.6 shows the overhead of calculating the changed block list. The overhead does not increase with the update size. This is due to the large interleave depth for small files: the disk seek time becomes dominant in this case. As a result, our experiment is a worst case for COW, whereas it is a best case for non-COW, since the sending node only needs to read the file data from the current file with the least seek time. Even in this extreme situation, COW can still outperform non-COW when the file size is large and the update size is small. As shown in Figure 5.7, when the update size is less than 25 percent of the file size, COW1 outperforms non-COW for files larger than 256KB, COW4a outperforms non-COW for files larger than 1MB, and COW16a outperforms non-COW for files larger than 4MB. The largest performance gain of COW over non-COW is 668 (COW1).

Figure 5.6: Overhead of calculating the changed blocks with a cold cache.

Figure 5.7: Replication overhead with a cold cache.

5.3 File Restoration Performance

In non-COW mode, file restoration is simply copying the file data from an older version (the target version) to the current file. As described in Section 4.4, in COW mode the system first calculates the changed blocks from the target version to the last real version (the version corresponding to the current file) with a bitmap. Then it examines the ifile of the target version. If the ifile contains a block number that is set in the bitmap, the value of this block is copied from the corresponding dfile to the current file. The system repeatedly examines the other ifiles and dfiles in reverse chronological order until all changed blocks are restored. In order to calculate the changed blocks and locate the ifiles and dfiles, the system needs to scan the file history log (metadata file).

Since file restoration happens less frequently, and previous versions are therefore most likely not in the cache, we only discuss the performance with a cold cache. The files were interleaved in the same way as in the cold-cache replication test in the previous section. Three configurations of COW are tested:

• COW1 - All old values are read from one dfile.

• COW4 - Old values are read from four dfiles.

• COW16 - Old values are read from 16 dfiles.

For COW4 and COW16, the changed blocks are divided equally and contiguously among the deltas. Thus, the system reads the same amount of data from each ifile and the same amount of data from each dfile.
Table 5.5 shows the overhead of restoring the second last version on the owner node. When the file size increases, the restoration time increases in non-COW mode; by contrast, the restoration time increases with the update size in COW mode. When the update size equals the file size, COW is worse than non-COW, and the more deltas involved, the more overhead COW adds. One reason is that COW needs to read the file history log and the ifile of each version. Another reason is the disk seek time caused by our interleave scheme: the more deltas involved, the more disk seek overhead.

From Figure 5.8 we can see that when the update size is 25% of the file size, COW1 outperforms non-COW for files larger than 256KB, COW4 outperforms non-COW for files larger than 4MB, and COW16 outperforms non-COW for files larger than 16MB. The reason is that when the total update size is the same, COW4 and COW16 need to read more ifiles and dfiles than COW1, thus incurring more disk seek overhead.

Since it is more likely that an old block is found within a few versions older than the target version, COW16 can be considered the worst case for COW. Thus, we make a further comparison between the worst case of COW and non-COW hereafter. When the update size is 4KB (or 16KB), we use COW1 (or COW4) for the comparison with non-COW. When a file is smaller than 16KB, COW is 40% to 70% worse than non-COW. When the file size is 64KB, COW ranges from about the same as to 2.6 times worse than non-COW. When the file size is larger than 256KB, COW restoration overhead varies from 208 times less than non-COW (64MB file with a 4KB update size) to 5.4 times greater than non-COW (256KB file with a 256KB update size).

    Update size /    COW1   COW4   COW16  Non-COW
    File size (KB)   (ms)   (ms)   (ms)   (ms)
    4                28     -      -      18
    16               28     34     -      20
    64               32     54     100    28
    256              64     105    320    50
    1024             97     180    421    94
    4096             312    417    753    305
    16384            1330   1430   1690   1290
    65536            5970   6060   6340   5870

Table 5.5: File restoration time.

Figure 5.8: File restoration overhead.

5.4 PostMark Benchmark Performance

We ran the PostMark [9] benchmark on our system to evaluate its overall performance. PostMark is a file system benchmark developed by Network Appliance [9]. It first generates an initial pool of random text files with configurable size bounds. A series of transactions is then conducted; the transactions include creating or deleting a file, and reading or appending to a file. The append operation writes a random number of bytes to the end of the file; if the file's size bound is reached, the extra bytes are discarded. After finishing the transactions, PostMark deletes all remaining active files and generates a final report. Though it aims to simulate a heavy small-file system load, with careful configuration it can be used to measure the performance of Mammoth in creating file versions. Our PostMark configuration is as follows:

• 30 initial files and 150 transactions.

• File sizes range from 64KB to 16MB.

• No read operations. Each transaction appends to a randomly selected file.

• Creates and deletes are equally likely to occur.

All other PostMark parameters are left at their defaults. The test involves only one client node and one server node. The total elapsed time for COW is 315 seconds, a 36% improvement over non-COW, which takes 429 seconds in total. Though the total amount of data transmitted from the client is the same, COW requires less processing time on the server node. Thus it increases the throughput from 2.7MBytes/sec to 3.67MBytes/sec.
5.5 Summary

Our experimental results show that COW reduces versioning overhead for files larger than 256KB when the update size is less than 75% of the file size. It accelerates the client response and increases the server throughput. Network traffic can be reduced because only the delta is transmitted to an interested node or a replica node. The file replication cost ratio between COW and non-COW is close to the ratio of update size to file size when all data is in the cache. When data is not available in the cache, the cost is influenced by the disk layout and the number of intermediate versions. File restoration for COW is fast when the update size is less than 25% of the file size for large files (e.g., >=256KB for COW1). We therefore suggest a general rule for deciding how COW should be applied:

• For small files (<64KB), do not use COW.

• For large files (>=4MB), COW is always the better choice.

• For medium-sized files (between 64KB and 4MB), COW can be applied conditionally. For example, when the update size is less than 75% of the file size, we can use COW to improve the performance of version creation and reduce network traffic during replication.

Chapter 6 Conclusions and Future Work

6.1 Conclusions

Delta compression is widely used in versioned object storage and transmission. However, none of the previous work integrates delta compression with version creation and transmission in a distributed file system. This thesis integrates a copy-on-write mechanism with a peer-to-peer file system called Mammoth. The new system is called Mammoth+.

Mammoth+ uses forward delta chains to store the version chain. It creates a new version after each modification session of a file, using an in-core bitmap to record the blocks changed during the session. After the modification session, the system stores the changed blocks of data and the changed block list on disk: an ifile stores the changed block list and a dfile stores the changed data in a shadow directory.

In both versions of Mammoth, a file can be replicated on several replica nodes with different policies (e.g., replication at a certain time interval). It can also be cached on several interested nodes. If a previous version (the reference version) is present on the replica nodes or interested nodes, Mammoth+ only transmits the blocks of data changed between the reference version and the target version when updating those nodes. Calculating the changed blocks only involves the ifiles between the reference version and the target version; reading the file data is usually done from the current version of the file. Since the system supports the movement of file ownership, the changed block list can be calculated cooperatively by consulting previous owners.

Performance results show that Mammoth+ outperforms Mammoth for large files when versioning and replicating files. For medium-sized files, Mammoth+ can achieve good performance when the update size is smaller than the file size. For files with a large interest set or a large replica set, the overall improvement is dramatic.

6.2 Future Work

At the current stage, Mammoth+ simply applies COW to every file. As indicated in Section 5.5, COW should not be applied to small files (<64KB), could be applied conditionally to medium-sized files (64KB to 4MB), and is desirable for large files (>=4MB). One possible enhancement to Mammoth+ is to differentiate files by their initial size, and only apply COW to large and medium-sized files.
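A sketch of such a size-based policy is shown below; the function name is hypothetical, and the thresholds and the 75% rule for medium-sized files are taken from the rule of thumb in Section 5.5:

    /* Rule of thumb from Section 5.5: skip COW for small files, always use it
     * for large files, and use it for medium-sized files only when the
     * expected update size is below 75% of the file size. */
    int should_use_cow(long file_size, long expected_update_size)
    {
        const long KB = 1024, MB = 1024 * KB;

        if (file_size < 64 * KB)
            return 0;                            /* small: plain versioning */
        if (file_size >= 4 * MB)
            return 1;                            /* large: always COW */
        return expected_update_size < (file_size / 4) * 3;  /* medium-sized */
    }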
Since a file's size may change over time, and COW degrades file versioning and replication performance when the update size is more than 75 percent of the file size, the above approach may miss opportunities to improve performance. It would be interesting to develop an adaptation mechanism, or to support user-defined policies, to switch between COW and non-COW. For example, the system might turn off COW when the update size is more than 75 percent of the file size for medium-sized files.

Other opportunities exist to fine-tune the performance of Mammoth+. As an example, there is no need to create a dfile and an ifile for an append-only file; only the file size of each version is needed for replication and restoration. As another example, when requesting remote nodes to compute the changed blocks between two versions, the version chain is currently divided randomly (Section 4.3.2). It is possible to divide it into the fewest chunks so that the fewest nodes are contacted. The nodes could also be chosen based on their network connectivity or workload.

Bibliography

[1] Dmitry Brodsky, Alex Brodsky, Jody Pomkoski, Shihao Gong, Michael J. Feeley, and Norman C. Hutchinson. Using File-Grain Connectivity to Implement a Peer-to-peer File System. In International Workshop on Reliable Peer-to-peer Distributed Systems, Osaka, Japan, October 2002.

[2] R. Burns and D. Long. Efficient Distributed Backup and Restore with Delta Compression. In Workshop on I/O in Parallel and Distributed Systems (IOPADS), 1997.

[3] Brent Callaghan. NFS Illustrated. Addison Wesley Longman, 1999.

[4] John B. Carter, John K. Bennett, and Willy Zwaenepoel. Implementation and Performance of Munin. In 13th ACM Symposium on Operating Systems Principles, October 1991.

[5] Robert Fitzgerald and Richard F. Rashid. The Integration of Virtual Memory Management and Interprocess Communication in Accent. ACM Transactions on Computer Systems, 4(2):147-177, May 1986.

[6] C. W. Fraser and E. W. Myers. An Editor for Revision Control. ACM Transactions on Programming Languages and Systems, 9(2):277-295, April 1987.

[7] Russell J. Green, Alasdair C. Baird, and J. Christopher Davies. Designing a Fast, On-line Backup System for a Log-structured File System. Digital Technical Journal, 8(2):32-45, October 1996.

[8] James J. Hunt, Kiem-Phong Vo, and Walter F. Tichy. Delta Algorithms: An Empirical Analysis. ACM Transactions on Software Engineering and Methodology, 7(2):158-191, April 1998.

[9] Jeffrey Katcher. PostMark: A New File System Benchmark. Technical Report TR 3022, Network Appliance Inc., October 1997.

[10] Pete Keleher, Alan L. Cox, Sandhya Dwarkadas, and Willy Zwaenepoel. TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems. In Proceedings of the USENIX Winter Conference, January 1994.

[11] Josh MacDonald. Versioned File Archiving, Compression, and Distribution. World Wide Web, http://www.cs.berkeley.edu/~jmacd/.

[12] Josh MacDonald. File System Support for Delta Compression. Master's thesis, University of California at Berkeley, 2000.

[13] Athicha Muthitacharoen, Benjie Chen, and David Mazieres. A Low-bandwidth Network File System. In 18th ACM Symposium on Operating Systems Principles, October 2001.

[14] John K. Ousterhout, Hervé Da Costa, David Harrison, John A. Kunze, Mike Kupfer, and James G. Thompson. A Trace-Driven Analysis of the UNIX 4.2 BSD File System. In 10th ACM Symposium on Operating Systems Principles, pages 15-24, Orcas Island, WA, USA, December 1985.
[15] John D. Polstra. Announcing CVSup 16.1f. World Wide Web, ftp://ftp.freebsd.org/pub/FreeBSD/development/CVSup/Announce.

[16] Jody Pomkoski. Inter-server Communication in the Mammoth File System. Master's thesis, University of British Columbia, November 2002.

[17] Sean Quinlan. A Cached WORM File System. Software - Practice and Experience, 21(12):1289-1299, December 1991.

[18] Richard Rashid, Avadis Tevanian, Jr., Michael Young, David Golub, Robert Baron, David Black, William Bolosky, and Jonathan Chew. Machine-Independent Virtual Memory Management for Paged Uniprocessor and Multiprocessor Architectures. In 2nd ACM Symposium on Architectural Support for Programming Languages and Operating Systems, October 1987.

[19] Marc J. Rochkind. The Source Code Control System. IEEE Transactions on Software Engineering, SE-1(4):364-370, December 1975.

[20] Douglas S. Santry, Michael J. Feeley, Norman C. Hutchinson, Alistair C. Veitch, Ross W. Carton, and Jacob Ofir. Deciding When to Forget in the Elephant File System. In 17th ACM Symposium on Operating Systems Principles, December 1999.

[21] Dennis G. Severance and Guy M. Lohman. Differential Files: Their Application to the Maintenance of Large Databases. ACM Transactions on Database Systems, 1(2):256-267, September 1976.

[22] Walter F. Tichy. RCS - A System for Version Control. Software - Practice and Experience, 15(7):637-654, July 1985.

[23] Andrew Tridgell. Efficient Algorithms for Sorting and Synchronization. PhD thesis, Australian National University, April 2000.

[24] Arthur van Hoff, John Giannandrea, Mark Hapner, Steve Carter, and Milo Medin. The HTTP Distribution and Replication Protocol. Technical report, World Wide Web Consortium, August 1997.

[25] Joost S. M. Verhofstad. Recovery Techniques for Database Systems. Computing Surveys, 10(2):167-195, June 1978.

[26] Werner Vogels. File System Usage in Windows NT 4.0. In 17th ACM Symposium on Operating Systems Principles, pages 12-15, Kiawah Island Resort, Charleston, South Carolina, December 1999.