RadFS - Virtualizing Filesystems

by

Anoop Karollil

B.Tech Computer Science and Engineering, Cochin University of Science and Technology, 2004

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE STUDIES
(Computer Science)

The University of British Columbia
(Vancouver)

October 2008

© Anoop Karollil

Abstract

Efficient disk space usage and quick deployment are important storage mechanism requirements in a virtualized environment. In a virtual machine farm, there is good potential for saving disk space as virtual machines are based off filesystem images that usually have a lot in common. Deploying virtual machines on demand should also be as quick as possible - close to or faster than the time taken in powering on and booting up a real machine.

We introduce RadFS, a shared root filesystem for virtual machines based on copy-on-write semantics. Creating a new root filesystem for deploying a virtual machine is instantaneous, with the ability to base the new filesystem on the clone of a previous filesystem or a snapshot of an existing filesystem. Disk space usage efficiency is guaranteed initially by the COW semantics and later by transparent filesystem 'merges' based on content hashing. Thus any file that is identical among the filesystems of the various virtual machines hosted off our filesystem will be shared, which leads to large savings in disk space with increased provisioning of virtual machines running similar operating systems or applications that use similar sets of files. Live instantaneous snapshots of existing filesystems also ensure easy backups or new bases for other virtual machines.

The filesystem or namespace virtualization is implemented in user space using FUSE and backed by a content hashing module to take care of merges.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
  1.1 A quick introduction to FUSE
2 Related Work
  2.1 qcow
  2.2 Unionfs
  2.3 Parallax
  2.4 Ventana
  2.5 Content hash related work
    2.5.1 Microsoft Single Instance Store (SIS)
    2.5.2 Venti
3 Design
  3.1 Architecture
  3.2 Radix tree metadata structure
  3.3 Persistence
  3.4 Content hashing
  3.5 Command Line Interface
    3.5.1 Use Case
4 Implementation
  4.1 RadFS FUSE application
    4.1.1 Virtual filesystem branches
    4.1.2 Virtual paths
    4.1.3 RadFS FUSE call-backs and metadata operations
  4.2 RadFS metadata server
    4.2.1 Radix tree
    4.2.2 Radix tree operations
    4.2.3 Persistence
    4.2.4 Snapshots
  4.3 RadFS Hash Server
5 Evaluation
  5.1 Test environment
  5.2 POSIX compliance using pjd
  5.3 Throughput and metadata operation benchmark using Bonnie++
  5.4 RadFS filesystem scaling
6 Conclusions
  6.0.1 Future Work
Bibliography

List of Tables

4.1 FUSE call-backs and associated RadFS operations

List of Figures

1.1 FUSE working
3.1 RadFS architecture
3.2 Radix tree metadata structure
4.1 Virtual paths mapping to ondisk paths
4.2 rtree-search for /home/john/john.doc
4.3 rtree-insert for /usr/bin/john-app
4.4 rtree-delete on read-only and read-write nodes
5.1 RadFS throughput
5.2 RadFS metadata operation benchmark
5.3 RadFS scalability

Acknowledgments

I am very thankful to my supervisors Mike Feeley and Andrew Warfield for their patient guidance and solid support. I would also like to thank Norm Hutchinson for graciously accepting to be my second reader.
Thanks as well to the people I worked with at the DSG lab for their humor and genial camaraderie.

Chapter 1
Introduction

OS virtualization has moved from something to be tried out when there is spare time to fiddle with one's cluster or server farm to something essential, because the benefits are too great to ignore. Virtualization adoption had been predicted to grow rapidly, and with the advent of hardware virtualization support, this growth is predicted to be even quicker in the next 4 years. The growth in virtualization also brings a corresponding growth in disk storage requirements for the growing number of virtual machines per farm/cluster. These requirements should be met not just by adding more storage, but also by exploiting the nuances of virtualization and the opportunities it presents for efficient disk space usage and ease in provisioning virtual machines.

Virtual machines (VMs) boot off filesystem images. In the Xen Virtual Machine Monitor, a privileged virtual machine (Dom0) hosts the filesystems for possibly several non-privileged virtual machines (DomU) as flat file images, physical disk partitions, LVM volumes or NFS export points, and provisions virtual machines based off them. In a given server farm or cluster, it is highly likely that the images used for deploying virtual machines by a VMM have the same base files; they usually store the same operating system distribution with the associated applications to ease administration. Hence a copy-on-write approach, where there is initially a single read-only copy shared by all the virtual machines, with modified files getting copied out for each VM, is intuitively a method for efficient disk space usage. Another requirement for filesystems in general is data backup through snapshots. A copy-on-write approach is very conducive to creating snapshots, as a filesystem can be marked copy-on-write at a particular point, which effectively creates a snapshot of the filesystem at that point. Added to this is the possibility of using any of those snapshots as a base for a new filesystem deployment.

Copy-on-write filesystems are useful when there is common data (OS/application executables, packages, etc.) to be shared, as COW gives 'free' read sharing. The existence of COW in a filesystem also leads to easy creation of snapshots, which is a necessary requirement for most filesystems. The approaches, though different at the level of implementation, basically involve copying a file (or a file block) to a different location and then linking the copied file into the filesystem transparently so that any future accesses to the file are routed to the copy. The level of implementation for this copying on modification or write can be at the file block level or at the whole file level. The former provides finer per-block granularity and hence more control over the copy-on-write semantics, but with the overhead of more complex metadata handling. The latter, at the higher level of whole file copy-on-write, provides less control but with lower metadata handling complexity.

Virtualizing filesystems using RadFS caters to the requirements for disk space usage efficiency and quick, easy deployment of filesystems for provisioning virtual machines. By using copy-on-write semantics at the whole file level, common files in a root (bootable) filesystem are shared among all virtual machines. This provides obvious savings in the initial deployment phase of virtual machines, where they start off with virtual filesystems based on the same disk image.
But given a deployment environment, like a cluster or an office, there is also a high probability that a sizeable number of newly created files - for example software packages, updated system files or even PDF documents - in each virtual machine are identical to each other even after individual copy-on-write or 'create' operations. To provide continuing disk space usage efficiency, a content hashing module keeps track of identical files among the different filesystems and maintains a single copy that is shared even after a copy-on-write or create operation. By virtualizing the filesystem namespace and with the COW mechanism, very fast snapshots and new filesystem deployments are also made possible. New filesystems can be based on the initial base filesystem, on one of the branches spawned off the base filesystem, or even on a snapshot.

The goals of disk space saving and instantaneous snapshots/virtual filesystem deployment are achieved by virtualizing the namespace using FUSE (Filesystem in Userspace [1]) and then exporting the virtualized filesystems for Xen DomUs using NFS. FUSE filesystems, though implemented in userspace, perform as well as 'in-kernel' filesystems like ext3, with proper optimizations [14]. FUSE provides the means to multiplex requests from different DomUs to a single shared root filesystem, with the guarantee that the base root filesystem remains unchanged and updates/modifications are redirected to each virtual filesystem's writeable 'branch' on disk. This copy-on-write operation is the basic underlying mechanism to provide namespace virtualization. The namespace is virtual and comes into true existence, backed by writable files, only when written to. Virtualizing a filesystem namespace involves not only the virtualization of disk locations and filesystem attributes for files and directories but also the mechanisms used to access these attributes. FUSE provides the interface for facilitating this virtualization.

1.1 A quick introduction to FUSE

The RadFS virtual filesystem is based on FUSE. FUSE consists of a kernel module and a library which is used by a userspace application handling a virtual filesystem to call and get called by the FUSE kernel module. FUSE lets a user mount the virtual filesystem, after which any access to the mount point is routed by the FUSE kernel module to the userspace application through the FUSE library. The userspace application can then implement its own file operation methods or reroute the requests to an existing filesystem. The latter approach avoids implementing all the 'heavy lifting' mechanisms needed in a traditional filesystem and is what is used in building RadFS. RadFS is the userspace application that provides the COW, snapshot and content hashing mechanisms for virtual filesystems, which are backed by an ext3 filesystem or any other filesystem supported by the Linux kernel Virtual Filesystem (VFS).

Figure 1.1 shows the flow of control for a typical file operation. In the case of RadFS, an open filesystem call to a file on a virtual filesystem gets routed to the RadFS application, which might perform a copy-on-write operation or initiate content hashing, and then re-issues the open to the backing (ext3) filesystem. The file descriptor thus obtained is returned back to the user application that did the initial open request.

[Figure 1.1: FUSE working]

Chapter 2
Related Work

RadFS provides quickly deployable virtualized filesystems that guarantee savings in disk space and instantaneous snapshots, using a COW mechanism and content hashing.
It has been built specifically with virtual machines in mind. A number of COW filesystems with snapshot and disk space saving features are prevalent in the field, and a few of them were built specifically for virtual machines. This chapter tries to cover related virtual machine filesystem work having COW or content hashing mechanisms. The main difference between RadFS and most other related filesystems is the level in the filesystem at which the COW/content hashing takes place: RadFS works at the whole file level, while most other filesystems work at the file block level. Another common difference is the backing store used by the virtual machine filesystems, which varies from real filesystems to filesystem image files.

2.1 qcow

qcow is a filesystem image format used by the QEMU [5] processor emulator as well as the Xen VMM as a virtual machine filesystem. Root filesystem images are whole root filesystems stored in a single file that can be used by virtual machines. The filesystem image is a representation of a fixed size block device which is exported to the virtual machine by the emulator or VMM using a loop device (blktap [16] in Xen). Files modified by the virtual machine are modified in place in the file image. The qcow filesystem image format [4] has the advantage that it supports sparseness even when the underlying filesystem doesn't support sparseness, and it also has COW support. Coupled together, these can be used to quickly deploy filesystems based on other qcow images. The new image will use very little disk space, as it has 'read-only' pointers to the original image, with writes to files leading to modifications in the new image file. A problem with this approach is that the backing store is an image file, which isn't as efficient as a regular filesystem backed store, especially in disk space reclamation on file deletes, which needs special handling.

2.2 Unionfs

Unionfs [7] had a lot of influence on the design of the COW mechanism in RadFS, and so there are several similarities between the two in the way they create virtual filesystems. Unionfs merges different filesystems called branches and creates a virtual filesystem formed of the union of the two or more filesystems. Directories with the same paths in the individual filesystems are represented as a single directory in the virtual filesystem, with the contents from the individual filesystem directories merged. Each base filesystem is identified by a branch id that is used to specify priorities in choosing files from a particular branch when there are files with identical paths in more than one branch. COW is achieved in Unionfs by using two branches and marking one of the branches, possibly the one with data, as read-only and the other, possibly empty, branch as read-write. Initially all read requests are satisfied by the read-only branch. Writes/modifications to files lead to a copying over of the modified file to the read-write branch, and further requests for the file are serviced from the read-write branch, by giving the read-write branch id a higher priority.

Since the virtual filesystem is a union of two or more real filesystems, to maintain consistent filesystem semantics of file operations in the virtual filesystem, Unionfs needs to deal with the problem of handling the various branches of the union as a single entity. Errors in unlinking files in a particular branch are propagated to the user only if the unlinking fails in the highest priority branch.
That is, if there is an unlink error in one of the lower priority read-write branches, the operation still succeeds if the unlink is successful in the highest priority branch. Relating this to COW, deleting a file in the union should lead to the file in a read-write branch being deleted if present, while it shouldn't be unlinked from a lower priority read-only branch. So Unionfs needs to mask a lower priority read-only file that has been deleted from a higher priority branch, and it does so by creating high priority files called white-outs, each of which signifies the absence of a particular file from a particular high priority read-write branch. So in the case where a read-only file is deleted from a union of read-only and read-write branches, a white-out is created for the particular file in the read-write branch (after deletion of the file from the read-write branch, if present). Further file operations like stat, readdir and open on the union check for the presence of a white-out for the file being accessed and return a non-existence error if a white-out is found. So even if the file is present in lower priority branches, the white-out masks its presence from the union.

One problem with this approach is that white-outs pollute the filesystem namespace. Each deleted file that has a read-only counterpart in another branch needs to have a white-out in the read-write branch. Another problem is that of creating white-outs recursively. To handle the case of a read-only directory being deleted and then recreated, white-outs have to be created for all the files and subdirectories in the directory when it is deleted, to prevent them from showing up in the newly created empty directory. The basic problem is the absence of a dedicated metadata structure. This has been remedied in a version of Unionfs that has a metadata structure called On Disk Format (ODF) [6]. The ODF is a regular filesystem on its own within the kernel that keeps track of white-outs and also does caching.

2.3 Parallax

Parallax [11] is a distributed storage system specifically built for Xen virtual machines. Parallax works at the block level on a shared global block store and provides a block device interface called a Virtual Disk Image (VDI) to virtual machines. The VDI has certain similarities to the qcow image: they both split a linear block address into levels that are used to then look up the block, they both support sparseness wherein a block is allocated only when written to, and they both use their support for sparseness to provide fast snapshots and COW features. Parallax has a distributed architecture. Multiple physical hosts (physical machines with a Virtual Machine Monitor and multiple virtual machines), each running their own Parallax server in a storage virtual machine, access the common block store. The block store is divided into extents that each Parallax server locks to provide storage for its virtual machines. Each virtual machine also locks a VDI and associated data blocks for exclusive in-place writes. RadFS isn't distributed, as there is only one instance of the RadFS metadata server that runs on a physical host. Though this might decrease the availability of virtual filesystems served by RadFS, it increases opportunities for easy sharing and more savings in disk space.

Parallax uses a radix tree look-up mechanism in addressing blocks, which RadFS borrows and which is the crux of the COW mechanism.
While Parallax splits the bits in a block's linear address to do its look-up of a block, RadFS uses path segments within a disk path to look up a particular file or directory. Both have read-only pointers; they are to blocks in the case of Parallax and to whole files in the case of RadFS. Comparing Parallax with RadFS should be enlightening on the benefits and drawbacks of implementation at the block and whole-file levels respectively, once sufficient optimizations to RadFS have been done.

2.4 Ventana

Ventana [12] is an object store based, distributed, virtualization-aware filesystem which aims to combine file-based storage and the sharing benefits of a distributed filesystem with the versioning, mobility and access control features of virtual machine disk images. The motivation behind Ventana is to provide read/write sharing between multiple users, to track files in various virtual disks that are related in a cohesive manner, and to provide finer than whole-disk rollbacks in virtual disks. Like RadFS, Ventana permits spawning virtual filesystems based on existing virtual filesystems recursively, in a hierarchical manner. To meet these requirements, Ventana provides abstractions called branches and views. A branch is a particular version of a file tree, and can be private or shared. The concept of a branch in Ventana and RadFS is more or less the same, except for the scope. For example, a branch in Ventana could be a version of the /usr subdirectory or of a whole root filesystem, whereas a branch in RadFS is always a version of a root filesystem. Ventana splits up a root filesystem into file trees to aid in sharing and to deal with security/maintenance problems when using virtual disks [9]. In Ventana, a particular directory can be shared read-write while the rest of the filesystem is read-only by having a filesystem 'view', with the directory tree as a 'shared' branch while the root filesystem is a private branch. This is akin to having a network share mount point that is read-write within the root filesystem.

The problem of applying patches or security updates to all virtual machine disks, even those not currently being used, is solved by having access to the Ventana common store, which has the normal abstractions of files and directories, or branches used by each virtual machine, that can be read by normal malware scanning and backup tools. This is the case with RadFS too; the backing store is made of ext3 filesystem directories that can be read and written independent of RadFS while keeping the virtual filesystems consistent.

Ventana also has access control at various levels of abstraction: there are normal file, file version and branch ACLs. These are mainly to prevent security leaks arising from the forking of virtual filesystems from existing virtual filesystems. Security updates and file access controls should transcend file versions in the various virtual filesystems. Security updates can be easily applied to a file-directory oriented common store, but file access controls transcending file versions require ACLs that apply to the current as well as prior versions of a file. Version access control is future work in RadFS, but given the design of the RadFS metadata tree, it should be trivial: metadata nodes for versions of a particular file are linked together in the RadFS metadata tree.

Ventana also provides the user with the capability of viewing all older versions of a particular file. In the case of virtual disks, this is tedious, as each version of the virtual disk is a separate unit that needs to be mounted.
Given the file-directory oriented back store and the linked list of metadata nodes for multiple versions of a file or directory, it would again be quite trivial for RadFS to have a utility that scours the store branches to find all versions of a particular file.

2.5 Content hash related work

RadFS uses a content hashing mechanism to keep disk space savings consistent even if individual virtual machine filesystems diverge from a common base. This section compares RadFS with other content hashing systems.

2.5.1 Microsoft Single Instance Store (SIS)

Microsoft's Single Instance Store [3] is a content hashing system for Windows Storage Servers that provides savings in disk space by consolidating duplicate files into a common store. It consists of a user level service that generates content hashes of files and compares them with a database of content hashes it maintains. The hashing service then reports duplicate files with identical content hashes to a kernel level filesystem driver, which copies the file to the common store if it is not already present, or replaces it with a link to an identical file in the common store. The kernel driver redirects reads to the common store, and writes are handled by a copy-on-close mechanism as opposed to COW. The copy-on-close (COC) approach is more efficient than COW, as only those portions of the file that haven't been overwritten are copied over from the common store. So all writes to a file happen without the initial copying over as in COW, and then on close, only the portions of the file that haven't been written to are copied over. RadFS currently doesn't have a COC mechanism, but it is one of the future optimizations planned.

SIS uses a 128 bit signature to check for duplicate files. The first 64 bits denote the size of the file while the remaining 64 bits contain the hash of two 4KB blocks from the middle of the file. If these match, then SIS does a full binary comparison. RadFS checks the identity of files using SHA-1 hashes and assumes that the hashing is collision free. Quinlan & Dorward [13] also use SHA-1 to generate hashes for 8KB blocks, and their analysis shows that the probability of hash collision when there are approximately 10^14 blocks is 10^-20. Since RadFS uses SHA-1 hash digests at the file level, it can be assumed that the SHA-1 hashes are collision free.

2.5.2 Venti

Venti [13] is a content addressable storage system at the block level. Venti provides an interface to store and retrieve blocks from a common store based on a SHA-1 hash of block content, called a fingerprint. It doesn't provide the services of a filesystem but provides the infrastructure for an archival filesystem that can be built around it by applications. Rather than blocks being addressed by LBA, fingerprints of blocks, including recursive 'fingerprinting' of blocks that themselves hold fingerprints of other blocks, help to build an addressing system that lets an application address blocks as on a regular disk. This provides the benefits of duplicate block coalescing, built-in block integrity checks, and immutability of blocks and their addresses, which are quite useful in archival systems where data needs to be retained for a long time without major changes. Coalescing of duplicate blocks is dependent on block sizes as well as alignments of the blocks in a file, and makes disk space savings complicated if not unpredictable.
Here too, a comparison of content hashing at the block level and the file level involves a trade-off between complexity and better control over the management of copy-on-write or coalescing semantics. A byte modified in a large file leads to only a few new blocks getting allocated in the case of content hashes at the block level. The whole file is copied over (this overhead can be limited using copy-on-close) in the case of content hashing at the file level, but with the advantage of having an existing high performance block addressable filesystem take care of the complexity in addressing, caching, scheduling, etc.

Chapter 3
Design

The goals of this thesis are to provide a filesystem for virtual machines that

• saves disk space,
• is quickly deployable,
• allows snapshots and recursive deployments.

In a virtualized environment with a server hosting multiple virtual machines, it is common practice to spawn multiple virtual machines using copies of the same virtual disk. Thus disk space savings can be achieved by exploiting the fact that virtual machines on a server/cluster have a high probability of sharing the same operating system distribution with common system and application packages. This commonality of files continues even as updated packages or new applications are installed, and the opportunities for exploiting this commonality grow as the number of virtual machines deployed increases. Thus sharing is the key to saving disk space, and a filesystem should provide the ability to share files without major side effects on performance or resource usage.

Virtual machine deployment time should mirror real hardware; it should be possible to start a virtual machine within the same time frame that it takes to boot up a real machine, maybe faster. Ideally this should be the case even for dynamic virtual machine deployments where a sudden requirement necessitates starting up multiple virtual machines in as short a time as possible. And ideally, this should be possible without pre-allocated virtual machine disks. The main deterrent to quick deployment of virtual machines is the time taken to provision a filesystem or virtual disk that the virtual machine can boot from. Thus a filesystem for virtual machines should also be very quickly provisioned.

It is usual practice when using virtualization to build a virtual disk or virtual filesystem based on a particular OS and distribution, with a certain set of applications pre-installed, for use by various users. These are then copied whenever needed to create new virtual disks, perhaps with other modifications made to them. Hence the ability to base certain virtual disks/filesystems on previous disks or filesystems is quite useful in virtualization, and this ability goes hand in hand with snapshots, as a snapshot makes a virtual disk immutable. Thus a snapshot sets the point in a virtual filesystem/disk which is interesting (for example a new kernel or a service pack install) for use as a base for other virtual disks/filesystems. Thus snapshots are the means to recursive deployment of virtual disks and also provide a much needed backup mechanism for filesystems in general.

To meet these needs, RadFS is designed to use a shared root filesystem, with virtual machines booting off virtual filesystems that are based on a single root filesystem and use copy-on-write techniques for write sharing. Copy-on-write makes it easy to provide the ability to create snapshots.
By marking a particular virtual filesystem read-only, any further modification requests are redirected to a new ondisk location, leaving the original virtual filesystem untouched. This also satisfies the requirement of quick deployment, either from the base filesystem image, which is already read-only, or from existing active virtual filesystems, by creating snapshots of them. And since all virtual machines start with virtual filesystems based on a single base filesystem, there are obvious space savings. Thus a COW mechanism is definitely the crux of the design that lets RadFS achieve the goals mentioned above.

3.1 Architecture

RadFS is based on FUSE. FUSE provides the interface to build a virtual filesystem based on an existing filesystem, which is ext3 in the case of RadFS. RadFS consists of a FUSE application, a radix tree oriented metadata server and a content hashing server. Figure 3.1 describes the high level architecture of RadFS.

[Figure 3.1: RadFS architecture - virtual filesystems /mnt/john-vfs and /mnt/jane-vfs backed by ondisk branch directories]

The RadFS FUSE application uses the FUSE library to intercept filesystem calls to the virtual filesystem it is associated with (see Section 1.1). The virtual filesystem, which is based on a read-only branch (a directory populated with a root filesystem and tagged read-only, /branch/0 in Figure 3.1), represents a virtual disk which a Xen virtual machine (VM) can boot off.

Each virtual disk is associated with a RadFS FUSE application that handles requests to it and relays each request either to the read-only branch or to a read-write branch (a directory populated with files copied over or newly created, /branch/1 and /branch/2 in Figure 3.1), with a copy-on-write operation if necessary. RadFS FUSE processes filesystem requests after getting information about the current file/directory from the RadFS metadata server. The information includes the read-only/read-write status, the disk path of the file or directory, the file attributes for regular files, and other information.

The metadata server responds to requests for look-ups, insertions, deletions and other operations that are necessary to keep track of changes to the different virtual filesystems. Each virtual filesystem that is initialized using the RadFS FUSE application gets a branch id that uniquely identifies it to the metadata server. This branch id changes only when a virtual filesystem is snapshotted, when a new branch id is assigned to it, with the old branch id denoting the now read-only snapshot.

The metadata server also forwards requests for content hashing from the RadFS FUSE application to the RadFS hash server. The hash server's only purpose is to generate content hashes of files and update the read-write branch and the common store. Thus it is more or less independent of the RadFS FUSE application and the metadata server, and acts as a sink for asynchronous content hashing requests.

The RadFS FUSE application maintains the actual files in the backing ext3 filesystem. Each virtual filesystem is made up of a base read-only root filesystem ext3 directory and a read-write ext3 directory that starts off empty but stores copies of files on modification or creation. The RadFS metadata server stores the metadata necessary to merge the backing ext3 filesystem directories to create the virtual filesystem.
This includes a representation for virtual filesystem directories that combines the contents of the directories from the read-only and read-write ext3 directories, keeping track of files that are deleted by using metadata 'white-out' nodes, and other information needed for sharing data among the various virtual filesystems using content hashing and copy-on-write.

The design of the RadFS FUSE application is more or less laid out by the FUSE library call-back interface. FUSE gives the base for virtualization, and virtualizing a filesystem mostly involves handling filesystem metadata and the methods to manipulate that metadata to provide features like copy-on-write and content hashing. Thus the design of the metadata server revolves around the core metadata structure that needs to handle multiple filesystems with COW semantics. Persistence of this metadata is also a crucial part of the metadata server, as in the case of virtual filesystems it is the metadata that makes the filesystem. The RadFS hash server is more or less a separate entity whose content hashing mechanism design isn't influenced by either the RadFS FUSE or RadFS metadata server applications, as the interface to it can be narrow and one way (see Figure 3.1) without affecting its functionality. The content hashing provides another level of sharing, but it is again built on the copy-on-write mechanism and so is an easy extension to the design. The remainder of this chapter expands on the design of the metadata structure used to support COW for multiple virtual filesystems, persistence, the content hash mechanism and a CLI user interface to RadFS.

3.2 Radix tree metadata structure

A radix tree metadata structure is the core of RadFS and it is maintained by the RadFS metadata server. The metadata structure maintains information about the various virtual filesystem branches that are being served, and the server handles requests for path look-ups, insertions, deletions and other operations needed to maintain the state of each virtual filesystem namespace. For example, the most basic operation performed by the RadFS FUSE application is routing file requests to the read-only or read-write ondisk branches. To do this, it queries the radix tree metadata server, giving it a virtual filesystem path and a branch id, to get the ondisk path of the file/directory needed. Similarly, most queries to the metadata server are made up of a virtual path and a branch id.

Each node in the tree represents a file or directory, with the leaf nodes in the tree representing files and the other nodes representing directories. When the server is started, it builds the base read-only branch (the 'gold master' branch or branch 0) from a directory containing a root filesystem. This 'gold' branch forms the initial template for other branches. Any existing branch can be used as a template for new branches if it is read-only, which it can be made to be by creating a snapshot of it.

[Figure 3.2: Radix tree metadata structure - a base read-only branch and a read-write branch, with child, parent, sibling and branch links]

This provides for recursive deployment of virtual filesystems based on previous virtual filesystems.

Each node in the tree has parent, child, sibling and branch links to other nodes. The parent link of a node links it to the node representing the directory that it is contained in. The only node that doesn't have a parent link is the root node. The child link links a directory node to one sub-directory or file contained in the directory.
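The linkage just described can be pictured as a small C structure. The sketch below is illustrative only; the field names and types are hypothetical and are not taken from the RadFS sources:

    #include <sys/stat.h>

    /* Hypothetical sketch of a RadFS metadata node; names are illustrative. */
    struct rtree_node {
        char              *name;        /* path segment, e.g. "home" or "john.doc"   */
        int                branch_id;   /* read-write branch that owns this node     */
        char              *disk_path;   /* ondisk location backing this file/dir     */
        int                is_whiteout; /* masks a deleted read-only file            */
        struct stat        attrs;       /* uid/gid/mode/times kept for regular files */
        struct rtree_node *parent;      /* containing directory                      */
        struct rtree_node *child;       /* head of the list of directory entries     */
        struct rtree_node *sib_next;    /* doubly linked sibling list                */
        struct rtree_node *sib_prev;
        struct rtree_node *branch_next; /* doubly linked list of nodes with the same */
        struct rtree_node *branch_prev; /* virtual path in other branches            */
        struct rtree_node *hlink_next;  /* 'hard link' list (see Section 4.1.3)      */
    };

The sibling and branch pointers form the doubly linked lists described next.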
The other sub-directories and files are accessed through the sibling links. The sibling links are used to produce a doubly linked list of the files and sub-directories of a directory, the head of the list being pointed to by the child link of the parent directory node. The branch links link nodes in the different branches that have the same virtual path. In Figure 3.1, the nodes representing /home in branches 0, 1 and 2 would be linked together by branch links to form a doubly linked list. Figure 3.2 shows part of a metadata radix tree with a base read-only branch and a single branch based on the read-only branch, and shows the parent, child, sibling and branch linkages.

Each node in the tree has an associated branch ID which identifies it as part of the read-write branch of the virtual filesystem with that branch ID. To all other virtual filesystems, a branch ID different from their own branch ID signifies that the node representing the file or directory is read-only. Thus the branch ID provides the basis for the copy-on-write mechanism involving multiple virtual filesystems.

3.3 Persistence

The virtual filesystems spawned should remain consistent in the event of hardware failure or a reboot/crash of the host machine (Dom0) running the metadata server. Since RadFS virtual filesystems are implemented at the file level and backed by ext3, consistency at the block level is guaranteed by the backing filesystem, but the virtual namespaces of each virtual filesystem maintained by the RadFS metadata server need to be kept consistent.

Each virtual filesystem is based on a common ondisk read-only branch and its own ondisk read-write branch, which stores modified (copied on write) and newly created files and the associated directory structure. Hence it is possible to build the virtual filesystem back up from a union of the branches on disk, with special handling of delete and rename operations, as they don't reflect on the read-only branch. But to keep the metadata recovery independent of ondisk data, and to also meet the requirement of keeping track of additional metadata like file attributes, hard link back pointers, etc., RadFS logs all metadata operations that modify the metadata radix tree. It also writes out the whole metadata tree to disk when the operations log reaches a specific size, after which the log is reset. The ondisk metadata tree, along with the log of metadata operations performed after the tree was written to disk, is enough to rebuild the metadata tree structure to a consistent state.

3.4 Content hashing

In RadFS, disk space usage efficiency is achieved in part by the shared root filesystem approach, wherein virtual machines share a common base root filesystem using a copy-on-write mechanism. But with time and use, the virtual filesystems associated with each virtual machine will show deviations from the base image as they download new packages and install new applications or modify existing files. To counter this, and to keep disk space savings persistent, RadFS implements a content hashing module that ensures continued sharing of data.

The design involves a content hash dump, which is a data store at the file level with each file in the dump being named with the 20 byte SHA-1 hash of its content. Each regular file in RadFS has a content hash generated and is linked to the content hash dump.
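To make the dump mechanism concrete, the sketch below shows one way a file could be hashed with SHA-1 and hard linked into such a dump. RadFS uses the GNU gcrypt library for SHA-1 (see Chapter 4), but the dump layout, helper name and error handling here are assumptions for illustration, not the RadFS hash server code; library initialisation via gcry_check_version is also omitted.

    #include <stdio.h>
    #include <unistd.h>
    #include <gcrypt.h>

    /* Hash 'path' with SHA-1 and share it through '<dump>/<hex digest>'. */
    static int link_into_dump(const char *path, const char *dump)
    {
        unsigned char buf[8192], *digest;
        char hex[41], target[4096];
        size_t n;
        gcry_md_hd_t md;
        FILE *f = fopen(path, "rb");

        if (!f)
            return -1;
        gcry_md_open(&md, GCRY_MD_SHA1, 0);
        while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
            gcry_md_write(md, buf, n);
        fclose(f);

        digest = gcry_md_read(md, GCRY_MD_SHA1);   /* 20 byte SHA-1 digest */
        for (int i = 0; i < 20; i++)
            sprintf(hex + 2 * i, "%02x", digest[i]);
        gcry_md_close(md);

        snprintf(target, sizeof(target), "%s/%s", dump, hex);
        if (access(target, F_OK) != 0)
            return link(path, target);   /* first copy: publish it into the dump */
        /* Identical content already stored: replace the file with a link to it. */
        if (unlink(path) != 0)
            return -1;
        return link(target, path);
    }

The effect is that every file with the same content ends up as another name for the single dump copy.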
Thus if any two files in any of the branches have the same content - be that a Linux kernel image package, a common PDF document received through email, an MP3 file or a popular video clip - they are all linked to a single file in the content hash dump. Each new file created in a virtual filesystem, or modification of an existing file in a virtual filesystem, leads to a content hash generation for the file and a linking to the content hash dump. The content hash generation is handled lazily by the RadFS hash server, whose only purpose is to process requests for generating content hashes for files created or modified by the various virtual filesystems and to manipulate the links between the files in the read-write branches and the content hash dump. The RadFS FUSE application issues a content hashing request to the RadFS metadata server, which processes it and re-issues the request to the RadFS content hash server. The request for a content hash is one way; the RadFS FUSE application pushes the request out and continues processing other filesystem calls. Thus the RadFS hash server acts as a sink for all content hashing requests. See Section 4.3 for more details.

3.5 Command Line Interface

RadFS has a command line interface (CLI) that lets a user interact with the RadFS metadata server and perform the following:

• Start the server - starts the server based on a base root filesystem directory and a directory to use for read-write branches. The latter is also the location where the content hash dump would be built, if required.

• Initialise content hash dump - creates a content hash dump which initially contains the content hashes of the files in the read-only base root filesystem.

• Create a new virtual filesystem - creates a new branch given a base root filesystem (which can be the initial read-only branch or any other subsequent snapshotted read-only branch).

• Mount a created virtual filesystem - mounts a virtual filesystem given a mount point and the filesystem branch id. This starts the RadFS FUSE application that handles filesystem requests to the specified mount point.

• Unmount a virtual filesystem - unmounts a virtual filesystem given the filesystem branch id.

• List virtual filesystems - lists the filesystems being managed by the metadata server. The following is an example listing displayed by the RadFS CLI:

    Branches served (branch-id : tag : mountpoint)
    0 : base-read-only : read-only-never-mounted
    1 : ubuntu-desktop : read-only-never-mounted
    3 : john-ubuntu 17Jul08-10:45:05 : read-only-never-mounted
    5 : kernel-upgrade-2.6.25 02Aug08-15:57:44 : /mnt/john
    4 : jane-ubuntu : /mnt/jane
    2 : ubuntu-server : /mnt/lamp

• Snapshot a virtual filesystem - saves the state of a virtual filesystem at a particular point. This snapshot is a virtual filesystem that is read-only, and it serves as a backup as well as a base for other virtual filesystems.

• Shutdown the server - shuts down the RadFS metadata server and content hash server.

3.5.1 Use Case

The branch listing above could be the result of the following operations performed using the CLI:

1. Start the server using a root filesystem directory (populated using, for example, an Ubuntu 7.04 image). This creates the base-read-only branch with branch id 0.

2. Create a virtual filesystem ubuntu-desktop based on branch 0, mount it, boot off it and install applications like KDE/GNOME, Firefox, Thunderbird, etc.

3. Create a virtual filesystem ubuntu-server based on branch 0, mount it, boot off it and install Apache and MySQL.
4. Snapshot ubuntu-desktop to create a writeable virtual filesystem for user John based on ubuntu-desktop, making the ubuntu-desktop branch with its installed applications read-only.

5. Create a virtual filesystem jane-ubuntu for user Jane based on the now read-only ubuntu-desktop branch.

6. Mount both john-ubuntu and jane-ubuntu.

7. Snapshot john-ubuntu, and then John upgrades his kernel to 2.6.25. After the snapshot operation, branch 3 would be a virtual filesystem which has the changes made by John up to the point where he upgrades his kernel. Branch 5 would be the branch John uses to download the kernel and which reflects other changes made after the snapshot.

Chapter 4
Implementation

RadFS is implemented in C and uses the Filesystem in Userspace (FUSE) library to create its virtual filesystems. It was developed under Ubuntu 8.04 GNU/Linux, with the Xen 3.2 VMM and a Linux 2.6.24 kernel. The privileged virtual machine, Domain0, run by Xen hosts RadFS and creates the virtual filesystems, which are exported via NFS. Virtual machines spawned by Dom0 use NFS boot to boot off the virtual filesystems exported as NFS mount points, with RadFS taking care of data sharing using a copy-on-write mechanism. The RadFS FUSE application communicates with the RadFS metadata server using a custom protocol over Unix sockets. The metadata server talks to the RadFS content hash server also using Unix sockets. RadFS uses the GNU gcrypt library [2] to generate SHA-1 content hashes. The backend for the virtual filesystems, storing their copies of modified files, is an ext3 filesystem. But since FUSE deals with the Linux VFS layer, the backend can be any filesystem that is supported by VFS.

4.1 RadFS FUSE application

The RadFS FUSE application is the virtual filesystem builder that is invoked each time a new virtual filesystem is created. It uses the FUSE library to mount a virtual filesystem and then handles any filesystem call directed to the associated mount point. The FUSE library lets the application register a list of call-backs for the standard set of filesystem calls. Any filesystem call directed to a virtual filesystem mount point leads to the corresponding call-back being executed in the RadFS FUSE application. The RadFS FUSE application then uses the RadFS metadata server to ensure copy-on-write semantics for filesystem operations. It also issues content hash requests and maintains the links from the virtual filesystem to the content hash dump.

4.1.1 Virtual filesystem branches

Each virtual filesystem is based off a read-only base virtual filesystem which is represented in the metadata radix tree by a branch with a specific branch ID. Initially the only read-only filesystem is the branch with branch ID 0, but later on there can be other read-only branches created through snapshots of existing virtual filesystems. Each virtual filesystem also has a read-write branch that stores any files that it creates or modifies from the read-only branch. This read-write branch ID is a handle that is passed to the RadFS metadata server along with every metadata request issued by a particular virtual filesystem. When a virtual filesystem is created/mounted, the RadFS FUSE application in charge sends a request to the metadata server to initialize the read-write branch. This request returns a branch ID that is then used by the RadFS FUSE application as a handle for all further requests to the metadata server.
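The thesis describes these exchanges as a custom protocol over Unix domain sockets but does not give the wire format, so the socket path, opcode and message layout below are purely hypothetical. The sketch only illustrates the request/response pattern of asking the metadata server to initialize a read-write branch and getting a branch ID back:

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    #define RADFS_SOCK     "/var/run/radfs-meta.sock"  /* hypothetical */
    #define OP_INIT_BRANCH 1                           /* hypothetical opcode */

    /* Ask the metadata server for a new read-write branch based on
     * 'base_branch'; return the new branch ID, or -1 on error. */
    static int radfs_init_branch(int base_branch)
    {
        struct sockaddr_un addr;
        int op = OP_INIT_BRANCH, branch_id = -1;
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);

        if (fd < 0)
            return -1;
        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, RADFS_SOCK, sizeof(addr.sun_path) - 1);

        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0 &&
            write(fd, &op, sizeof(op)) == sizeof(op) &&
            write(fd, &base_branch, sizeof(base_branch)) == sizeof(base_branch))
            read(fd, &branch_id, sizeof(branch_id));

        close(fd);
        return branch_id;
    }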
The branch ID given to the FUSE application at the time of virtual filesystem initialization, when associated with a node in the virtual filesystem, indicates that the node is writeable. When a virtual filesystem is initially mounted, only the node representing the root directory '/' for that branch would have the read-write branch ID. This root directory node would be linked to the nodes that correspond to the subdirectories of the root directory, but these initially would be on the read-only virtual filesystem branch that the new filesystem was based on.

4.1.2 Virtual paths

A virtual filesystem mounted by RadFS will be a root filesystem rooted at '/' and populated with the standard system directories etc, home, usr, var, etc. The virtual filesystem mount point thus could be thought of as representing a disk with a bootable filesystem on it. A particular path in the virtual filesystem is translated to an actual ondisk path by RadFS using the metadata tree. A virtual filesystem consists of virtual paths that are identified as read-only or read-write by the branch ID of the node representing the path's target file or directory. At the time of initialization, each virtual filesystem will have just the root directory marked as writeable and backed by a directory ondisk. This directory is then used to store the files and directories that are created or copied over in a COW operation. Initially the root directory listing of a virtual filesystem would show the sub-directories of '/', but they would be backed on disk by a directory associated with the read-only virtual filesystem which was used to build the virtual filesystem. Whenever a file is modified, the file gets copied over from the disk path associated with the read-only branch to the disk path associated with the read-write branch. This copying over also includes creating the directories leading to the file being copied over. File creations are also routed to the read-write disk path. Thus each virtual path in a virtual filesystem is either read-only, with read requests being satisfied from a disk path associated with the read-only branch, or read-write, with the initial write and subsequent reads being satisfied from the disk path associated with the virtual filesystem's read-write branch. There can be more than one read-only ondisk branch associated with a particular virtual filesystem. This is because virtual filesystems can be based on other virtual filesystems that are read-only, and this can happen recursively.

Figure 4.1 shows two virtual filesystems, jane-vfs and john-vfs, with their associated read-write and read-only ondisk backing. Both virtual filesystems are based on the read-only branch with branch id 0. The virtual filesystems start with the ondisk directories branch/1 and branch/2, which act as the store for COW operations. The user of virtual filesystem jane-vfs creates a home directory jane and a file jane-doc. This leads to the directory and file being created in the ondisk directory branch/1. Similarly, in the virtual filesystem john-vfs, the directories john and bin, and the file john-app, are created in the ondisk directory branch/2. The creation of file john-app leads to the creation of the directories usr and usr/bin in branch/2. A request for an unmodified file in branch 0 like /usr/bin/bash gets routed to branch/0 (branch/0/usr/bin/bash), while a request for a modified or newly created file goes to branch/1 (e.g., branch/1/home/jane/jane-doc) in the case of the jane-vfs virtual filesystem and branch/2 (e.g., branch/2/usr/bin/john-app) in the case of john-vfs.
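The virtual-to-ondisk mapping in Figure 4.1 amounts to prefixing the virtual path with the directory of whichever branch backs it. A hedged sketch, assuming the branch directories live under a store root such as /branch (the helper name and layout are illustrative, not the RadFS code):

    #include <stdio.h>

    /* Map a virtual path to its backing ondisk path.  'backing_branch' is the
     * branch the metadata server says currently holds the file: the
     * filesystem's own read-write branch for modified/new files, or a
     * read-only ancestor branch (e.g. 0) for untouched files. */
    static void radfs_disk_path(int backing_branch, const char *virtual_path,
                                char *out, size_t outlen)
    {
        /* e.g. branch 0, "/usr/bin/bash"     -> "/branch/0/usr/bin/bash"
         *      branch 2, "/usr/bin/john-app" -> "/branch/2/usr/bin/john-app" */
        snprintf(out, outlen, "/branch/%d%s", backing_branch, virtual_path);
    }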
This routing is done by RadFS FUSE using the branch ID associated with each path in a filesystem call-back, which is obtained from the metadata server using RadFS metadata operations.

[Figure 4.1: Virtual paths mapping to ondisk paths - virtual filesystems /mnt/jane-vfs (RW branch id 1) and /mnt/john-vfs (RW branch id 2), both based on RO branch 0, with read-only and read-write disk requests routed to the ondisk branch directories]

4.1.3 RadFS FUSE call-backs and metadata operations

Table 4.1 lists the FUSE call-backs used by RadFS and the corresponding RadFS metadata operations performed that update the metadata radix tree. Each RadFS operation results in an exchange of data between the RadFS FUSE application and the metadata server using Unix sockets. This section gives an overview of how the RadFS FUSE application uses the metadata operations to create virtual filesystems with COW and content hashing. The metadata operations are described at the level of abstraction required at the RadFS FUSE application side. More details about each RadFS metadata operation, as implemented in the RadFS metadata server, are described in Section 4.2.2. The operations in italics are those that are conditionally executed in a call-back based on the nature of the file/directory.

Table 4.1: FUSE call-backs and associated RadFS operations

FUSE call-back    RadFS operations
getattr (stat)    GET-STAT
fgetattr (fstat)  fstat
access            GET-BRANCH-ID, GET-DISK-PATH, access
readlink          GET-BRANCH-ID, GET-DISK-PATH, readlink
readdir           GET-CHILDREN, lstat, filler
mknod             GET-BRANCH-ID, GET-WRITEABLE-DISK-PATH, mknod/mkfifo, lstat, SET-STAT
mkdir             GET-BRANCH-ID, GET-WRITEABLE-DISK-PATH, mkdir
unlink            GET-BRANCH-ID, GET-STAT, GET-DISK-PATH, unlink, DEL
rmdir             GET-BRANCH-ID, GET-CHILDREN, GET-DISK-PATH, rmdir, lstat, DEL
symlink           GET-BRANCH-ID, GET-WRITEABLE-DISK-PATH, symlink
rename            GET-BRANCH-ID, GET-WRITEABLE-DISK-PATH, GET-DISK-PATH, GET-STAT, SET-DISK-PATH, SET-STAT, MOVE, DEL
link              GET-BRANCH-ID, GET-STAT, GET-DISK-PATH, COW, COW-LINKS, SET-STAT, GET-WRITEABLE-DISK-PATH, link, LINK-IN, HASH-DUMP-UPDATE
chmod             GET-BRANCH-ID, GET-DISK-PATH, GET-STAT, GET-WRITEABLE-DISK-PATH, SET-DISK-PATH, COW, SET-STAT, chmod
chown             GET-BRANCH-ID, GET-DISK-PATH, GET-STAT, GET-WRITEABLE-DISK-PATH, SET-DISK-PATH, COW, SET-STAT, chown
truncate          GET-BRANCH-ID, GET-STAT, GET-DISK-PATH, COW, COW-LINKS, truncate, SET-STAT
ftruncate         ftruncate
utimens           GET-BRANCH-ID, GET-DISK-PATH, GET-STAT, GET-WRITEABLE-DISK-PATH, SET-DISK-PATH, COW, SET-STAT, utimes, HASH-DUMP-UPDATE
open              GET-BRANCH-ID, GET-STAT, GET-DISK-PATH, COW, COW-LINKS, SET-STAT, open, HASH-DUMP-UPDATE
read              pread
write             pwrite
statfs            GET-BRANCH-ID, GET-DISK-PATH, statvfs
flush             close(dup)
release           close, HASH
fsync             fsync

The branch ID of the target file or directory in each file operation, and the type of operation, decide which RadFS metadata operations need to be performed to achieve read-only sharing and COW semantics. As can be noticed in Table 4.1, almost all filesystem operations start with a GET-BRANCH-ID operation. This gets the branch ID associated with the target file or directory on which the operation needs to be performed. This branch ID obtained from the metadata server is compared with the branch ID of the virtual filesystem handling the filesystem operation.
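As an illustration of that comparison, a pared-down open call-back might look like the sketch below. The helper names (radfs_get_branch_id, radfs_get_disk_path, radfs_get_writeable_disk_path, radfs_copy_on_write) stand in for the GET-BRANCH-ID, GET-DISK-PATH, GET-WRITEABLE-DISK-PATH and COW requests to the metadata server; they are hypothetical, not the actual RadFS functions.

    #define FUSE_USE_VERSION 26
    #include <fuse.h>
    #include <fcntl.h>
    #include <errno.h>
    #include <stddef.h>

    /* Assumed helpers wrapping metadata-server requests (hypothetical). */
    extern int my_branch_id;                               /* this VFS's RW branch */
    extern int radfs_get_branch_id(const char *vpath);
    extern int radfs_get_disk_path(const char *vpath, char *out, size_t n);
    extern int radfs_get_writeable_disk_path(const char *vpath, char *out, size_t n);
    extern int radfs_copy_on_write(const char *vpath);     /* copy RO file to RW branch */

    static int radfs_open(const char *vpath, struct fuse_file_info *fi)
    {
        char dpath[4096];
        int fd, wants_write = (fi->flags & O_ACCMODE) != O_RDONLY;

        /* A node owned by another branch is read-only for this filesystem. */
        if (wants_write && radfs_get_branch_id(vpath) != my_branch_id) {
            if (radfs_copy_on_write(vpath) != 0)            /* COW into our RW branch */
                return -EIO;
            radfs_get_writeable_disk_path(vpath, dpath, sizeof(dpath));
        } else {
            radfs_get_disk_path(vpath, dpath, sizeof(dpath));
        }

        fd = open(dpath, fi->flags);                        /* re-issue to backing ext3 */
        if (fd < 0)
            return -errno;
        fi->fh = fd;                                        /* used by later reads/writes */
        return 0;
    }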
If the operation is a read operation (access, readlink), the GET-DISK-PATH operation gets the disk path, be that on the ondisk read-only path or the read-write path associated with the virtual filesystem. If the operation is a write operation (mknod, mkdir, symlink, rename, link, truncate, open), and the branch ID is not of an ondisk branch that is read-write, the GET-WRITEABLE-DISK-PATH operation creates the hierarchy of directories leading to the target file or directory that needs to be written to or created in the read-write branch associated with the virtual filesystem. This may include copying the file (a RadFS COW operation) from the read-only disk branch to the read-write disk branch in the case of operations that modify an existing file (open, truncate, link). After the copying, the filesystem call is issued to the newly copied file. In the case of filesystem calls that create files (mknod, mkfifo, mkdir, symlink, etc.), a GET-WRITEABLE-DISK-PATH operation creates the directory structure in the ondisk read-write branch and then the particular filesystem call is issued to create the file or directory in the read-write branch.

RadFS's content hashing mechanism and associated file sharing dictate that the metadata for files that are shared also include the file attributes that get modified - uid, gid, mode, access time, modify time and change time. Thus the stat call for a regular file leads to the GET-STAT RadFS operation, which stats the ondisk file, possibly linked to the content hash dump, and then overlays the above mentioned attribute values with those obtained from the metadata node for that particular file. This addition of file attributes to the metadata node is only for regular files, as only regular files are hashed and dumped into the content hash dump. The other files are either shared read-only, with changes being made only to the access time file attribute, or copied on write to obtain a private copy, and directories are created privately for each branch. The addition of file attributes to the nodes representing regular files entails maintaining/updating those attributes using the SET-STAT operation in those filesystem call-backs that lead to a change in the attributes (see Table 4.1).

The DEL operation and the GET-WRITEABLE-DISK-PATH operation are the opposites of each other - DEL deletes nodes from the radix tree metadata structure while GET-WRITEABLE-DISK-PATH inserts nodes. DEL requires special mention as in some cases it need not be backed up by a filesystem call, be that unlink or rmdir, and so filesystem errors need to be generated by RadFS when necessary. This happens when a file or directory on the read-only branch is deleted. In this case, only the corresponding metadata node should be deleted, and since a DEL operation always succeeds, permission checking should be done before DEL and filesystem errors might have to be returned (see Section 4.1.3.1).

The GET-CHILDREN operation is used to create a directory listing (readdir), and it returns a list of the subdirectories and files of a given directory. The FUSE library has a filler interface that feeds a buffer that it uses to build the readdir response. Since a RadFS virtual filesystem directory might have files or subdirectories from multiple ondisk branches, each file or subdirectory needs to be separately stat'ed and fed to the filler function. This involves getting the disk path (GET-DISK-PATH) of each file/sub-directory followed by the stat call.
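A minimal readdir call-back built on such a GET-CHILDREN request could look like the following; radfs_get_children and radfs_get_disk_path are assumed wrappers around the metadata operations, the fixed-size name list is purely illustrative, and the per-entry resolution shown here is exactly what the next paragraph describes optimizing.

    #define FUSE_USE_VERSION 26
    #include <fuse.h>
    #include <sys/stat.h>
    #include <stdio.h>
    #include <stddef.h>

    /* Assumed wrapper: fills 'names' with the entries of directory 'vpath'
     * merged across the virtual filesystem's branches (hypothetical). */
    extern int radfs_get_children(const char *vpath, char names[][256], int max);
    extern int radfs_get_disk_path(const char *vpath, char *out, size_t n);

    static int radfs_readdir(const char *vpath, void *buf, fuse_fill_dir_t filler,
                             off_t offset, struct fuse_file_info *fi)
    {
        char names[256][256], child[4096], dpath[4096];
        struct stat st;
        int n = radfs_get_children(vpath, names, 256);

        (void)offset; (void)fi;
        filler(buf, ".", NULL, 0);
        filler(buf, "..", NULL, 0);
        for (int i = 0; i < n; i++) {
            /* Each entry may live in a different ondisk branch, so it is
             * resolved and stat'ed individually before being fed to filler. */
            snprintf(child, sizeof(child), "%s/%s", vpath, names[i]);
            radfs_get_disk_path(child, dpath, sizeof(dpath));
            if (lstat(dpath, &st) == 0)
                filler(buf, names[i], &st, 0);
        }
        return 0;
    }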
Since the readdir operation is heavily used, it is optimized so that the metadata operations performed are as unified and streamlined as possible. As much processing as possible is done without having to communicate over sockets unnecessarily. Thus GET-CHILDREN groups the multiple GET-DISK-PATH operation results into one big buffer that is sent as a whole to the FUSE application, which parses it and stats the disk paths obtained to build the readdir response.

The SET-DISK-PATH operation sets the disk path of a node. Normally the disk path of a node is set when the node is inserted during a write operation through GET-WRITEABLE-DISK-PATH. But in the case of a rename of a read-only file or directory, it is only the RadFS metadata that needs to be changed and not the disk path, as the rename call-back is not backed up by a backing filesystem rename. In the case of a rename operation on a read-only branch, the metadata change is effected by inserting a new node into the read-write branch of the virtual filesystem and then setting its disk path to be the same as that of the read-only branch node's disk path, using SET-DISK-PATH, and flagging the node as read-only (even if on the read-write branch). This is followed by a DEL operation that removes the read-only node from the current virtual filesystem, and the rename operation is complete. Renames of virtual filesystem directories that are on the read-only branch need extra processing, as the new node created on the read-write branch should also be linked to its files and sub-directories. This is done by the MOVE operation, which ensures that the renamed directory has all its sub-directories and files linked in (see MOVE in Section 4.2.2). A rename on a read-write branch file/directory involves a filesystem backed rename call, which does an actual rename on disk from the source path to the destination path, in addition to the usual updates in metadata.

SET-DISK-PATH is also used in the chmod, chown and utimes call-backs when the call-back is for an operation on a regular file on the read-only branch of a virtual filesystem. As mentioned earlier, content hashing and the associated sharing happen only for regular files, and thus filesystem attributes (uid, gid, mode, atime, ctime, mtime) are maintained as part of RadFS metadata only for regular files. For other files, and directories, the call-back is backed by the equivalent filesystem call, which updates these attributes on disk. To update the file attributes for a read-only regular file, a new node is created using GET-WRITEABLE-DISK-PATH but with the disk path set to the read-only branch node's disk path using SET-DISK-PATH. Thus a later GET-STAT operation would stat the file from the read-only branch disk path and then overlay the stat information with the attributes saved in the corresponding node on the read-write branch. The chmod, chown and utimes filesystem call-backs in RadFS FUSE all perform a COW operation in the case of a non-regular file or a directory.

The LINK-IN and COW-LINKS operations are needed to support hard links in RadFS virtual filesystems. Hard links are difficult to handle in copy-on-write filesystems. A file hard linked to another is distinguishable from the other only by virtue of its absolute path. A hard link is another name for a file, and thus hard linked files have the same attributes, including the same inode number. Thus hard linked files share the same data, but without any easy mechanism to know which other file is sharing data with a particular hard linked file.
Thus a copy-on-write operation performed on a hard linked file would lead to problems, an example of which is described by the following scenario.

Files foo and bar are hard linked to each other in the read-only branch of a particular virtual filesystem. A write operation is performed on file foo that leads to it getting copied over to the read-write branch of the virtual filesystem. This breaks the expected behaviour of bar following foo's changes, as foo in the read-write branch is not linked to bar anymore. So any application that expects this behaviour (write to foo and expect the changes to be reflected in bar) breaks.

Solitude [10] handles the problem of hard links in its copy-on-write implementation by having a table that maps Solitude's metadata nodes to each other based on inodes. RadFS handles hard links by having back pointers in hard linked nodes that link them together to form a 'hard link' list. The LINK-IN operation in the link callback adds the newly hard linked file's metadata node to the hard link list. When a hard linked file needs to be copied on write, RadFS copies the particular file and also creates hard links to it in the read-write branch using the hard link list (see Section 4.2.2.4 for more details). This is done by the COW-LINKS operation once it is known that a particular file is hard linked with other files, which can be checked by using the nlink (number of links) attribute in the stat information of a file.

4.1.3.1 Permission checking and Error handling

Since RadFS maintains virtual filesystems, it also has to deal with virtualizing permission checking and error handling, both in those cases where the backing filesystem isn't used at all and because a virtual filesystem is made up of multiple branches. In some callbacks, e.g. mknod, the filesystem call is re-issued with the actual disk path and any error thrown by the backing filesystem call can be returned. But even in this case, because of multiple branches, the basic EEXIST error has to be generated by RadFS FUSE if the path already exists in the read-write branch or the read-only branches that make up the virtual filesystem. Similarly an ENOENT error has to be returned if the path is not found in any of the virtual filesystem branches. In the case of unlink, an EISDIR error should be returned if the path leads to a directory. This, as mentioned earlier, is necessary because in the case of a read-only branch the filesystem call unlink isn't invoked and so the DEL operation will succeed even if the node is a directory. In the case of rmdir, an ENOTEMPTY error should be returned if any of the virtual filesystem branches has an entry that is part of the directory to be removed.

The error that needs to be handled the most is EACCES, or insufficient permission. Each RadFS FUSE application taking care of a virtual filesystem is run with super user privileges. This means that any operation performed within a particular RadFS FUSE application would be as the root user, who has almost unlimited privileges. FUSE provides a fuse_get_context interface to get the context in which the current filesystem call being handled was executed. This context can be used to extract the user ID (uid) and group ID (gid) of the user performing the operation. Thus every filesystem operation performed within RadFS FUSE needs to be wrapped in setuid and setgid calls which set the uid and gid to the values obtained from fuse_get_context.
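A minimal sketch of that wrapping for a single callback (mkdir), assuming a single-threaded FUSE loop so the process-wide effective IDs can be switched safely; the radfs_get_writeable_disk_path() helper and buffer sizes are illustrative, not RadFS's actual interface:

    #define FUSE_USE_VERSION 26
    #include <fuse.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>
    #include <errno.h>

    /* Hypothetical helper that resolves a virtual path to a writeable
       ondisk path via the metadata server (see GET-WRITEABLE-DISK-PATH). */
    extern int radfs_get_writeable_disk_path(const char *vpath,
                                             char *diskpath, size_t len);

    static int radfs_mkdir(const char *vpath, mode_t mode)
    {
        struct fuse_context *ctx = fuse_get_context();
        char diskpath[1024];
        int res;

        if (radfs_get_writeable_disk_path(vpath, diskpath, sizeof(diskpath)) < 0)
            return -ENOENT;

        /* Temporarily become the calling user: group first, then user. */
        if (setegid(ctx->gid) < 0 || seteuid(ctx->uid) < 0)
            return -EPERM;

        res = mkdir(diskpath, mode);
        res = (res < 0) ? -errno : 0;

        /* Regain root for subsequent metadata work: user first, then group. */
        seteuid(0);
        setegid(0);
        return res;
    }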
Wrapping calls this way drops the super user privileges and delegates permission checks to the backing filesystem.

For regular files, which are shared through the content hash dump, the uids and gids on disk will be those of the super user, as it is the RadFS hash server, again run with super user privileges, that manipulates the content hash dump linking. Thus for regular files, permissions need to be checked against the uid, gid and mode retrieved from their corresponding metadata nodes. The uid and gid retrieved using GET-STAT are compared with the uid and gid retrieved using fuse_get_context, and this, coupled with the mode retrieved, is used to check if the operation is permitted.

In the case of unlink and rename, POSIX states that if the directory containing the file to be deleted or renamed has the sticky bit set, and if the effective uid (obtained from fuse_get_context) doesn't match either the containing directory's uid or that of the file/directory to be deleted/renamed, and if the initiator of the rename or delete isn't the super user, then an EACCES or EPERM error has to be returned. This is also handled by RadFS in addition to general permission checking.

In the case of an error, RadFS also has to roll back the metadata changes it made prior to performing the filesystem backed call that failed.

4.1.3.2 Content hashing

RadFS FUSE applications initiate the content hashing process for regular files associated with each of their virtual filesystems by sending a request to the RadFS hash server. The content hash request is sent whenever a file is closed, which is done in the release FUSE callback. After the close filesystem call, RadFS FUSE sends a content hash request to the RadFS metadata server with the virtual filesystem path and the virtual filesystem's branch ID. The metadata server checks if the file is regular and if the node representing the file is on the read-write branch of the virtual filesystem and, if so, forwards the request with the actual disk path of the file to the RadFS hash server. The RadFS hash server generates a content hash and links the file into the content hash dump (see Section 4.3).

A file in the content hash dump can be shared by multiple virtual filesystems, and any file that is shared should be copied out to the virtual filesystem's ondisk read-write path for modification. The RadFS FUSE application, in addition to checking whether a particular file is present in the read-only branch, also checks whether it is being shared by some other virtual filesystem using the content hash dump, in which case the file is also considered read-only and the same read-only COW semantics apply even if the file is present in the read-write branch of the virtual filesystem. Since files end up being shared by being hard linked to each other, a way to check if a file is shared is to see if its hard link count attribute nlink is greater than one. If that is the case, and if the file isn't hard linked within the virtual filesystem (which can be checked by seeing if the node corresponding to the file is part of a hard link list), then the file is shared among more than one virtual filesystem and should be considered read-only. This reasoning is done in the GET-BRANCH-ID operation which returns either the read-write branch ID of the virtual filesystem or a value that indicates whether the file is on the read-only branch, is hard linked in the virtual filesystem or is shared using the content hash dump.
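A sketch of how that decision might look on the FUSE side, with the hard-link-list membership test answered by a hypothetical metadata query (all names here are illustrative assumptions):

    #include <sys/stat.h>
    #include <stdbool.h>

    /* Hypothetical metadata query: is this node on a hard link list
       inside the same virtual filesystem? */
    extern bool radfs_node_is_locally_hardlinked(const char *vpath, int branch_id);

    /* Returns true if the file must be treated as read-only and copied on
       write, either because it lives on a read-only branch or because it is
       hard linked into the shared content hash dump. */
    static bool radfs_needs_cow(const char *vpath, const char *diskpath,
                                int node_branch_id, int rw_branch_id)
    {
        struct stat st;

        if (node_branch_id != rw_branch_id)
            return true;                    /* read-only branch */

        if (stat(diskpath, &st) < 0)
            return false;                   /* let the caller surface the error */

        if (S_ISREG(st.st_mode) && st.st_nlink > 1 &&
            !radfs_node_is_locally_hardlinked(vpath, rw_branch_id))
            return true;                    /* shared via the content hash dump */

        return false;
    }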
The virtual filesystem managed by RadFS FUSE then acts accordingly to preserve COW semantics and sharing using the content hash dump.

4.1.3.3 Miscellaneous implementation details

An issue that had to be considered when implementing the symlink and readlink callbacks was that of absolute paths in the root filesystem escaping the mount point and referring to the backing filesystem's directories and files. The RadFS FUSE application handles all filesystem calls that are directed to the mount point of the virtual filesystem that it maintains. So a virtual filesystem mounted at /mnt/vfs1 works fine when the paths are relative to the mount point. But when a virtual filesystem is populated with a root filesystem image which has symlinks with absolute paths, a readlink gives an absolute path which points outside the virtual filesystem. The basic problem is that a path in the root filesystem mounted at a particular mount point is not absolute with respect to the mount point. But this is easily fixed in RadFS because of the way it exports the virtual filesystem via NFS to Xen virtual machines. This creates a 'chroot jail' kind of environment with the root fixed as the mount point and any path being absolute with respect to the mount point.

Another interesting fact worth mentioning about POSIX filesystem rules arises in the implementation of the rename callback for hard linked files. When a file foo is hard linked to another file bar and a rename(foo, bar) filesystem call is executed, the file foo will still exist even though rename returns 0. This is the expected behaviour as set out by POSIX but breaks certain filesystem tests which expect the source file in the rename operation to get deleted after the rename succeeds with a return value of 0. The callback for rename in RadFS FUSE checks if the source file exists after the rename operation, and if so deletes it, as is done by mv.

4.2 RadFS metadata server

The RadFS metadata server manages the radix tree metadata structure and satisfies requests from the RadFS FUSE applications and the CLI using the UNIX socket interface. The metadata server is single threaded and handles multiple connections using select. It also handles persistence of the virtual filesystems by keeping a log of all operations that modify the metadata tree and writing the whole radix tree to disk whenever the operation log grows beyond a specific size. In the beginning, when the server is initialized, the base read-only branch is built up from the ondisk root filesystem image, the path of which is specified as a command line argument to the RadFS metadata server application. Thus when the server is initialized, it has one branch, with branch ID 0, that forms the base read-only virtual filesystem that other virtual filesystems can use as their read-only branch. If the content hashing mechanism needs to be enabled, a user can instruct the server to initialize the content hash dump through the CLI (see Section 3.5). The RadFS FUSE applications check for the presence of the content hash dump and, if present, use the content hashing mechanism. The server then waits for requests from either the RadFS FUSE or the CLI applications.
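A skeleton of such a single-threaded select() loop over a UNIX domain socket follows; the socket path handling and the radfs_dispatch_request() helper are illustrative, not the server's actual interfaces:

    #include <sys/select.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <string.h>
    #include <unistd.h>

    extern void radfs_dispatch_request(int fd);   /* parse and handle one request */

    int radfs_metadata_server_loop(const char *sock_path)
    {
        int lfd, maxfd, fd;
        fd_set all, ready;
        struct sockaddr_un addr;

        lfd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (lfd < 0)
            return -1;
        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, sock_path, sizeof(addr.sun_path) - 1);
        unlink(sock_path);
        if (bind(lfd, (struct sockaddr *) &addr, sizeof(addr)) < 0 ||
            listen(lfd, 16) < 0)
            return -1;

        FD_ZERO(&all);
        FD_SET(lfd, &all);
        maxfd = lfd;

        for (;;) {
            ready = all;
            if (select(maxfd + 1, &ready, NULL, NULL, NULL) < 0)
                continue;
            for (fd = 0; fd <= maxfd; fd++) {
                if (!FD_ISSET(fd, &ready))
                    continue;
                if (fd == lfd) {                  /* new FUSE/CLI connection */
                    int cfd = accept(lfd, NULL, NULL);
                    if (cfd >= 0) {
                        FD_SET(cfd, &all);
                        if (cfd > maxfd)
                            maxfd = cfd;
                    }
                } else {
                    radfs_dispatch_request(fd);   /* one request per wakeup */
                }
            }
        }
    }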
The following subsections describe how the various radix tree operations supported are implemented in the server, along with details about how persistence is handled and how snapshots are created.

4.2.1 Radix tree

The radix tree used for storing metadata for the different virtual filesystems is made up of a base 'gold' read-only branch which is initialised at server start-up by parsing the base root filesystem in an EXT3 directory. Other metadata branches are built on this read-only branch and the other branch nodes get added to the radix tree using the read-only branch nodes. This is done by attaching the new branch nodes to the base 'gold' branch node using the branch pointers described in Section 3.2. Whenever a file is modified or a file or directory is created in a read-write branch representing a virtual filesystem, the corresponding node is added to the particular branch, either by adding it as a child to an existing read-write branch node of the same branch ID or by attaching it to the corresponding node in the read-only base branch using the branch pointers. Thus the metadata radix tree starts from a single read-only branch that represents the initial virtual filesystems. With file modifications and new file creations, new branch nodes get added to the base gold branch to represent the modified virtual filesystem; we call this the 'growth-on-gold' model. Removing a node either results in creation and attachment of a 'white-out' node to a 'gold' node in the case of a node in the read-only branch, or an actual removal of the node if the node is part of a read-write branch. Thus at any point, the radix tree consists of branches representing the various virtual filesystems, containing nodes that are read-only or read-write (depending on the virtual filesystem branch ID and the node's branch ID) or white-outs. Insertions, deletions and the other radix tree operations used by RadFS FUSE applications to update the metadata for the various virtual filesystems are described in the next section.

4.2.2 Radix tree operations

The core operations performed on the radix tree metadata structure (described earlier in Section 3.2) are rtree-search, rtree-insert and rtree-delete. rtree-search takes a virtual filesystem path, parses it using the radix tree and returns the target file/directory node if found. rtree-insert is used to insert a file or directory node into the tree and fix up its branch, child, parent, sibling and, if necessary, hard link pointers (see Figure 3.2). rtree-delete is called when a file or directory is deleted or renamed and deletes the corresponding metadata node with the corresponding linkage fix ups, or creates a white-out node if the branch is read-only.

4.2.2.1 rtree-search operation

Traversing the radix tree for a particular path involves breaking the path into path segments, starting from the root directory '/' and leading up to the target file or directory of a given branch based on the branch ID. When a virtual filesystem is created, only the root directory '/' is created, with its child pointer pointing to the child node of the base read-only branch's root directory. Thus a virtual filesystem created would initially be identical to the base read-only filesystem as any look-up beyond the root node leads to the read-only branch nodes. rtree-search of a particular path starts at the root node and ends when the whole path has been parsed or the parsing fails at a particular level.
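Before walking through the individual algorithms, the following is a sketch of the per-node structure that Section 3.2 and the operations below imply; every field name is illustrative rather than RadFS's exact definition:

    #include <sys/stat.h>

    struct rtree_node {
        char              *name;        /* path segment, e.g. "john.doc"       */
        int                branch_id;   /* 0 = base 'gold' read-only branch    */
        char              *disk_path;   /* NULL marks a white-out node         */

        struct rtree_node *parent;      /* containing directory                */
        struct rtree_node *child;       /* first entry inside a directory      */
        struct rtree_node *next_sib;    /* siblings within the same directory  */
        struct rtree_node *prev_sib;

        struct rtree_node *branch_next; /* same virtual path, other branches   */
        struct rtree_node *branch_prev;

        struct rtree_node *hlink_next;  /* 'hard link' list (Section 4.1.3)    */
        struct rtree_node *hlink_prev;

        int                is_regular;  /* regular files carry overlay attrs   */
        struct stat        attrs;       /* uid, gid, mode, atime, mtime, ctime */
    };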
The basic algorithm is described in Algorithm 1.

Algorithm 1: rtree-search
    Input: root node (cur), path to be parsed (path), branch ID (br), 'insert or look-up' flag
    Output: node corresponding to the path target, or to the last path segment of path successfully parsed
    path-seg = get-first-path-segment(path)
    while there is a sibling node for cur and cur's path-seg != path-seg do
        cur = cur's next-sibling
    end
    if cur's path-seg != path-seg then
        return cur's parent
    else
        if the search is for insert and branch ID of cur != br then
            return cur's parent
        end
    end
    remove path-seg from beginning of path
    if path has more path-seg then
        return rtree-search(cur's child, path, br, flag)
    else
        return cur
    end

Thus a search involves parsing the path a segment at a time, starting from the root. A search operation can be either for insertion or for checking the existence of a particular path. Levels (directories) in the filesystem hierarchy are traversed through child pointers while the sibling pointers are used to search within a directory level. A node returned in the case of searching for insertion is guaranteed to have the branch ID of the virtual filesystem for which the search is being performed. In this case, if the node corresponding to the target file or directory isn't found, the parent node is returned so that the remaining path can be inserted at the parent node using rtree-insert. A node returned in the case of searching for existence of a path can have a branch ID that is of any of the branches that make up the virtual filesystem. Thus a search with the flag set for insertion always returns a node associated with the read-write branch of the virtual filesystem. A search with the flag set to look-up can return nodes from the read-only branch too. Figure 4.2 shows an example where /home/john/john.doc is looked up.

[Figure 4.2: rtree-search for /home/john/john.doc. Legend: child link, parent link, sibling double link, branch double link, look-up path; base read-only branch and read-write branch.]

4.2.2.2 rtree-insert operation

The rtree-insert function takes a node, a relative path and a branch ID as parameters and inserts the nodes corresponding to the directories leading to the target as well as the target file/directory of the path, recursively. As mentioned before, nodes are inserted into the tree based on a 'growth-on-gold' model wherein rtree-insert first tries to find a corresponding node in the virtual filesystem's read-only branch to which it can attach the new branch node, using the branch pointers. If no such read-only node is found, then the node is inserted as the child of the parent node with the same branch ID (corresponding to the directory that will contain the new file/directory being created), which is guaranteed to be present in the read-write branch due to the recursive nature of rtree-insert. rtree-insert is invoked for inserting the section of a path that is not present in the radix tree for a particular branch at a given node.
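rtree-search, rtree-insert and rtree-delete all step through the virtual path one segment at a time. The following is a small helper of the kind the pseudocode's get-first-path-segment implies; the name mirrors the pseudocode but the implementation shown is only illustrative:

    #include <stddef.h>

    /* Copies the first segment of "/usr/bin/john-app" (here "usr") into seg
       and returns a pointer to the rest of the path ("bin/john-app"),
       or NULL when the path is exhausted. */
    static const char *get_first_path_segment(const char *path,
                                              char *seg, size_t seglen)
    {
        const char *p = path;
        size_t n = 0;

        while (*p == '/')                    /* skip leading separators */
            p++;
        if (*p == '\0')
            return NULL;
        while (*p != '\0' && *p != '/' && n + 1 < seglen)
            seg[n++] = *p++;
        seg[n] = '\0';
        while (*p == '/')
            p++;
        return p;
    }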
Algorithm 2 shows the basic logic used in rtree-insert.

Algorithm 2: rtree-insert
    Input: node at which path should be inserted (cur), path to be inserted (path), branch ID (br)
    path-seg = get-first-path-segment(path)
    create new node (node) with path-seg
    if cur doesn't have a child then
        cur's child = node
        node's parent = cur
    else
        parent = cur
        cur = cur's child
        while there is a sibling node for cur and cur's path-seg != path-seg do
            cur = cur's sibling
        end
        if cur's path-seg != path-seg then
            node's next-sibling = parent's child
            replace parent's child with node
            fix up sibling, parent and child pointers for node
        else
            if cur's branch ID = br then
                found cur, a white-out node representing a deleted node;
                discard the new node and update the white-out instead
            else
                found cur, a read-only branch node corresponding to node;
                attach node to cur using branch pointers
                fix up sibling, parent and child pointers for node
            end
        end
    end
    remove path-seg from beginning of path
    if path has more path-seg then
        rtree-insert(node, path, br)
    end

The fixing up of the pointers of a newly created node is done so that the new node is connected to any virtual branch nodes of the virtual filesystem it is part of. The parent of the newly created node is always from the read-write branch of the virtual filesystem.

The child is always from the read-only branch because the creation of a child always follows the creation of its parent directory node in the read-write branch. This insertion of a node into the read-write branch of the radix tree is mirrored in the actual ondisk path for the read-write branch, where the directories that lead to the file or directory corresponding to the newly created node are also created ondisk.

The branch pointer of the newly created node might link it to its corresponding read-only branch node, and in that case the virtual paths of the read-only and read-write nodes are the same but backed by different ondisk directories.

The sibling pointers are updated so that they point to the next and previous sibling nodes in the read-write branch if they exist, or to read-only branch nodes otherwise. In the case where next and previous read-write branch nodes are found to link to, those nodes' previous and next sibling pointers are also updated so that if a look-up gets to a particular read-write branch node, it stays with the read-write branch as long as it can.

A white-out is a node that indicates the absence of a file or directory. It differs from a regular node by the fact that the disk-path field of the white-out node is set to NULL. Updating white-outs during inserts requires changing the disk path from NULL to a valid value. Fixing up the updated white-out node's links is unnecessary as the node is already part of the read-write branch and would have been updated as part of the read-write branch updates. Figure 4.3 shows the nodes created in the radix tree for inserting path /usr/bin/john-app.

[Figure 4.3: rtree-insert for /usr/bin/john-app. Legend: child link, parent link, sibling double link, branch double link; base read-only branch and read-write branch.]

4.2.2.3 rtree-delete operation

The rtree-delete operation deletes nodes from the radix tree. In the case of deleting a node in the read-only branch, a new white-out node having the branch ID of the read-write branch is attached to the read-only node via branch pointers. Any look-up for the node using the particular branch ID will thus lead to the white-out, signifying that the node has been deleted.
In the case of deleting a node onthe read-write branch but which is attached to a corresponding read-only branchnode, the node is updated as a white-out node to prevent look-ups returning theread-only branch node. In the case where the read-write node does not have acorresponding read-only node in any of the read-only branches that make up thevirtual filesystem, an actual deletion of the node is performed. The basic logic isdescribed by Algorithm 3.Algorithm: rtree-deleteInput: path to be deleted(path), branch ID (br)node = rtree-search(root node of branch br, path, br, look-up)if node branch ID != br thencreate white-out with branch ID br fix up parent, child and siblingpointers of white-outelseif node ‘is branch pointers !‘ NULL thenIupdate node as white-outelsefix up pointers to node delete nodeendendAlgorithm 3: rtree-deleteFixing up node links in rtree-delete is similar to fix ups in rtree-insert in thecase involving white-out creation. A special case to handle arising from the structure of the radix tree is when creating a white-out for the first child of a particularnode. In this case, it is necessary to check for a previous sibling of the read-writebranch which won’t be linked to the read-only node to be deleted through a previous sibling link. rtree-delete also needs to fix up the hard link pointers of anode that is to be deleted. Adding a node to the hard link list need not be donein rtree-insert as a specific rtree-link operation is implemented and used after theinsert operation which adds the newly inserted node to the list of other nodes thatit is hard linked to. Figure 4.4 shows rtree-delete for read-only node /usr/bin/awk,read-write node /usr/bin/john-app, and read-only node /etc/fstab.41C a a CD CD CD CD CD oflD;IIR!SLit-)D 9-i;=tro 0- —SDCDII-S CD 0 C CD CM0)n a04.2.2.4 Other radix tree operationsrtree-link links metadata nodes that represent files that are hard linked in a virtualfilesystem. It is used after a hard link operation to add the node correspondingto the new file that was hard linked, to a list of nodes corresponding to the fileshard linked to each other and now the new file. This list is used in copy-on-writeoperations for the files that are hard linked, to work around the problem of hard-link behaviour breakage described in Section 4.1.3. create-hard-links is a helperfunction that ‘copies’ hard linked files by copying the initial hard linked file thatwas written to from the read-only to read-write disk path and then creating the hardlinks to it using the hard link list associated with the initial file node.rtree-move moves the nodes corresponding to files and sub-directories of a directory from under the corresponding directory metadata node to another directorymetadata node. It is used after a rename filesystem call to link the nodes corresponding to the sub-directories and files of a directory to the new node representingthe renamed directory. It also fixes up the parent pointers for all the file and subdirectory nodes to point to the new node. When a directory is renamed, only thedisk path of the directory and the parent pointers of all its children (sub-directoriesand files) are updated. The disk paths of all the nodes below the renamed directory will still be pointing to the previous read-only or read-write disk location. 
In the case of read-only directory renames this creates no problem, as the disk path should still refer to the read-only disk path: there are no corresponding read-write disk paths for the sub-directories and files because the rename is only in virtual filesystem metadata. But in the case of read-write directory renames, an actual rename backing filesystem operation is performed, which leaves the disk paths of all nodes below the renamed directory inconsistent. For a read-write branch node, the path parsed to get to the node in the radix tree, prefixed with the ondisk branch root directory, gives the disk path for that particular node. The RadFS metadata server deals with inconsistent disk paths on read-write paths lazily, by updating the disk path of a read-write node whenever it is accessed.

get-branch-id searches for a node using rtree-search and, if found, returns the branch ID. It also returns values that specify whether the file or directory associated with the node is read-only or hard linked. The hard link status returned is checked by the RadFS FUSE application to initiate the content hashing mechanism and will be described in Section 4.3.

get-children returns the names and disk paths of the children (sub-directories and files) of a particular node. This is used by the RadFS FUSE readdir function to fill a buffer for directory listings. get-children takes a path and branch ID as parameters, searches for the corresponding directory node, and follows its child pointer to get all its children using the child and its siblings.

The get-stat function uses the filesystem call stat to get the attributes of a file/directory and then overlays that data with the attributes stored in the corresponding node in the radix tree. The set-stat function allocates a stat structure to store the uid, gid, mode, atime, ctime, and mtime attributes from a stat parameter passed to it. The set-stat function, as opposed to the get-stat function, is invoked only for regular files as in all other cases all the attributes are obtained from disk.

The get-disk-path function searches for the node corresponding to a particular virtual filesystem path and then returns the disk path stored in the node. It also updates the disk path if the node is on the read-write branch and the path parsed to get to the node in the radix tree, prefixed with the read-write branch root disk path, is different from the disk path stored in the node.

The get-writeable-disk-path function inserts the part of a path that is not present in a read-write virtual filesystem branch. This involves inserting hierarchically the nodes leading up to and including the target file or directory node, and also creating the read-write branch's ondisk directory structure leading up to the target file or directory.

4.2.3 Persistence

Persistence of the virtual filesystem metadata is handled by the RadFS metadata server by maintaining a log of operations that modify the radix tree metadata structure. It also writes the radix tree out to disk when the number of log entries reaches a predefined count. In the event of a crash and restart, the metadata server builds the base read-only branch again from disk by parsing the root filesystem directory and creating nodes for each file and directory contained in it.
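The recovery path described here and continued below relies on two append-only logs, an operation log and a tree log. The record layouts below are a sketch of what such logs might contain; every field and constant name is an assumption for illustration, not the format RadFS actually writes:

    #include <stdint.h>

    /* One entry in the operation log: enough to re-execute a tree-modifying
       RadFS operation (insert, delete, set-stat, set-disk-path, ...). */
    struct radfs_op_record {
        uint32_t op_id;          /* which RadFS operation                     */
        uint32_t branch_id;      /* read-write branch issuing the request     */
        uint16_t path_len;       /* followed by path_len bytes of path        */
        uint16_t arg_len;        /* followed by arg_len bytes of arguments    */
    };

    /* One entry in the tree log: a serialized metadata node written when the
       whole radix tree is checkpointed (the base branch is not logged). */
    #define RADFS_NODE_REGULAR   0x1   /* file-attribute block follows        */
    #define RADFS_NODE_HARDLINK  0x2   /* previous hard-link disk path follows */

    struct radfs_tree_record {
        uint32_t branch_id;
        uint32_t flags;          /* RADFS_NODE_* bits                         */
        uint16_t name_len;       /* name bytes follow the header              */
        uint16_t disk_path_len;  /* then the disk path bytes                  */
        /* if RADFS_NODE_REGULAR:  a serialized attribute block follows       */
        /* if RADFS_NODE_HARDLINK: the previous node's disk path follows      */
    };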
It then builds the other virtual filesystem branches by scanning the tree nodes written to the tree log and then replaying the operations logged in the operation log.

The metadata server, when processing requests that modify the radix tree, writes out a log entry for the request which includes a RadFS operation ID, the read-write branch ID of the virtual filesystem issuing the request and any other parameters needed to re-execute the RadFS operation.

The tree log stores the nodes of all the virtual filesystem branches except the base read-only branch, which can be built up from disk as it is read-only. The only part of a read-only branch node that is modified is the branch pointer, which can be set when the corresponding other branch nodes get inserted into the tree during recovery. Each entry in the tree log is a serialized metadata node which includes the file or directory name, the branch ID, the disk path and also flags that indicate whether the node represents a regular file (the node includes file attribute information) or a hard linked file (the node is part of a hard link list). Depending on the flags, additional information such as the serialized file attributes structure (for regular files) and the disk path of the previous node in the hard link list (for hard linked files) is also written to the tree log immediately after the related serialized node. The tree is written out in a depth first fashion, exhausting all the branches of a particular node before moving on to its child node and then finally to its sibling nodes.

During recovery, the file/directory name, the branch ID and the disk path obtained from the tree log are enough to insert and link the node into the tree. The flags are then used to do further reads of the tree log to set up the file attributes in the case of a regular file, and to link it into the hard link list if it is a hard linked file. After all the nodes written out to the tree log have been restored into the radix tree, the restore procedure moves on to the operation log, which contains all operations performed after the radix tree had been written to disk. These operations, when replayed in order, are enough to bring the radix tree to the state it was in before the metadata server shutdown or crash.

4.2.4 Snapshots

RadFS supports light-weight snapshots. A snapshot of a virtual filesystem A makes it read-only and creates a new read-write virtual filesystem A:timestamp whose state is that of A at the time of the snapshot. Thus a snapshot of a virtual filesystem in use is created in RadFS by marking the virtual filesystem read-only and creating a new read-write virtual filesystem which then becomes the virtual filesystem in use. In RadFS a snapshot request is initiated by the RadFS CLI application and handled by the RadFS metadata server.

Along with the response to every metadata operation requested by a particular RadFS FUSE application, the metadata server sends the read-write branch ID of the associated virtual filesystem. This is assigned to a virtual filesystem when it is initialized and is the handle used to identify it in all requests sent by it to the metadata server. The virtual filesystem uses its read-write branch ID to decide on copy-on-write operations.
If the branch ID returned from a GET-BRANCH-ID operation doesn't match the read-write branch ID, the virtual filesystem assumes the file or directory associated with the path is read-only.

The metadata server processes a snapshot request for a virtual filesystem with a particular branch ID by marking the branch ID as read-only, creating a new virtual filesystem based on the virtual filesystem just marked as read-only, and then responding to any further requests from the snapshotted virtual filesystem with the branch ID of the newly created virtual filesystem. This effectively changes the read-write branch ID for the virtual filesystem that was snapshotted, making the previous branch ID read-only. The snapshot operation is atomic at the granularity of a RadFS operation as any subsequent RadFS operation leads to a change in the read-write branch ID of the virtual filesystem. The snapshot operation does not close open file descriptors and hence there will be an inconsistency in read-only semantics for those files that are open for writing at the time of the snapshot, but this is limited to the time when the file gets closed. In the case of a database application that keeps a file open for long periods of time, this guarantee of read-only semantics only on a close is unsatisfactory. To work around this, the RadFS FUSE application can check for a read-write branch ID change in each read or write callback that does not involve a RadFS metadata operation. If a change in the read-write branch ID is detected, a partial copy-on-write operation of the file being written to can be performed and the file handle in the read or write updated with a handle to this copied file on the new read-write disk path.

4.3 RadFS Hash Server

The RadFS hash server maintains the content hash dump, processes requests for content hashing and is implemented using pthreads. The content hash dump is initialised using the CLI. The initialization involves parsing the read-only root filesystem ondisk and hashing the regular files contained in it. This forms the base content hash dump that is shared in read-only mode by virtual filesystems. New files created or existing files that are modified are also hashed and added to the hash dump.

The RadFS hash server spawns a set of worker threads which handle content hash requests received from the RadFS metadata server. Each worker thread receives the disk path of the file to be hashed from the metadata server through a UNIX socket interface. Since a hash request is sent whenever a regular and writeable file in a virtual filesystem is closed, there can sometimes be multiple identical hash requests sent to the hash server. To avoid manipulating the hard links to the content hash multiple times unnecessarily, the server buffers requests (essentially disk paths to files that need to be hashed) using a binary search tree. The buffer stores requests received and being processed and thus coalesces multiple requests. A worker thread removes a disk path from the tree once the content hash for the corresponding file has been generated.

The SHA-1 hash of a file is generated using the GNU gcrypt library by splitting the file into 4KB blocks and creating a digest of their hashes. When the content hash has been generated without there being any further requests, the hash server thread checks whether the content hash dump has a file with the generated hash as its name. If so, it replaces the file in the read-write branch of the virtual filesystem with a hard link to the identical file in the content hash dump.
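A sketch of that hashing step using libgcrypt, together with the atomic link-and-rename replacement described next; the helper names, buffer sizes and the assumption that the dump stores one file per hex-named digest are illustrative, and libgcrypt is assumed to have been initialised elsewhere with gcry_check_version():

    #include <gcrypt.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* SHA-1 over the per-4KB-block SHA-1s, i.e. a digest of block digests. */
    static int radfs_content_hash(const char *diskpath, unsigned char out[20])
    {
        unsigned char block[4096], blkhash[20];
        gcry_md_hd_t outer;
        FILE *f = fopen(diskpath, "rb");
        size_t n;

        if (!f)
            return -1;
        if (gcry_md_open(&outer, GCRY_MD_SHA1, 0) != 0) {
            fclose(f);
            return -1;
        }
        while ((n = fread(block, 1, sizeof(block), f)) > 0) {
            gcry_md_hash_buffer(GCRY_MD_SHA1, blkhash, block, n);
            gcry_md_write(outer, blkhash, sizeof(blkhash));
        }
        memcpy(out, gcry_md_read(outer, GCRY_MD_SHA1), 20);
        gcry_md_close(outer);
        fclose(f);
        return 0;
    }

    /* Replace 'diskpath' with a hard link to the identical file already in
       the dump: link to a temporary name first, then rename over the target
       so the switch is atomic. 'dump_path' names the dedup copy. */
    static int radfs_dedup_replace(const char *diskpath, const char *dump_path)
    {
        char tmp[1024];

        snprintf(tmp, sizeof(tmp), "%s.radfs-tmp", diskpath);
        if (link(dump_path, tmp) < 0)
            return -1;
        if (rename(tmp, diskpath) < 0) {
            unlink(tmp);
            return -1;
        }
        return 0;
    }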
The replacement is done atomically by creating a temporary hard link and then renaming it to the name of the file in the read-write disk branch of the virtual filesystem. If the content hash dump does not have a file with the generated hash as its name, a hard link is created in the content hash dump to the file in the read-write branch, with its name as the 20 byte SHA-1 content hash.

Chapter 5

Evaluation

RadFS is a virtual filesystem that builds on an existing filesystem to provide copy-on-write sharing, fast filesystem deployment with snapshots, and content hashing to save disk space. But this sharing and disk space saving should not come at the cost of significant performance degradation or, more importantly, the correctness of operation of the filesystem. The following sections show how RadFS is evaluated for correctness and performance.

RadFS is backed by the ext3 [15] filesystem, which is a POSIX compliant filesystem. Hence to check for RadFS POSIX compliance, a POSIX filesystem test suite called pjd [8] is used. The test suite has 1957 regression tests that check for POSIX compliance for the chmod, chown, link, mkdir, mkfifo, open, rename, rmdir, symlink, truncate, and unlink filesystem calls.

RadFS is built on FUSE, a user level filesystem library. Referring to Figure 1.1, each filesystem call to a virtual filesystem leads to 6 context switches. But since a file operation on average takes an order of magnitude more time than that taken for a context switch, the overhead of context switches associated with a filesystem operation is negligible. In RadFS, filesystems are virtualized by using FUSE and by maintaining virtual filesystem metadata. Each file operation involves, in addition to a look-up of the backing filesystem metadata, a look-up of the virtual filesystem metadata too. Data transfer in a FUSE filesystem also involves copying of data more than once between user and kernel space, but FUSE uses the kernel page cache by default, which minimizes the associated overhead. To quantify RadFS metadata and throughput overhead, the Bonnie++ filesystem benchmark is used.

RadFS is primarily built for creating virtual machine filesystems, with a goal being quick deployment of multiple virtual machines. This necessitates that RadFS scales well when serving an increasing number of virtual filesystems. To evaluate RadFS scaling, an increasing number of virtual filesystems are spawned, each performing a common program compilation. This stresses concurrency and the RadFS metadata server's ability to handle multiple virtual filesystem requests in a timely and correct manner.

The following subsections go into more detail about each evaluation with the corresponding results and discussion.

5.1 Test environment

All the tests were performed on an AMD Athlon64 X2 Dual Core Processor 3800+, with 2GB of RAM and a Hitachi HDT72503 320GB 7200RPM SATA disk drive. The Linux 2.6.27-rc6 kernel was used with the FUSE 2.8.0-pre1 library. The Linux kernel FUSE module supports NFS exports by default starting from version 2.6.27, but issues still need to be ironed out, one being an NFS stale file handle problem which arises due to the added layer of abstraction in FUSE filesystems. A file handle associated with a particular path can change underneath the NFS client as it is exporting a RadFS virtual filesystem that can switch paths from read-only to read-write. This causes a mismatch between handles expected and found, leading to ESTALE errors.
There is a solution in the works in FUSE that makes FUSE remember file handles indefinitely, but it has not yet been implemented.

5.2 POSIX compliance using pjd

The results of running the POSIX pjd test suite on RadFS are shown below:

    Failed test                 Total  Fail  Failed   List of failed
    /pjd/tests/chmod/02.t           5     1  20.00%   5
    /pjd/tests/chown/00.t         171    10   5.85%   36-37 68-69 83-84 141 145 149 153
    /pjd/tests/chown/02.t           5     1  20.00%   5
    /pjd/tests/chown/05.t          15     2  13.33%   11-12
    /pjd/tests/rename/01.t          8     1  12.50%   8
    /pjd/tests/rmdir/02.t           4     1  25.00%   4
    /pjd/tests/truncate/02.t        5     1  20.00%   5
    /pjd/tests/truncate/12.t        3     1  33.33%   2
    /pjd/tests/truncate/13.t        4     2  50.00%   2-3
    /pjd/tests/unlink/02.t          4     1  25.00%   4
    Failed 10/166 test scripts, 93.98% okay. 21/1724 subtests failed, 98.78% okay.

The above listing shows the tests that failed to comply with the POSIX filesystem standard. Most of the failed tests are because supplementary group permission checks are not included during file/directory access. While FUSE provides an interface to get the uid and gid of a user using fuse_get_context, it does not include supplementary group information. A mechanism to get supplementary group information for a user (maybe using /proc/tid/task/tid/status) needs to be built into each FUSE filesystem. This has not currently been implemented in RadFS. Other failures are due to a difference in error handling order between what is expected by pjd-fstest and what is done in RadFS. One example is the test for the ENAMETOOLONG return value for file or directory names longer than 255 characters. The test issues a filesystem call with a file name longer than 255 characters, but it does so for a non-existent file. RadFS checks existence before passing it on to the underlying filesystem and thus these tests return an ENOENT (file not found) error as opposed to ENAMETOOLONG (file/directory name too long).

5.3 Throughput and metadata operation benchmark using Bonnie++

To evaluate and compare RadFS with other filesystems, the Bonnie++ filesystem benchmarking tool is used. RadFS is backed by the EXT3 filesystem. Hence to get a measure of RadFS virtualization overhead on EXT3, Bonnie++ results for EXT3 are compared with RadFS using FUSE over EXT3. Since the virtual filesystems would be exported to virtual machines using NFS, it also helps to quantify performance of RadFS over NFS. But with the interface between FUSE and NFS still being unstable, the test parameters for Bonnie++ had to be modified to get results without Bonnie++ failing. Hence the metadata operations benchmark for RadFS over NFS was run on a set of 4000 files as opposed to 16000 for the other tests.
FUSE also uses the kernel page cache which further eliminates overhead. The EXT3 performance lagging behind NTFS-3G can be attributedto journalling.RadFS metadata operation benchmarks (Figure 5.2) show that the virtualization done by RadFS, requiring look-ups in the RadFS metadata structure needs tobe optimized. Numbers for metadata operations for NTFS-3G and EXT3 are absent because they are too large to quantify according to Bonnie++. A major reasonfor the low numbers for RadFS in the create, delete and read metadata operationsis because of the absence of any caching by RadFS in metadata look-ups. EachSequential Output Sequential Input51Metadata operation - Bonnie++C00a)U)a.U)ci)U-7000Set of 4000 filesfile operation requires the look-up of a disk path. This in turn requires the look-upof a RadFS metadata radix tree node representing the target file or directory overUNIX sockets. Caching needs to be done at the RadFS FUSE side as well as theRadFS metadata server side so that repeated look-ups of recently accessed files ordirectories are eliminated. The equivalent to a dentry cache should be built at theRadFS FUSE side so that disk paths are cached with cache invalidation following arename, delete or other metadata modification operations. Since file attributes areaccessed regularly, the data returned from the GET-STAT operation should also becached and invalidated as necessary. At the RadFS metadata server side, the equivalent of an mode cache needs to be built so that repeated metadata node look-ups,involving parsing the radix tree can be avoided. This should not only eliminateredundant look-ups but also unnecessary interprocess communication and associated context switching over UNIX sockets between the FUSE application and themetadata server.60005000400030002000iooo[ ISeq Create Seq Read Seq Delete Rand Create Rand Read Rand DeleteFigure 5.2: RadFS metadata operation benchmark•RadFSRadFS over NFSNTFS-3G•EXT3525.4 RadFS filesystem scalingRadFS can host multiple copy-on-write virtual filesystems that share a commonroot filesystem and this sharing is key to saving disk space. The program compiletest seeks to demonstrate scalability of RadFS in handling multiple, simultaneousvirtual filesystem requests. The test includes compiling the MPlayer video playeron a single virtual filesystem and then ramps the load up by having multiple virtual filesystems simultaneously hosting MPlayer compiles. With content hashing,the MPlayer package and extracted source occupies only the space needed by onecopy as all the virtual filesystems share it read-only through the content hash store.Any temporary and object files created during compile are created in each virtualfilesystem’s private read-write branch. But these also get shared in the contenthash store after hashing if they are identical. Since the MPlayer source is shared,the overhead of simultaneous compile should be set back by the use of a commonpage cache by the backing EXT3 filesystem. Figure 5.3 shows the results for theMPlayer compile test.It should be noted that the compilation process is CPU intensive and therewas CPU contention among the various ccl processes when the number of virtual filesystems performing simultaneous compile was increased to more than 4.The RadFS metadata server CPU utilization was dominated by the various cclprocesses’s CPU usage. Hence the scalability is CPU limited. 
For disk intensive or throughput oriented operations involving single files, RadFS performs on par with EXT3 as no metadata operations need to be performed after the file handle is obtained. Thus the only overhead is the context switches between RadFS FUSE and the kernel, which can also be minimized by increasing the block size of each transfer in a read/write system call.

[Figure 5.3: RadFS scalability. Compile time against 1 to 4 simultaneous virtual filesystems.]

Chapter 6

Conclusions

We have built a prototype of a virtual filesystem that saves disk space through persistent sharing of data using a copy-on-write approach and content hashing. It caters to the need in virtual machines for storage that is quickly and easily deployable, supports snapshots and the ability to base one virtual filesystem off another recursively. The copy-on-write (COW) approach aids the creation of multiple virtual filesystems (or virtual machine disks) based on a common root filesystem that is shared among all the virtual filesystems. COW provides very quick deployment and snapshots and is the initial disk space saving mechanism through read-only sharing. Further and continued disk space saving is provided by a content hashing module that maintains a content hash store which stores a single copy of any number of identical files across virtual filesystems.

RadFS throughput is on par with its backing filesystem EXT3 and, since FUSE provides an abstraction at the Linux kernel VFS level, any filesystem supported by VFS can be used as a backing filesystem for RadFS. RadFS thus virtualizes a VFS filesystem and extends it to provide copy-on-write and content hashing mechanisms. Its implementation at the file level, with the performance potential of NTFS-3G, is proof that it is possible to build a filesystem in userspace that provides various useful extensions without the fear of compromising kernel stability. This thesis shows that building a virtual filesystem with FUSE to extend existing filesystems to meet other requirements, without too much overhead, is definitely viable.

6.0.1 Future Work

RadFS is a filesystem that is 98.78% POSIX compliant (as measured by pjd) and has data throughput equivalent to that of EXT3 or, as mentioned before, equivalent to that of any backing filesystem supported by the Linux VFS. Metadata operations need more improvement, and so future work would involve providing metadata caching at the FUSE application side as well as the metadata server side. More optimization also needs to be performed to streamline the protocol used between the metadata server and the FUSE application. The FUSE-NFS interface is still nascent and once it is standardized, more NFS performance oriented optimizations need to be worked out. Making RadFS 100% POSIX compliant requires support for supplementary groups to be built into either FUSE or RadFS.

Bibliography

[1] Filesystem in userspace. http://fuse.sourceforge.net/.

[2] GNU libgcrypt reference manual. http://www.gnupg.org/documentation/manuals/gcrypt/.

[3] Microsoft single instance store. http://download.microsoft.com/download/0/c/a/0cad7d83-2ef5-498a-af51-7911a10175b0/SISJWP.doc, 2008.

[4] qcow. http://www.gnome.org/~markmc/qcow-image-format.html.

[5] Qemu. http://bellard.org/qemu/.

[6] Unionfs-odf. http://www.filesystems.org/unionfs-odf.txt.

[7] Unionfs - a stackable unification file system. http://www.filesystems.org/project-unionfs.html.

[8] P. J. Dawidek. Posix filesystem test suite. http://www.ntfs-3g.org/pjd-fstest.html.

[9] T. Garfinkel and M. Rosenblum.
When virtual is harder than real: Security challenges in virtual machine based computing environments. HotOS, 2005.

[10] S. Jain, F. Shafique, V. Djeric, and A. Goel. Application-level isolation and recovery with Solitude. EuroSys, 2008.

[11] D. T. Meyer, G. Aggarwal, B. Cully, G. Lefebvre, M. J. Feeley, N. C. Hutchinson, and A. Warfield. Parallax: Virtual disks for virtual machines. EuroSys, 2008.

[12] B. Pfaff, T. Garfinkel, and M. Rosenblum. Virtualization aware file systems: Getting beyond the limitations of virtual disks. NSDI, 2006.

[13] S. Quinlan and S. Dorward. Venti: a new approach to archival storage. FAST, 2002.

[14] S. Szabolcs. Ntfs-3g. http://www.ntfs-3g.org.

[15] S. Tweedie. Ext3, journaling filesystem, 2000.

[16] A. Warfield and J. Chesterfield. blktap - xen wiki. http://wiki.xensource.com/xenwiki/blktap, June 2006.
