Elephant: The File System That Never Forgets

by Douglas J. Santry

BSc, McMaster University, Hamilton, Canada, 1994

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Department of Computer Science)

we accept this thesis as conforming to the required standard

The University of British Columbia
August 1999
© Douglas J. Santry, 1999

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Computer Science
The University of British Columbia
Vancouver, Canada

Abstract

Modern file systems associate the deletion of a file with the immediate release of storage, and file writes with the irrevocable change of file contents. We argue that this behavior is a relic of the past, when disk storage was a scarce resource. Today, large cheap disks make it possible for the file system to protect valuable data from accidental delete or overwrite. This thesis describes the design, implementation, and performance of the Elephant file system, which automatically retains all important versions of user files. Users name previous file versions by combining a traditional pathname with a time when the desired version of a file or directory existed. Storage in Elephant is managed by the system using file-grain retention policies specified by users. This approach contrasts with checkpointing file systems such as Plan-9, AFS, and WAFL that periodically generate efficient checkpoints of entire file systems and thus restrict retention to be guided by a single policy for all files within that file system. Most file systems have no built-in support for recovering data.
Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgements
Dedication

1 Introduction

2 Related Work
  2.1 Explicit User Versioning Systems
    2.1.1 The Revision Control System
    2.1.2 ClearCase
  2.2 Versioning File Systems
  2.3 Checkpointing File Systems
    2.3.1 Plan-9
    2.3.2 WAFL
    2.3.3 Discussion

3 Background
  3.1 The BSD Kernel
    3.1.1 The Virtual File System and Vnodes
    3.1.2 The FreeBSD Buffer Cache
    3.1.3 The Virtual Memory Subsystem Page Cache
    3.1.4 The Buffer Cache

4 The Elephant File System
  4.1 Design Issues
    4.1.1 Not All Files are Created Equal
    4.1.2 Goals: Undo and Long-Term History
    4.1.3 Pruning the Long-Term File History
    4.1.4 Elephant Design Requirements
    4.1.5 Summary of Issues
  4.2 Design Overview
  4.3 File Retention Policies
  4.4 Keep One: No Versioning
  4.5 Keep Safe: Undo Protection
  4.6 Keep All: Complete Histories
  4.7 Keep Landmarks: Long-Term History

5 Prototype
  5.1 Overview of Versioning
  5.2 Meta Data
  5.3 Directories
  5.4 System Interface
  5.5 Storage Reclamation
  5.6 Tools
  5.7 Ramifications of the Implementation on Kernel Caches
    5.7.1 Ordinary Files
    5.7.2 Directories

6 Performance and Feasibility
  6.1 Prototype Performance
  6.2 Trace Studies
  6.3 File System Profile
  6.4 Summary

7 Implementation Alternatives
  7.1 Log Structured File Systems
  7.2 Journaling File Systems
  7.3 Write in Place File Systems
  7.4 Logical Disks

8 Conclusions and Future Work
  8.1 Future Work and Issues
  8.2 Version History Export and Import
  8.3 Disconnected Operation
  8.4 Version History Merge and Branch
  8.5 Application Defined Retention Policies
  8.6 Conclusions

Bibliography

List of Tables

5.1 Kernel interface for user-mode applications
5.2 Kernel interface for the cleaner process
6.1 Performance of basic file system operations
6.2 File distribution of 12-GB research-lab workstation home directories

List of Figures

5.1 Meta data: the IMap, inode logs, and the inode file
5.2 Directory structure

Acknowledgements

Boddingtons Beer of Manchester

Douglas J. Santry
The University of British Columbia
August 1999

To Mike Harris and the Conservative Dream

Chapter 1

Introduction

Disks are becoming ever cheaper and larger. Human productivity, however, remains constant. This affords system designers an opportunity to re-examine the way file systems use disk stores. In particular, the current model of user-controlled storage management may no longer be ideal.

In a traditional file system, users control what is stored on disk by explicitly creating, writing, and deleting files. The key weakness of this model is that user actions have an immediate and irrevocable effect on disk storage. If a user mistakenly deletes or overwrites a valuable file, the data it stores is immediately lost and, unless a backup copy of the file exists, lost forever.

Over the years, a main focus of file system research has been to protect data from failure. Excellent solutions exist to protect data from a wide variety of network, system, and media failures. The user of an appropriately configured file system can now rest assured that her valuable data is safe and available, protected from all forms of failure; all, that is, but failures of her own making.
As soon as the user modifies or deletes a file, none of the carefully engineered protections in the file 1 system can save her from herself or from the applications she runs. Some partial solutions and coping mechanisms do exist, but none adequately solves the problem. Today use of the Internet is very wide spread. This scenario has lead to a heightened awareness of security issues. Hackers employ root kits to delay the detection of their intrusions and hide their tracks after discovery. Root kits exploit the fact that file systems write over old data with the new data. This allows for access logs and connection logs to be purged of any trace of the intrusion. These logs are often key for the administrator to protect the system. Some early file systems such as Cedar provided a degree of protection from accidental overwrite, but not delete, by automatically retaining the last few versions of a file in copy-on-write fashion [25, 6, 18]. Limited storage space, however, meant that only a few versions could be retained. The choice of which version to prune was either left to the user or the oldest versions were deleted. In either case, valuable file versions could easily be lost. Personal computer operating systems provide a degree of protection from accidental delete, but not overwrite, using the "trash can" metaphor, which requires a two step process to really delete a file. The trash can, however, provides only limited undo capability. Files are also required to have unique names not just unique paths. Eventually storage becomes constrained and the trash can must be emptied. File deletions that occur shortly before the empty are afforded only a very limited period in which they can be undone. In most well-maintained file systems, off-line backup storage is used to protect users from system failures, media failures, and their own mistakes. Checkpointing file systems such as Plan-9, AFS, and WAFL build this support into the file system using copy-on-write techniques to create periodic file-system checkpoints [21, 9, 10]. 2 These checkpoints are available online and they allow users to access out-dated ver-sions that were captured by a checkpoint. Changes that occur between checkpoints (or backups), however, are not recoverable. Furthermore, the fact that checkpointing occurs at the granularity of the entire file system limits the number and frequency of checkpoints. The lack of file system support has required users and application developers to devise various coping mechanisms to protect their data. If a user wants to main-tain older versions of files she must explicitly make and maintain multiple copies. A savvy user tends to be conservative, making many copies of data and avoiding deletes whenever possible. File-editing applications often provide operation-grain undo capability and perform live editing in a copy of the original file that replaces the file in an atomic operation when the user commits her changes. Today, information is valuable and storage is cheap. It thus makes sense for the file system to use some of this cheap storage to ensure that valuable files are inever lost due to the failure of a user to make a copy (or make the right copy) before modifying them, or because of the accidental or uninformed deletion of a file that is in fact valuable. Furthermore, we believe that the amount of storage dedicated to files that are modified by users is growing slowly compared to the total amount of data stored in the file system. 
As the capacity of a single inexpensive disk approaches 50-GB, only a small fraction of this space wil l be occupied by files that require protection from user mistakes. The rest wil l be temporary, derived, and cached data that can be ignored, re-created, or.re-fetched if lost. There are many issues that a file system must address to address the above problems. The file system must protect the data without incurring a performance 3 penalty for file operations. Disks are getting larger but they are still finite. The file system should favour keeping important data as long as possible over replacing it with casual data. To make such a system useful users should be able to easily access and manage the multiple versions that result. A key issue to management of multiple versions is how versions are named. The names should be easy for users to use. Finally these names need to be communicated to the file system. In Elephant, old versions of files are automatically retained and storage is managed by the file system. Users specify retention policies for individual files, groups of files, or directories. The goal of Elephant is to allow users to retain important old versions of all of their files. User actions such as delete and file write are thus easily revocable by rolling back a file system, a directory, or an individual file to an earlier point in time. Verification of our ideas is conducted through the construction of a prototype system. The layout of the following chapters is outlined below: • Chapter 2 reviews the prior research and commercial systems that address the problem of data protection in file systems. The approaches of these systems vary and their individual drawbacks and tradeoffs are examined. • Chapter 3 provides the background required to understand the assumptions made by U N I X kernels about their file systems. This is important to under-stand the prototype and its drawbacks. • Chapter 4 discusses the design of the Elephant file system. The key strategy of the design is to separate storage management from the common file system operations. Block reclamation decisions are put off until the future when better decisions can be made. A user level process is used to decide when to 4 •reclaim disk blocks. • Chapter 5 presents the implementation of our Elephant prototype. Elephant was implemented in a FreeBSD kernel. The file system is is fully integrated into BSD's virtual file system framework. • Chapter 6 reports on the performance of our Elephant prototype. Standard benchmarks are provided as well as micro-benchmarks. The results are ex-plained and compared to the results of running the benchmarks on BSD's Fast File System. In addition to the performance of the prototype the feasi-bility of the system is examined in the context of file system trace statistics based on a live file system. • Chapter 7 examines some alternative implementation possibilities for an Ele-phant file system. Their relative strengths are compared using metrics ranging from ease of implementation to performance. • Chapter 8 summarizes the thesis and provides suggestions for future work. 5 Chapter 2 Related Work This chapter discusses prior work related to the thesis. The first section discusses systems which support versioning files but require explicit user intervention. These systems are not built into the file system but are interposed between the user and the file system. Section 2.2 reviews older file systems with similar aims to Elephant. 
Their solutions were limited by the expense of disk space when they were conceived. Newer systems are examined in section 2.3. These systems were built to meet dif-ferent needs than Elephant but they do support versioned files. Finally the chapter concludes with how this thesis is different from these other works. 2.1 Explicit User Versioning Systems 2.1.1 The Revision Control System The goal of keeping multiple versions of data automatically, compactly, and in an organized way is reminiscent of software version control systems [22, 29]. These systems are implemented by application programs running on top of a traditional 6 file system. Users checkout a version from a version-controlled repository, modify a local copy of that version in the file system, and then return the modified version to the repository, which compresses it with respect to older versions. In essence, the goal of Elephant is to extend this idea to all clients of the file system by moving the versioning semantics into the file system, while supporting the traditional file system interface, and thus freeing users from the details of version management. 2.1.2 ClearCase ClearCase [1] is a configuration management system. It uses the check-in/ check-out model to version files. It also supports versioning of directories and other file system objects. Users mount a clear case volume over a network via NFS. Branching and merging is strongly supported. Space is reclaimed by the user explicitly deleting branches or versions. ClearCase is not a file system, it supports its users on the server end by versioning file system objects in relational databases. It is built on top of the file system. 2.2 Versioning File Systems The idea of versioned files was first proposed for the Cedar file system from Xerox PARC [25, 8]. In Cedar, files were immutable; writing to a file produced a new version of the file and file names included a version number (e.g., filename! 10). A similar idea was found in the RSX, VMS [6], and TOPS-10/-20 [18] operating systems from Digital. The approach taken by these systems has two key limitations. First, the maximum number of file versions retained by the system was assigned as a per-file parameter; when this threshold was reached, the oldest version was deleted. Cedar 7 supports the user designating versions as landmarks and thus saving them from being recycled. However, the deletion of the oldest version is a poor heuristic for deciding which files are valuable. Interesting versions of files may be discarded while undesirable or less interesting versions still exist. Second, versioning did not apply to directories. Operations such as renaming a file, creating or destroying a directory, or, in some cases, deleting a file, were thus not revocable. 2.3 Checkpointing File Systems Several recent file systems have taken a different approach to. versioning. In sys-tems such as A F S [10], Plan-9 [21, 20], and W A F L [9] an efficient checkpoint of an entire file system can be created to facilitate backup or to provide users with some protection from accidental deletes and overwrites. A checkpoint is typically created and maintained in a copy-on-write fashion in parallel with the active file system.. The old version thus represents a consistent snapshot of the file system sufficient for creating a consistent backup while the file system remains available for modification by users. The snapshot also allows users to easily retrieve an older version of a file. 
2.3.1 Plan-9 Checkpointing in Plan-9 is tightly integrated into the functioning of the file system. The Plan-9 file system views the world as a three tiered cache. The first tier is physical memory, the second is disk and the final tier is W O R M optical platters. As data is created and gets older it migrates from core to the W O R M . The snapshots that reference the block are informed of the move. Once every 24 hours the file system is checkpointed. When checkpointed blocks are written to the old version of the block is moved to the W O R M drive. Snapshots are never thrown away; they live 8 forever on the W O R M . Old versions of data are available through the name space. Off the root of the file system is a directory for every snapshot that is available. They are named by the date they represent, e.g. /02-24-1995. Under each snapshot directory one can find an image of the file system as it existed at the time of the snapshot. 2.3.2 WAFL W A F L can take snapshots very frequently, as often as every 15 minutes. Snaphots are not kept indefinitely. They are recycled by the system and W A F L supports keeping 20 online. Administrators can arrange for a weekly snapshot and daily snapshots in addition to the periodic 15 minute snapshots. Files in snapshots are accessed individually, that is, each file has to be recovered one at a time. Directories contain hidden files named .snapshot. These contain snaphots of the files in the directory. There is no way to rollback the entire file system at once. 2.3.3 Discussion Checkpointing file systems have two major limitations. First, checkpoints apply to all files equally, but files have different usage patterns and retention requirements. While it is not feasible to retain every version of every file, it may be important to keep every version of some files. Unfortunately, this dilemma cannot be solved using a file system-grain approach to checkpointing. Elephant addresses this limitation using file-grain retention policies that can be specified by the user. Second, changes that occur between checkpoints cannot be rolled back. For instance, users of daily-checkpointing systems such as Plan-9 or A F S are as vulnerable as F F S users to losing all their morning's work in the afternoon, due to an inadvertent file deletion 9 or overwrite. Plan-9 and A F S cannot take checkpoints more frequently because of the amount of storage required by a checkpoint hence the second limitation is a result of the first. Elephant is different from the work discussed. It takes the approach of snap-shot file systems of building version support directly into the file system. Borrowing from revision control systems, Elephant allows users to determine which files are to be versioned and how often. This is done by specifying retention policies at the granularity of files instead of the whole file system. 10 Chapter 3 Background 3.1 The BSD Kernel To understand the decisions made while implementing Elephant one needs to un-derstand the support provided to file systems by the FreeBSD kernel. The services provided by the system fall into two general areas. These are the buffer cache/virtual memory system and the virtual file system interface/vnode interface. The two are inextricably intertwined. 3.1.1 The Virtual File System and Vnodes As U N I X gained more features the need to support multiple types of file systems eventually arose. For instance, the original release of System V could not support both F F S and NFS in the same kernel. 
The central problem was in the routine namei. A l l interactions with the file system start with looking up a name. The routine responsible for this is called namei. The original namei understood the on-disk format of directories and expected the 11 file system to be backed by a local disk. Crossing mount points doing lookups was supported but the types of the file systems were always the same, s5fs. The result of a name lookup was a s5fs-inode. The kernel assumed s5fs modes everywhere (paging, user I/O). With the advent of NFS and CD-ROMs, to name but a few useful file systems, this was a problem. The Virtual File System The solution was the Virtual File System [12]. VFS is an object oriented file system interface. To implement a file system in the kernel one inherits from the abstract class that abstracts a file system. The class includes a number of pure virtual functions. The functions represent actions which can be taken on the file system. A second abstract class was created to replace the use of inodes in kernel subsystems. This second class was a vnode and included pure virtual functions to perform actions on them. Namei was purged of all knowledge of the specific file system s5fs. It instead used vnodes and invoked methods on them. The original namei became the lookup method for s5fs vnodes. The kernel maintains two important data structures to support this. One is the root vnode of the entire name space and the second is a list of all the file systems currently mounted. The process of looking for a name is now consistent for a name space com-prised of heterogeneous file systems. For an absolute path, namei calls lookup on the root vnode with next component as the argument. If it exists a new vnode will be returned. If there is a file system mounted there then namei looks it up and asks it for its root vnode. If it is an ordinary directory the returned vnode is used. Its lookup method is then used for the next component. This is repeated until the path 12 is resolved. The path name may cross a number of mount points and file system types and namei handles it all seamlessly. When creating a file system, developers inherit from the V F S and vnode classes adding new data, fields and implementing the methods that make sense for their file system. Vnodes A l l interactions with a file revolve around its vnode. To access a file one calls namei which wil l provide the caller with the vnode for the file. A l l users of a particular file share the same vnode, that is, for a given device number and inode number there is only one vnode. To manage this vnodes include support for locks and reference counts. When the reference count on a vnode reaches zero its pre-destructor method is called. The vnode is then placed on the free list but remains associated with its file system until the point of recycling. When the vnode is recycled its destructor method is called and it is then associated with the new file. The process of recycling includes purging the buffer cache of any trace of the vnode. If it is needed before it is recycled it is taken of the free list. Files used after their vnodes have been recycled will be assigned new vnodes taken off the free list. 3.1.2 The FreeBSD Buffer Cache FreeBSD has a unified VM/buffer cache. The cache has two levels of organization. Programs request data explicitly through the buffer cache. Paging bypasses the buffer cache and deals directly with the V M subsystem. 
Blocks can exist in the V M cache without being in the buffer cache but the reverse is not true. If a block is 13 in the buffer cache it must be in the V M cache too. The set of pages in the buffer cache is a subset of those found in the V M cache. 3.1.3 The Virtual Memory Subsystem Page Cache The first time an I /O is performed on a vnode a vm_object is associated with the vnode. This vm_object wil l contain all in core pages for this vnode regardless of the buffer cache state. Each vnode is associated with precisely one vm_object and vice-versa. A physical page can only be in one vm_object at a time. Pages are added to the vm_object by the buffer cache or the paging system. When I /O is done on a vnode , either from a page fault or read/write on a file, physical pages are allocated. These pages are inserted into the vm_object and their offsets into the vnode are recorded. When a page is required again the offset into the object is used to find it. The page is inserted into the object before it is filled hence it is also locked. After the page is filled it is unlocked and processes waiting for it are woken up. This prevents race conditions and multiple processes attempting to bring the same page simultaneously. 3.1.4 The Buffer Cache Paging operations go through the V M interface in FreeBSD. Explicit program ac-cesses to the file system, that is, file descriptor based I /O , still uses the original buffer cache as described in [3]. The buffer cache uses the same pages that are in the vm.object but encapsulate the pages with a buffer header. A l l pages that must be filled or written, no matter which subsystem they originated in, must be encapsulated by a buffer header before being presented to the backing store of the vnode. The header contains information which is required 14 by the device and also by the file system. This information includes the device.ID of the device, the logical and physical address and the size of the buffer. Blocks containing meta-data need to be read by the file systems. For the file systems to read the meta-data in blocks the pages containing them must be mapped. Pages are only mapped when they are in the buffer cache. Pages that are in the vm_object only are not mapped in kernel space. Buffers map the pages they encapsulate for access by the kernel. There is a region of kernel virtual memory assigned to each buffer when it is created. The buffer contains the base address of the region as well as its length. In FreeBSD 2.2 this length is 64k. Note, it is 64k of virtual memory not physical memory so buffers containing blocks for NFS which are 8k only have pages mapped in the first 8k. As in the original buffer cache, buffers are chosen for replacement by the buffer cache sub-system by the least-recently-used algorithm. However, only the buffer headers themselves are recycled. When a buffers time has come to be reused, the physical pages it is associated with are left in the vm_object, they are merely unmapped in the kernel. The pagedaemon is the entity which steals pages. The buffer's virtual address space for buffer pages remains constant through the life of the buffer. Buffer pages are mapped and unmapped as needed in the buffer's virtual memory region. This scheme allows all pages in the system to compete for existence. FreeBSD can adjust to different load conditions such as I /O or large processes. What is decided at compile time is the number of buffer headers the kernel is allowed to create hence the need to recycle them. 
Pages associated with buffers are not eligible to be stolen. Buffers are tracked on hash queues for quick access. They are hashed on their logical block number (offset into vnode) and the vnode's virtual address (pointer). 15 Device numbers and inode numbers are not used. Since vnodes can represent many different files over the course of their life when they are recycled there can be no trace of them in the cache when they are recycled. Their vm_object is cleaned and deallocated and all buffer's in the buffer cache are relinquished. Even if the pages containing the data are not required by the system they must be freed and made anonymous. This is undesirable but necessary as the vnode/blkno pair would find blocks which are in another file after recycling. This could easily be fixed by using inode numbers and device ID instead. As will be seen, this 'bug' is crucial to the operation of our prototype. 16 Chapter 4 The Elephant File System The design of the Elephant file system is based upon assumptions of how files ought to be treated by a file system. Not all files have the same retention requirements. Here we discuss the different classes of files Elephant manages and their treatment. This leads to a design for Elephant. 4.1 Design Issues A file system that protects users from their mistakes must separate storage manage-ment from the common file-system operations available to users (e.g., open, write, close, and delete). Deleting a file must not release its storage and file updates must not overwrite existing file data. To achieve this goal, the file system must retain sufficient information to be able to reconstruct old versions of files and directories. To undo an update, the previous content must be retained. To undo a delete, both the file's name and content must be retained. It is obviously not feasible, however, to retain every version of every file. The system or user must decide what historic data can be discarded and when it should 17 be freed. The key question is what are the respective roles of the system and the user in making this decision? There are two competing factors to consider. First, increasing direct user control of reclamation tends to decrease the protection the user has from her mis-takes. Storage reclamation is, by definition, an irreversible process. On the other hand, if the system controls reclamation, how do we ensure that the decisions it makes respect user needs? Only the user knows which versions of her data are important. Solving this dilemma requires a compromise between the competing needs of user protection and user control. To understand how best to strike this compromise it is useful to review in more detail the different types of data users store in a file system, the different ways they access this data, and how these factors impact the question of what sort of history information the file system should retain. The remainder of this section summarizes our observations taken from several UNIX file system traces [19, 7, 23]. The observations are based on the behaviour of users. It was postulated that users show intense interest in a particular set of files for a period of time and then ignore them. This pattern of behaviour was indeed observed in the file system traces. 4.1.1 Not All Files are Created Equal The first key observation is that a file system stores many different types of files that each require different levels of versioning. The following simplified taxonomy demonstrates these differences. 
• Read-only files such as application executables, libraries, header files, etc. have no version history. 18 • Derived files such as object files require no history or undo capability as they can be regenerated from their original sources. • Cached files such as those maintained by a web browser require no versioning or history. • T e m p o r a r y files have short lifetimes and, while they may benefit from short-term histories for undo purposes if they are modified by users, long-term his-tories are not necessary. • User-modified files may require histories, but even these files need varying degrees of versioning. Clearly, a variety of file retention policies are required. User control can be achieved by allowing users to associate a policy with each of their files. User protec-tion can be assured by placing the mechanism that implements these policies inside of the file system. In this way, a user can indicate that versioning is not required for some files and, for other files, the user can indicate what type of versioning is re-quired. The next key question is thus what different types of versioning are possible and useful? 4.1.2 Goals: Undo and Long-Term History The goal of protecting users from their mistakes is really two related goals: (1) providing users the ability to undo a recent change and (2) maintaining a long-term history of important versions. The two differ in the amount of time they cover and in the completeness of the history they retain. Undo requires that a complete history be retained, but only for a limited period of time. The history must be complete, because there is typically no good 19 way to predict which of her actions a user might later want to undo. Limiting the amount of storage needed to support undo thus requires restricting the period of time in which an undo is possible. The system might, for example, allow the user an hour, a day, or a week to realize that an operation was in error. Within that interval, the user can reverse any delete or overwrite. Once that second-chance interval passes, however, changes become permanent. Long-term history serves a different purpose. Once the undo period has passed, it may still be necessary to retain certain important versions of a file. It is typically not appropriate or useful to retain every version of the file. Instead, users select key landmark versions to be retained long term. In an RCS system, for example, users select landmark versions by checking them into the repository. Intermediate changes the user makes in their working directory are not retained by RCS. These two goals are closely related. It may be useful to have intermediate regimes in which some undo's become impossible, but more versions are retained than just the landmark versions. Additionally, the importance of a landmark version may degrade over time to the point where some old landmarks can be deleted. 4.1.3 Pruning the Long-Term File History The final remaining issue is how to prune the long-term history and how to identify landmark versions. To understand this issue it is helpful to review how users use current file systems and how increasing disk storage capacity impacts the decisions they make. As stated previously, users cope with the fragility of current file systems by making frequent copies of important data. Data that is changed frequently is 20 copied frequently. Furthermore, users are reluctant to delete files, because doing so permanently removes them from the file system. 
Ironically, as storage becomes larger, it becomes more difficult for users to manage the versions they create. When storage is fairly constrained, users are re-quired to frequently assess the files that they are maintaining, and to delete those that are no longer necessary. Typically, the ability of the user to make this assess-ment effectively deteriorates over time. After a few weeks or months, the user is unlikely to remember why certain versions are being maintained. While there may be value in maintaining these old versions, it becomes more and more difficult to make sense of them. When it becomes necessary to delete something, what strategy should the user employ? Unless they have carefully remembered what version is what, their probable strategy is to delete the oldest versions. This, however, is often the wrong strategy because a version history typically contains certain landmark versions sur-rounded by other versions whose time frame of interest is much shorter. Unfortu-nately, the user may have no good way to tell which old version is important. Any solution that relies solely on the user to identify landmark versions is problematic, because failure to identify an important version as a landmark can result in the loss of important data. It is thus important for the file system to assist the user in detecting landmark versions. Fortunately, it is often possible to detect landmark versions by looking at a time line of the updates to a file. In our study of U N I X file-system traces, we have seen that for many files, these updates are grouped into short barrages of edits separated by longer periods of stability. A good heuristic is to treat the newest version of each group as a landmark, in addition to any landmarks explicitly 21 identified by the user. 4.1.4 Elephant Design Requirements There are a number of possible designs for Elephant. The ideas in Elephant do not impose any demands on the on-disk layout of the implementation. Nor does it impose any contracts regarding ordering or synchronization. This implies a number of different solutions are possible from the file system spectrum which ranges from log structured file systems to traditional write in place systems. No matter what the design, all file systems that are to be used to build an Elephant system should have certain characteristics. All designs should share these characteristics. These qualities comprise the Canonical Elephant Axioms. • No Write in Place: Elephant does not destroy previous contents of files with new data. To support this efficiently an Elephant file system cannot write in place. • File Version Records: Files have multiple versions, these versions need to be tracked so they can be retrieved. The file system will have to support this directly. • Temporal Parameter Elephant names the versions of a file by time. The file system will need to index its records of versions with time. Any system which contains the characteristics of Canonical Elephant can be used to build an Elephant system. 22 4 .1.5 Summary of Issues In summary, the following issues are key to the implementation of a file-system storage management mechanism that protects users from their mistakes. • Storage reclamation must be separated from file write and delete. • Files require a variety of retention policies. • Policies must be specified by the user, but implemented by the system. • Undo requires complete history for a limited period of time. • Long-term histories do not retain all versions. 
• The file system must assist the user in deciding what versions to retain in the long-term history. 4.2 Design Overview This section describes the design of the Elephant File System and the file retention policies it uses to manage storage. The key design principle of Elephant is to separate storage management from the common file system operations available to users (e.g., open, write, close, and delete). The file system, not users, handles storage reclamation. Deleting a file does not release its storage and file writes are handled in a copy-on-write fashion, creating a new version of a file block when it is written. Files versions are named by combining the file's pathname with a date and time. Versions are indexed by their creation time (i.e., the file's modification time) and versioning is extended to directories as well as files; only the current version of 23 a file can be modified. The system interprets the pathname-time pair to locate the version that existed at the specified time. A file version thus does not have a unique name; any time in the interval in which a version was current can be used to name the version. By rolling a directory back in time, a user can see the directory exactly as it existed at that earlier time, including any files that the user subsequently deleted. Delete thus becomes a reversible name-space management operation. Storage is managed by the system using retention policies associated with each file. Users can specify policies for a file, a group of files, or for a directory; newly created files by default inherit the policy of the directory in which they are created. Periodically, a file system cleaner process examines the file system and uses these policies to decide when and which disk blocks to reclaim, compress, or move to tertiary storage. This approach allows users to indicate which files require versioning and what sort of versioning they require. As a result, the majority of files can be handled by keeping a single version with no history, as is done today, while providing detailed history information for only the files that require it. The key points of the design are thus summarized as follows: • The system, not the user, manages storage. • Files are copy-on-write and delete is a reversible name-space change. • File versions are named by any time when they were current. • Users specify file-grain retention policies. • A background cleaner process reclaims storage according to these policies. 24 4.3 File Retention Policies We have identified the following four file retention policies for Elephant and have implemented these policies in our prototype. • Keep One • Keep A l l • Keep Safe • Keep Landmarks Keep One provides the non-versioned semantics of a standard file system. Keep A l l retains every version of the file. Keep Safe provides versioning for undo but does not retain any long-term history. Keep Landmarks enhances Keep Safe to also retain a long-term history of landmark versions. The remainder of this section discusses the details of these policies. 4.4 Keep One: No Versioning Keep One provides semantics identical to a standard file system, retaining only the current version of a file. Deleting a Keep-One file removes the file immediately. Writing a Keep-One file updates the file in place; copy-on-write is not used. A n important property of Keep One is that it is the only policy that allows users to directly control storage reclamation. There are many classes of files that this policy suits well. 
Files that are unimportant (e.g., files in /tmp, core files, etc.) or files that are easily recreated (e.g., object files, files in a Web browser cache, etc.) are good candidates for Keep-One retention. We anticipate, however, that most files that users modify directly, even temporary files, will have at least Keep Safe semantics.

4.5 Keep Safe: Undo Protection

Keep Safe protects users from their mistakes but retains no long-term history. The policy is parameterized by the length of the second-chance interval. The system guarantees that all file-system operations can be undone for a period defined by this parameter. A second-chance interval of seven days, for example, means that the user can roll back any change that occurred within the past seven days. Any change that occurred more than seven days ago, however, becomes permanent.

Implementing Keep Safe requires that every version of a file be retained until it is no longer needed for rollback. Notice that simply retaining a version until it is older than the second-chance interval is not sufficient. Instead, a version must be retained until a younger version, or the file's deletion, becomes permanent. For example, if a file with a year-old version is modified or deleted, the year-old version must be retained until the second-chance interval for that modification or deletion expires.

4.6 Keep All: Complete Histories

The Keep All policy provides complete version histories of files. Files with this policy will never have a version deleted. For most files this policy would retain too much information and the cost could become quite high. However, some files, such as mail logs, would benefit from it.

4.7 Keep Landmarks: Long-Term History

Files that store important user information often require histories. For these files, in addition to the protections of Keep Safe, the system also maintains a long-term history. Keep Landmarks retains only landmark versions in a file's history; non-landmark versions are freed as necessary to reclaim storage. The key issue, however, is how to determine which versions are landmarks. We follow a combined approach that provides users with direct control, but also seeks to protect users from their mistakes. A user can specify any version as a landmark. In addition, the system uses a heuristic to conservatively tag other versions as possible landmarks. The cleaner frees only versions that the Keep-Landmarks policy determines are unlikely to be landmarks.

This landmark-designation heuristic is based on the assumption that as versions of files get older without being accessed, the ability of the user to distinguish between two neighboring versions decreases. For example, we might designate every version of a file generated in the past week as a landmark. For versions that are a month old, however, we might assume that versions generated within one minute of each other are now indistinguishable to the user. If so, we can designate only the newest version of any such collection of versions to be a landmark, possibly freeing some versions for deletion.

Freeing non-landmark versions introduces discontinuities in a file's history. A user may yet request a freed version of the file. The presence of these discontinuities is important information to the user. We thus retain information about freed versions of a file in its history and allow the user to examine the history for discontinuities. This information is important, for example, for the user to roll back a set of files to a consistent point in the past. The user can only be certain that the specified point is consistent if the histories of all files are continuous at that point in time.
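Before turning to file grouping, the sketch below illustrates how a cleaner might apply these two rules when deciding whether a particular version can be reclaimed. It is a minimal sketch only: the thesis does not give this code, and the structure and function names are hypothetical stand-ins for the policy modules described in Chapter 5.

    /*
     * Illustrative Keep-Safe / Keep-Landmarks reclamation test.
     * Assumptions: versions[] is ordered oldest to newest, and a deletion,
     * if any, is recorded as a final entry whose mtime is the delete time.
     */
    #include <time.h>

    struct version {
        time_t mtime;        /* when this version was created (its last-modified time) */
        int    is_landmark;  /* set by the user or by the landmark heuristic           */
    };

    /* A version may be reclaimed under Keep Safe only once a younger version
     * (or the file's deletion) has itself become permanent, i.e. is older
     * than the second-chance interval. */
    static int
    keep_safe_reclaimable(const struct version *v, int n, int i,
                          time_t now, time_t second_chance)
    {
        if (i + 1 >= n)                   /* the current version is never reclaimed */
            return 0;
        return (now - v[i + 1].mtime) > second_chance;
    }

    /* Keep Landmarks adds one more condition: landmark versions are never freed. */
    static int
    keep_landmarks_reclaimable(const struct version *v, int n, int i,
                               time_t now, time_t second_chance)
    {
        return !v[i].is_landmark &&
               keep_safe_reclaimable(v, n, i, now, second_chance);
    }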
Finally, the Keep-Landmarks policy allows the user to group files for consideration by the policy. Grouping is important for inter-dependent files whose consistency requires viewing all files as of the same point in time. When this type of inter-dependency exists, the files are inconsistent if a discontinuity exists in any of the files, because the missing version is required for the consistent view. Keep-Landmarks solves this problem by ensuring that no file in a group has a discontinuity that overlaps a landmark version of any other file in the group.

Chapter 5

Prototype

We have implemented a complete prototype of Elephant in FreeBSD 2.2.8, which is a freely available version of BSD for personal computers. Elephant is fully integrated into the FreeBSD kernel and uses BSD's VFS/vnode interface. The standard UNIX file interface is augmented with a new API to access the advanced features of Elephant, but all UNIX utilities work unmodified on current as well as older versions of files.

This chapter describes the design and implementation of our prototype, detailing the file system's meta data, versioned directory structure, enhanced kernel interface, storage reclamation mechanism, and user-mode tools.

5.1 Overview of Versioning

Elephant versions all files except those assigned the Keep-One policy. Disk blocks of versioned files are protected by copy-on-write and only the current version of a file can be modified. Keep-One files can be updated in place, following the same procedure as a standard UNIX file system.

A version is defined by the updates bracketed between an open and a close. The first write to a versioned file following an open causes its inode to be duplicated, creating a new version of the file. The first time an existing block is written after an open, the modified block is given a new physical disk address and the new inode is updated accordingly. All subsequent writes to that block before the close are performed in place. When the file is closed, the inode is appended to an inode log maintained for that file. This inode log constitutes the file's history and is indexed by the time each inode in the log was last modified. Concurrent sharing of a file is supported by performing copy-on-write on the first open and updating the inode log on the last close.

To economize disk space, files that have no history are stored in a standard inode instead of an inode log. The system adaptively switches between an inode and an inode log as necessary. For example, a newly created Keep-Safe file is stored in a single inode. When the file is modified, an inode log is created for the file to store both the original and new versions of the file. When the current version is sufficiently old, the cleaner frees all historic versions and the system frees the file's inode log and replaces it with a single inode.
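The sketch below restates this open/write/close lifecycle in code. It is an illustration only, assuming hypothetical types and helper routines (dup_inode, cow_block, and so on); these are stand-ins, not the prototype's actual kernel interfaces.

    /* Sketch of the per-open version lifecycle of Section 5.1. */
    #include <stddef.h>

    struct inode;                                       /* on-disk inode, opaque here        */

    extern struct inode *dup_inode(struct inode *);     /* duplicate the current inode       */
    extern int  block_written_this_open(struct inode *, long);
    extern void cow_block(struct inode *, long);        /* give the block a new disk address */
    extern void write_block(struct inode *, long, const void *, size_t);
    extern void append_to_inode_log(struct inode *);    /* the log is indexed by mtime       */

    struct open_file {
        struct inode *cur;        /* inode of the version being written           */
        int           versioned;  /* has this open already created a new version? */
    };

    void
    eleph_write(struct open_file *f, long lblkno, const void *buf, size_t len)
    {
        if (!f->versioned) {                       /* first write after the open:       */
            f->cur = dup_inode(f->cur);            /* duplicate the inode (new version) */
            f->versioned = 1;
        }
        if (!block_written_this_open(f->cur, lblkno))
            cow_block(f->cur, lblkno);             /* first touch: copy-on-write        */
        write_block(f->cur, lblkno, buf, len);     /* later writes go in place          */
    }

    void
    eleph_close(struct open_file *f)
    {
        if (f->versioned)
            append_to_inode_log(f->cur);           /* close finalizes the new version   */
    }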
5.2 Meta Data

Traditional file systems have one inode per name (but the reverse does not hold). Files can thus be uniquely and unambiguously named by their inumber, an index to their inode's disk address. Elephant departs from this model, because files can have multiple inodes, one for each version of the file. To maintain the important naming properties of inumbers, Elephant redefines inumbers to index an IMap, similar to LFS [24].

The IMap

As depicted in Figure 5.1, the IMap provides a level of indirection between an inumber and the disk address of a file's inode or inode log. There is an entry in the IMap for every file. Each entry is 16 bytes, storing a pointer to the file's inode or inode log, a temperature, a policyGroup, and a set of one-bit flags. The IMap is stored on disk and cached in memory; its disk address is recorded at a fixed location on disk.

Figure 5.1: Meta data: the IMap, inode logs, and the inode file.

The meta-data pointer can have one of two meanings. It is either an offset into the inode file or a disk address of an inode log. Elephant checks the flags field of the IMap entry to determine the pointer's meaning. The inode file and inode logs are described in detail below.

Temperature is a heuristic used to guide the system cleaner to files likely to have the most reclaimable storage. A file's temperature is recomputed by its policy module when the cleaner examines the file and when the file is closed. The temperature encodes a value and an expiration date. The system artificially raises the value of an expired temperature to ensure that the cleaner eventually re-examines the file. The cleaner is described in Section 5.5.

Finally, policyGroup organizes files that are considered as a group by their storage retention policy, as described in Section 4.7. The IMap entries for a group's files are linked into a circular list; policyGroup stores the inumber of the next file in the group.

Non-Versioned Inodes

Elephant stores inodes in a distinguished file called the inode file, unlike traditional UNIX file systems, which pre-allocate inodes at fixed disk locations. This is similar to the scheme found in the Episode file system [4]. The inode file itself is stored in an inode log, not an inode, in order to avoid circularity. Our approach has the advantage that inode storage space need not be pre-allocated when the file system is initialized. The inode file can grow and shrink as easily as any other file. A new inode is allocated by picking a random entry in the inode file and scanning forward until a free inode is found. If a free slot is not found after scanning a bounded number of inodes, a new block is added to the inode file and the inode is allocated there. The inode file can be compacted by moving inodes to create empty blocks and then removing those blocks from the file.

Inode Logs

Inode logs are allocated and named at the granularity of disk blocks and they thus consume considerably more storage than a single inode, which consumes only 168 bytes. An inode log stores the versions of a file as an ordered list of inodes. If the log spans multiple blocks, the blocks are linked together in descending chronological order, with the block containing the current inode at the head of the list. The file's IMap entry stores the address of this block and the block stores the offset of the current inode. Reading the inode of the current version of a file is thus as efficient as reading the inode of a non-versioned file in Elephant or any standard UNIX file system. Older versions are located by a linear scan of the log.

The inode log also uses inodes to record information about reclaimed versions. Recall that we retain information about reclaimed versions so that users can detect discontinuities in a file's history. To do this, the system replaces reclaimed inodes with an inode that describes the discontinuous interval they leave behind. Similarly, when a file is deleted, an inode is added to the end of the log to record the delete time. The delete time is needed by the Keep-Safe policy, for example, to determine when the delete becomes permanent. The cleaner knows the file was deleted because the final inode will have a link count of zero. The modification time of this inode is the time the file was deleted.
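To make this layout concrete, the declarations below sketch one possible shape for these structures. The thesis fixes the IMap entry at 16 bytes and an inode at 168 bytes, but the exact field widths and names shown here are assumptions made for illustration.

    /* Illustrative declarations for the meta data of Section 5.2. */
    #include <stdint.h>

    #define IMAP_INODELOG 0x1      /* meta-data pointer names an inode-log block;      */
                                   /* otherwise it is an offset into the inode file    */

    struct imap_entry {            /* one per file, indexed by inumber                 */
        uint32_t meta;             /* inode-file offset or inode-log disk address      */
        uint32_t temperature;      /* cleaner hint: a value plus an expiration date    */
        uint32_t policy_group;     /* inumber of the next file in this policy group    */
        uint32_t flags;            /* IMAP_INODELOG and other one-bit flags            */
    };                             /* 16 bytes, as in the prototype                    */

    struct inode_log_block {       /* log blocks are linked newest-first               */
        uint32_t next;             /* disk address of the next (older) block, 0 if none */
        uint16_t current_off;      /* offset of the current version's inode            */
        uint16_t ninodes;          /* number of inodes stored in this block            */
        /* followed by the inodes themselves, newest first; reclaimed versions and
         * a file's deletion are recorded by placeholder inodes as described above */
    };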
5.3 Directories

Figure 5.2: Directory structure.

Directories map names to inumbers. They differ in their usage from files in that files are explicitly opened and closed, whereas directory modifications are implicit side effects of other system calls. For this reason Elephant handles the versioning of directories differently from that of ordinary files. Elephant directories store versioning information explicitly, and so are implemented using standard inodes, not inode logs. Each directory entry stores a name's creation time and, if deleted, its deletion time. It is possible for multiple instances of the same name to co-exist in a directory, provided that no two of them existed at the same time. A name remains in a directory as long as at least one version of the file it names remains in the file system.

Directory layout is depicted in Figure 5.2. Initially all directory entries are stored in a single inode. If a directory accumulates a sufficiently large number of deleted names, a second inode is allocated to store the deleted names. As necessary, the system periodically moves deleted names from the active inode to the history inode and compacts the active inode. A name lookup on the current version of a directory scans only the active inode, while a lookup on a past version of the directory scans both inodes. This partitioning thus ensures that the history information stored in a directory does not significantly slow the common-case operation of looking up a name in the current version of the directory.

An alternative to keeping versioning information in directories would have been to treat directories in the same fashion as versioned files. The result, however, would be wasteful of inodes and data blocks, because each name creation or deletion would require a new data block and thus a new inode for the directory. Copying a large directory, for example, would create a new directory inode for every name in the directory.

setCurrentEpoch(timestamp)
getHistory(file) => history
setLandmark(file)
unsetLandmark(file)
setPolicy(file, policyID)
groupFiles(fileA, fileB)
ungroupFile(file)

Table 5.1: Kernel interface for user-mode applications.

5.4 System Interface

Elephant allows users to add a timestamp tag to any pathname they present to the file system. If this tag is present, Elephant accesses the version of the file that existed at the specified time. If a user types "cd .@today:11:30", for example, her working directory is changed to the version that existed at 11:30 of the current day. If a timestamp tag is not specified, the selected version is determined by either the timestamp of the current working directory, if a relative pathname is specified (e.g., "file"), or the process's current epoch, if a complete pathname is specified (e.g., "/file").
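As a preview of the calls listed in Table 5.1 and described below, a user-level program might use the interface roughly as follows. The call names come from the table, but the C prototypes, return types, and the example file names are assumptions made for illustration; the thesis does not give a C binding.

    /* Hypothetical use of the Elephant user-mode interface (Table 5.1). */
    #include <fcntl.h>
    #include <time.h>
    #include <unistd.h>

    /* Assumed prototypes; the thesis specifies only the call names. */
    extern int setCurrentEpoch(time_t timestamp);
    extern int setLandmark(const char *file);
    extern int groupFiles(const char *fileA, const char *fileB);

    int
    main(void)
    {
        /* Mark the current version of a draft as a landmark and tie it to the
         * figure file it depends on, so neither loses the version the other needs. */
        setLandmark("paper.tex");
        groupFiles("paper.tex", "figures.eps");

        /* Roll this process's view of the file system back one hour; an ordinary
         * open() of a complete pathname now resolves to the version that existed
         * at that time. */
        setCurrentEpoch(time(NULL) - 3600);
        int fd = open("/home/user/paper.tex", O_RDONLY);
        /* ... read the hour-old version ... */
        if (fd >= 0)
            close(fd);
        return 0;
    }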
Users can change a process's current epoch as described next, and child processes inherit their parent's current epoch when they are created.

Table 5.1 shows the seven new system calls added for Elephant. The setCurrentEpoch operation changes the current epoch of the calling process. The getHistory operation returns a digest of the inodes in a file's inode log, including key fields such as the timestamp of each inode. The setLandmark operation designates the specified file version to be a landmark so that the Keep-Landmarks policy will never remove the version; unsetLandmark removes this designation. The setPolicy operation assigns a retention policy to a file. The groupFiles operation places fileA and fileB in the same group for treatment by their retention policy, as described in Section 4.7. The new group contains the transitive closure of any files previously grouped with either of the two files. The ungroupFile operation removes file from its group, if one exists.

mapIMap(pathname) => mntPt
lockFile(iNumber)
unlockFile(iNumber)
readBlock(mntPt, block)
writeBlock(mntPt, block)
freeBlock(mntPt, block)

Table 5.2: Kernel interface for the cleaner process.

5.5 Storage Reclamation

Storage reclamation is handled by a system cleaner that implements the per-file retention policies described in Section 4.3. The cleaner is a privileged and trusted user-mode process that uses a customized kernel interface to interact with the Elephant file system.

Table 5.2 details the kernel interface used by the cleaner. The mapIMap call maps the inode map for the Elephant file system named by pathname into the cleaner's virtual address space. It also returns mntPt, the file system's mount-point id, which the cleaner uses on subsequent calls to the kernel. The mount-point id is necessary to allow multiple Elephant file systems to co-exist on one system. The lockFile and unlockFile operations lock and unlock the selected file's inode log; no other process is permitted to operate on the file while the cleaner has it locked. The cleaner reads and synchronously writes the inode log directly using the readBlock and writeBlock calls; blocks are named by mount point and relative disk address. Finally, freeBlock frees the specified physical block.

The cleaner process runs in the background, scanning the memory-cached IMap for a high-temperature file. When it selects a file, it calls lockFile and then uses readBlock to read the file's inode log into memory. It then scans the log to build a binary tree of the physical blocks allocated to the file and counts the number of inodes that reference each block. If the policyGroup field in the IMap indicates that the file is part of a policy group, it repeats this process for every file in the group. To avoid deadlock, if the cleaner encounters a file that is already locked, it immediately releases all of its locks and skips the group.

The cleaner proceeds by invoking the file's policy to pick zero or more inodes for deletion. It deletes each inode by updating the in-memory inode log and decrementing the block-list reference counts for each of the inode's blocks. Finally, it calls writeBlock to write the modified inode log to disk, calls freeBlock to free any block in the block list with a reference count of zero, and calls unlockFile to release its lock on the file's inode log.
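The sketch below condenses this cleaning pass into code. The call names are those of Table 5.2, but their exact C prototypes and all of the helper routines are assumptions made for illustration; policy-group handling and the deadlock-avoidance rule described above are omitted for brevity.

    /* Sketch of the cleaner's main loop (Section 5.5). */
    struct inode_log;            /* opaque here: a file's list of version inodes */
    struct block_refs;           /* opaque here: per-block reference counts      */

    /* Kernel interface of Table 5.2; exact C prototypes are assumed. */
    extern int  mapIMap(const char *pathname, void **imap, int *nfiles);
    extern void lockFile(int inumber);
    extern void unlockFile(int inumber);

    /* Hypothetical helpers standing in for the steps described in the text. */
    extern int  pick_hottest_file(void *imap, int nfiles);
    extern struct inode_log  *read_inode_log(int mntpt, void *imap, int inumber); /* via readBlock  */
    extern struct block_refs *count_block_refs(const struct inode_log *log);
    extern void policy_select_and_drop(void *imap, int inumber,
                                       struct inode_log *log, struct block_refs *refs);
    extern void write_inode_log(int mntpt, const struct inode_log *log);          /* via writeBlock */
    extern void free_unreferenced(int mntpt, const struct block_refs *refs);      /* via freeBlock  */

    void
    cleaner(const char *fs)
    {
        void *imap;
        int   mntpt, nfiles;

        mntpt = mapIMap(fs, &imap, &nfiles);            /* map the IMap into our address space */
        for (;;) {
            int in = pick_hottest_file(imap, nfiles);   /* highest-temperature entry           */

            lockFile(in);
            struct inode_log  *log  = read_inode_log(mntpt, imap, in);
            struct block_refs *refs = count_block_refs(log);

            /* The file's retention policy picks zero or more inodes to delete;
             * dropping an inode decrements the counts of the blocks it references. */
            policy_select_and_drop(imap, in, log, refs);

            write_inode_log(mntpt, log);                /* synchronous write of the inode log   */
            free_unreferenced(mntpt, refs);             /* free every block whose count is zero */
            unlockFile(in);
        }
    }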
The cleaner protects the file system from failures that occur during cleaning by ordering disk updates to ensure that a block is not freed if any inode on disk stores a reference to it. A recovery procedure reclaims disk blocks that are unreachable from any inode. If the cleaner process fails independently of the rest of the system, locked inodes are inaccessible and partially freed disk blocks are not reclaimed. The system detects and recovers from cleaner failure by terminating the cleaner process, releasing all inode-log locks it holds, scheduling the recovery procedure, and restarting the cleaner.

Finally, it is important to distinguish the Elephant cleaner from the cleaner of a Log Structured File System [24, 27, 14]. An LFS cleaner serves two roles: it frees obsolete blocks and it coalesces free space. In contrast, the Elephant cleaner's role is simply to free obsolete blocks. As a result, the Elephant cleaner has significantly lower overhead than an LFS cleaner, because Elephant's cleaning is performed without reading any file data blocks; only the inode log need be accessed. In contrast, the LFS cleaner must typically read every data block at least once, even obsolete blocks, and it may read and write active blocks multiple times.

5.6 Tools

We have augmented the standard set of UNIX file manipulation tools to add several new tools that manage the file system's new temporal dimension. The tls tool is similar to ls, but instead outputs a list of a file's history, including information about deleted versions. Similarly, tgrep searches for patterns in every version of a file in much the same way that grep searches a list of files. Finally, we have implemented a history browser called tview that is useful for reconciling file histories with respect to discontinuous intervals for rollback in a group of files. If, for example, a user tries to roll back to "Jan 15 at 11:30:00" and a file version that existed at that time has been freed, this rollback is incomplete. The user can use the version history tool to identify a time in the continuous interval (e.g., "Jan 15 at 11:25:00") and suitably refine her rollback. Of course, if the user had known that "11:30:00" was going to be important, she could have used setLandmark to manually specify it as a landmark and thus prevent the system from creating the discontinuity. Furthermore, she could have used groupFiles to ensure that a landmark in one file protects all files in the group.

5.7 Ramifications of the Implementation on Kernel Caches

As discussed in Section 3.1, the BSD kernel maintains two caches: the vnode cache and the buffer cache. This section discusses the implications of using Elephant with BSD's vnode cache and buffer cache. The kernel's caches were designed to minimize the number of disk accesses performed during normal operation of the system. BSD does this by making assumptions about the way file systems lay out data on the disk. Elephant violates these assumptions, resulting in block aliasing in the buffer cache. Block aliasing is the condition whereby two (or more) different names in the buffer cache refer to the same physical disk block. The names use separate memory pages to store the disk block. Block aliasing is detrimental to the performance of the cache for two reasons, both of which increase the number of disk transfers that take place. The first reason, duplication of data in the cache, effectively reduces the size of the cache and hence the lifetime of a block in the cache.
Since lifetimes are shortened, disk accesses are increased. The second reason that performance is decreased is that, since the blocks have different names in the cache, looking for one of the names and not finding it will result in an unnecessary disk transfer to fetch it, even though it is already in the cache under another name. The two file types that produce these problems are ordinary data files and directories. Block aliasing results for different reasons in these two file types and hence we shall discuss them separately.

5.7.1 Ordinary Files

To access a file in the FreeBSD kernel one needs a vnode. Ordinary files in Elephant can have multiple versions. This presents some problems that have not been fully addressed in the current implementation. Vnodes have file-system-specific inodes associated with them. When a read or a write is done on a vnode, the file system uses the inode attached to the vnode to find the file's data blocks. Versions of files have some data blocks that differ, so every version of a file needs a different inode; as a vnode can only point to one inode, each version of a file must have a different vnode. As will be seen, this results in block aliasing in the buffer cache. Blocks in the cache are tracked by the vnode pointer and logical block numbers. If multiple versions of a file reference the same physical disk block, which is possible with copy-on-write protection, then there can be multiple copies of the disk block in memory, since all versions of a file have their own vnode.

Fixing this problem is non-trivial and would involve changing the basic assumptions of the whole physical page management system. The major assumption is a virtual memory subsystem invariant; recall from Section 3.1 that each vnode has one vm_object associated with it. A physical page can only be in one vm_object at a time. The entire kernel, from the buffer cache to the pagedaemon and swapper process, assumes pages can only exist in one object at a time. This precludes Elephant giving multiple names to the same physical page. Block aliasing does simplify one aspect of managing Elephant blocks. As all versions of files already have separate copies of disk blocks if they are in memory, modifying the latest version does not result in consistency problems.

5.7.2 Directories

Elephant supports viewing directories in the past. However, as presented in Section 5.3, Elephant directories have only one version. They track changes with name logs. All vnodes for a directory, no matter what time the directory is to be viewed at, will have inodes that point to the same data blocks. This presents a problem when using a vnode for a past view of a directory. For instance, consider the current working directory of a shell where the directory is not the current version. The inode associated with the vnode for the directory points to the data blocks of the current version of the directory, yet name lookups must behave differently. To effect correct behaviour in lookups, a time must be used when examining the name log. This time, the time the directory is being viewed at, is recorded in the vnode. Regular files, by contrast, do not require the time after the vnode has been obtained, as they have multiple versions. Browsing a directory at many different times will result in many different vnodes for the same directory. Each vnode will have its own copy of data in the buffer cache; thus directories exhibit very acute block aliasing.
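To illustrate how a single-version directory can nonetheless answer lookups in the past, the sketch below filters entries by the per-name creation and deletion times of Section 5.3, using the view time recorded in the vnode. The entry layout and function names are assumptions made for illustration, not the prototype's structures.

```c
#include <string.h>
#include <time.h>

/* Simplified Elephant-style directory entry (illustrative layout only). */
struct edirent {
    char     name[64];
    unsigned inumber;
    time_t   created;       /* when this name was created           */
    time_t   deleted;       /* 0 if the name has not been deleted   */
};

/*
 * Return the inumber bound to 'name' at time 't', or 0 if no such
 * binding existed then.  An entry matches if it was created at or
 * before 't' and either never deleted or deleted after 't'.
 */
static unsigned
lookup_at(const struct edirent *dir, int n, const char *name, time_t t)
{
    for (int i = 0; i < n; i++) {
        if (strcmp(dir[i].name, name) != 0)
            continue;
        if (dir[i].created <= t &&
            (dir[i].deleted == 0 || dir[i].deleted > t))
            return dir[i].inumber;
    }
    return 0;
}
```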
Chapter 6

Performance and Feasibility

This section assesses the feasibility of our design. First, we compare the performance of our Elephant prototype to the standard FreeBSD FFS file system. Second, we analyze file-system trace data to estimate how much extra storage an Elephant file system might consume for history information. Third, we examine the types of files stored by a large file system to estimate what portion of its files would be versioned. Finally, we describe a user study based on a modified Elephant prototype that shadows an NFS server.

6.1 Prototype Performance

Our experimental setup consists of a 200-MHz Pentium MMX with 64 MB of RAM and a 3-GB IDE disk. Our prototype is a modified FreeBSD 2.2.8 kernel. We compare the performance of the prototype to an unmodified FreeBSD 2.2.8 kernel, which uses the UNIX fast file system (FFS) [16]. Both file systems were configured with a 4-KB block size. The measurements were taken by configuring each file system in the same disk slice to normalize for the effects of disk geometry.

    Operation               EXP-V (μs)   EXP-O (μs)   FFS (μs)
    open (current)                 134          133        132
    open (20th ver)                140            —          —
    open (75th ver)                160            —          —
    write (4-KB)                  58.7         54.5       47.3
    close (upd)                   35.8         34.5       34.9
    close (upd 24th ver)           121            —          —
    create (0-KB)                 5040         3750       3930
    delete (0-KB)                  452         2154       3010
    delete (64-KB)                 446         4522       4732

Table 6.1: Performance of basic file system operations.

In the following tables, FFS refers to the fast file system, EXP-O refers to files using the Keep-One policy whose inodes are stored in the inode file, and EXP-V refers to files using the Keep-All policy whose inodes are stored in inode logs (the performance of Keep Safe and Keep Landmarks is identical to Keep All).

Micro-benchmarks

Table 6.1 summarizes the performance of the basic file system operations. All measurements were taken by running a user-mode program and using the Pentium cycle counter to measure the elapsed time each system call spends in the kernel. The times do not include the overhead of entering and exiting the kernel. Each number represents the median of 10,000 trials.

The first three lines of Table 6.1 show the time to open a file when its inode (or inode log) is cached in memory. The first line shows that the time to open a file's current version is virtually the same in Elephant and in FFS. This is also true if the inode and inode logs are not cached; in this case, both EXP and FFS are slowed by the latency of a single-block disk read. The next two lines show the cost of opening an older version, which requires scanning the inode log and possibly reading additional inode-log blocks: each 4-KB inode-log block stores 24 inodes. Opening the 20th version is 6 μs slower than opening the current version, 0.3 μs for each inode scanned. Opening the 75th version is 26 μs slower. If the inode and inode-log blocks are not cached in memory, open (20th ver) requires one disk read and open (75th ver) requires four, one for each inode-log block it must scan.

The cost of the first write of a 4-KB block after an open is 58.7 μs in EXP-V and 47.3 μs in FFS, including the time to schedule an asynchronous disk write. EXP-V is 11 μs slower due to the copy-on-write operation required for this first write. The performance of subsequent writes to the block before the close is the same as in EXP-O. EXP-O is 7.2 μs slower than FFS, because the prototype currently performs a modified copy-on-write for these writes as well.
This use of copy-on-write is an artifact of the way our prototype evolved; it is not required for versioning and could be replaced with an update-in-place scheme with the same performance as FFS.

The next two lines of Table 6.1 show the time to close a modified file. For EXP-V, the close produces a new version of the file. If the file's new inode fits in the current inode-log block (i.e., close (upd)), the performance of EXP-V is virtually the same as EXP-O and FFS. If it does not fit (i.e., close (upd 24th ver)), a new inode-log block is allocated and overhead increases by 85.2 μs. All EXP and FFS close operations require a single disk write; in the times reported, this write is handled asynchronously. Finally, if the inode or inode log is not cached in memory, all EXP and FFS close operations require one disk read. Closing a read-only file is the same in EXP and FFS.

The last three lines of Table 6.1 show the time to create and delete a file. Creating an empty versioned file requires 5.04 ms. This time is 1.11 ms slower than FFS, because EXP must allocate a new inode log for the file and write it to disk. Deleting a versioned file in EXP is considerably faster than FFS, because an EXP delete does not release the file or its disk blocks.

Finally, we measured the performance impact of versioned directories. Recall that an Elephant directory stores both active and deleted names and that a second inode is used to archive deleted names. The system periodically moves deleted names to the archive and compacts the active names; the overhead of this operation is 0.22 μs per name in the active inode. Name lookup in the current version of a directory is slowed by the number of deleted names in the active inode that must be skipped. Currently, we trigger compaction when the active inode has 20% deleted names. The impact on name-lookup time, however, is much less than 20%, because a deleted entry can be skipped without performing a name comparison, the operation that dominates lookup overhead.

The Andrew File System Benchmark

We ran the modified Andrew File System Benchmark for EXP-V, EXP-O, and FFS. This standard benchmark is designed to represent the actions of a typical UNIX user. It creates a directory hierarchy, copies 70 source files totaling 200 KB into the hierarchy, traverses the hierarchy to examine the status of every file, reads every byte of every file, and compiles and links the files. To be conservative, we modified the prototype to force all EXP-V files to be stored in an inode log. We did this because Andrew does not overwrite or delete files, and thus the EXP adaptive meta-data allocation scheme would normally store these single-version files in an inode, not an inode log.

The benchmark's elapsed time was 19 s for EXP-V and EXP-O, and 18 s for FFS. FFS was one second faster in the compile-and-link phase of the benchmark. The total meta-data consumed by files was 18 KB for EXP-O and FFS and 444 KB for EXP-V. EXP-V consumed 426 KB more space, because it stored files in 4-KB inode logs instead of 168-B inodes.

Copying a Large Directory

We also measured the performance of copying a directory that is substantially larger than that used in the Andrew Benchmark. For this experiment, we copied the FreeBSD-kernel source tree, which consists of 1525 files totaling 20 MB. Again, we forced all EXP-V files to use an inode log.
We measured an elapsed time of 49 s for EXP-V, 56 s for EXP-O, and 115 s for FFS. EXP is faster than FFS because EXP can place a newly created inode or inode log anywhere on disk; it is thus able to group meta-data and data writes more effectively than FFS. The total meta-data consumed in EXP-O and FFS was 0.24 MB; EXP-V consumed 5.9 MB.

Cleaner Performance

To measure the ability of the cleaner to free disk blocks, we created three versions of 1000 different files. Each version modified every block of a file. File sizes ranged between 1 KB and 1 MB, with most of the files less than 4 KB; the mean file size was 145 KB. The Keep-Safe policy was assigned to each file and the cleaner was triggered when the changes became permanent. The cleaner then removed two versions of each file for a total of 284 MB. In this experiment, the cleaner was able to release storage at a rate of 5.6 MB/s.

6.2 Trace Studies

The file system trace data we examined was gathered on a set of twenty instructional UNIX workstations used by undergraduate classes at UC Berkeley [23]. We had hoped to use these traces to demonstrate that an Elephant file system would not consume significantly more storage than a standard file system. Unfortunately, a key problem with these and other such traces is that they do not include file names, but only unique file identifiers (in this case, device number and file identifier). This obscures the true history of the files, as many applications will perform file changes in a separate copy of the file, which is then renamed to the original. This effectively means the same file name can exist with multiple file identifiers in the trace, and this information is effectively hidden from us.

We have, however, used these traces to place a rough upper bound on the growth of an Elephant file system. To accomplish this, we processed the traces by:

• Removing all records for temporary and system (e.g., root) file systems. Files in these file systems either do not need versioning, or change only very slowly. This reduces the number of writes by an average of 20%.

• Removing all records for files that are executed, and for files that are larger than 100 KB. These files are unlikely to need versioning. This reduces the amount of data written by a further 90%.

• The final step is to coalesce writes between closes, and to convert the byte ranges in files that change into block counts (which gives a more accurate picture of the potential storage increase under an Elephant file system).

The result of these processing steps is that the average write throughput in these traces is 2.64 MB/workstation/day. This final result is very promising, for three reasons. First, 2.6 MB/day is not a large amount of data: it is only a small percentage of modern disk drive capacity. Second, this is a conservative figure. Many of the files included in the final trace would not be versioned in an Elephant system. Finally, the analysis above does not account for file deletion. It is certain that file deletions would reduce the amount of storage growth, even in the face of some of these files being versioned.

Impact on Buffer Cache

A potential drawback of Elephant is that it reduces the effectiveness of buffer-cache write absorption and thus increases the number of disk writes. In most UNIX systems, two writes to the same file system block within a certain time period (typically thirty seconds) will be absorbed by the buffer cache, and will result in only a single disk write.
For Elephant versioned files, however, if there is an intervening close between the file system writes, then two separate disk writes are required. We used the Berkeley traces to determine the increase in disk write traffic that would occur with an Elephant system. To do this, we counted the number of times writes to the same block occurred with an intervening close, and determined the potential increase in disk write traffic to be 5.5%.

6.3 File System Profile

We have examined the content of the workstation home-directory server for a major research lab. This server contains approximately 12 GB of data in 1.7 million files.

    File Type                Files (%)   Bytes (%)
    Read Only                     15.4         8.8
    Derived                        6.8        25.9
    Temp/Tar/Compressed            5.7         1.1
    Larger-than-100-KB             2.4        53.7
    Other                         69.7        10.5

Table 6.2: File distribution of 12-GB research-lab workstation home directories.

We divided these files into the following five categories.

• Read-Only includes files that are not writable by users other than root.

• Derived includes executable files and files whose name ends with .o or .a, under the assumption that these are object and library files respectively.

• Temp/Tar/Compressed includes files whose path name includes tmp or temp and files whose name ends with .tar, .Z, or .gz.

• Larger-than-100-KB includes files not included in the other categories that are larger than 100 KB.

• Other includes all other files.

Table 6.2 shows the distribution of files into these five categories. Note that 69.7% of the files, but only 10.5% of the blocks, fall into the Other category, the category most likely to have files that users will consider important enough to be versionable. Furthermore, it would be reasonable to assume that most Read-Only and Derived files would use the Keep-One policy, most Temp/Tar/Compressed and Larger-than-100-KB files would use the Keep-Safe policy, and most Other files would use the Keep-Landmarks policy. If this assumption holds, the distribution of files to policies breaks down as follows.

• Keep One: 22.2% of files - 34.7% of bytes.

• Keep Safe: 8.1% of files - 54.8% of bytes.

• Keep Landmarks: 69.7% of files - 10.5% of bytes.

6.4 Summary

Our performance measurements show that there is no show-stopper in the Elephant prototype. Performance is competitive with the standard FreeBSD FFS across a broad range of micro-benchmarks and some simple macro-benchmarks. We show that meta-data storage for versioned files can be 24 times larger than for non-versioned files, if a versioned file is stored in an inode log instead of an inode. This fact demonstrates the importance of our adaptive approach that uses inode logs only where necessary, to store files that currently have a history.

Our analysis of the file system traces and studies of the distribution of file types tell a common story. The majority of files, and specifically writes, are to executables. The number of files for which versioning would be desirable comprises only a small fraction, approximately 10%, of typical file systems. This result is validated by the relatively small write rate we have calculated for the Berkeley traces on unclassified files. Given these results, we believe that the extra storage and disk write overhead incurred by using a file system such as Elephant is of minimal cost compared to the convenience and time gains (due to not having to restore or recreate accidentally-deleted files) made possible.
Finally, using NFS-server shadowing, we are able to exercise the prototype with a real user workload, and we are beginning to gain some information from users regarding the usability of the system.

Chapter 7

Implementation Alternatives

The design for Elephant which was implemented is not the only possible design. There are a number of alternatives. Here we outline and contrast some of the other designs considered. The discussion is based on the changes needed to make a file system comply with Section 4.1.4.

7.1 Log Structured File Systems

The operation of a log structured file system (LFS) [24] is well suited for an Elephant file system. An LFS does not write in place, in an effort to utilize as much of the write bandwidth as possible from the underlying device. Whenever a new version of data is written to disk, a new inode is also written out containing the new disk addresses for its data. The old inode is lost, so the old data blocks are no longer reachable. An inode map (imap) maps inode numbers to the current location of inodes on the disk.

LFS has all the information on disk that it needs to be the basis for an Elephant system. It versions data, for it never writes in place, and keeps records in the form of multiple inodes. Accessing the multiple versions of a file is not supported by LFS. Building Elephant support into LFS would be trivial; it requires only the addition of back pointers to inodes. The inode map would point to the most recent inode, which in turn would point back to the one that existed prior to it. This would form a chain of inodes that can be indexed temporally by examining the modification times of the inodes. Policies could be implemented by a user-level process that examines the logs and decides which versions of a file to deallocate.

While seemingly trivial to implement, this design suffers from other shortcomings. Implementing this design essentially gives you an LFS. This implies the implementation would suffer from the shortfalls of LFSs. The main drawback of LFSs is the cleaner process. LFS logs everything in the file system and carves the disk up into large chunks called segments. The file system amalgamates many meta-data changes and data writes into a segment and writes out the entire segment. As the system evolves, all the segments on the disk are used, but most of the segments contain some invalid data (due to deletes). The cleaner's job is to examine the segments and collect live data into full segments, freeing up other segments for future use. The cleaner process can consume a large amount of disk bandwidth as well as CPU time. With the addition of a second process to implement policies, the potential for creating load on the system is quite high.

The main reason this design was not chosen was the lack of a working set of sources to build from. The main advantage of this design is the easy modification of an existing system; if there does not exist a system to modify, then this advantage is negated.

7.2 Journaling File Systems

Journaling file systems (JFS) differ from LFSs in that data is written in place while only meta-data is logged. Some examples of journaling file systems are Silicon Graphics' XFS [28] and Veritas Software Corporation's Veritas File System [2]. The type of meta-data logged varies between implementations. Inodes are always logged, as are block allocation data, but name manipulations, such as a name deletion, are not always included in the log.
Some implementations log transactions instead of meta-data, and we do not discuss them here. The log is typically a preallocated region on the device. Normally the structure of the log is a circular queue. The semantics of the log usually fall into one of two groups, redo-only and undo-redo. Some systems employ a second device for the log.

Journaling file systems offer many advantages over other types of file systems. They offer quick recovery from crashes, as the corrupted file system can be fixed by replaying the log. The entire disk does not need to be examined. Bringing up a JFS after a crash is a function of the size of the log, not the size of the disk. With clustering and extent-based block allocation they can offer very competitive write performance compared to LFSs. Even operations which are meta-data intensive, the only area in which LFSs still claim superior write performance [26], can be done quickly, as the meta-data is appended to the end of the log and not scattered around on the disk. Most commercial file systems being implemented today are journaling file systems. There is a dearth of public code implementing JFSs, so this option was not used.

Elephant can be implemented fairly easily on top of journaling file systems. Meta-data is already being logged. With the addition of back pointers to the inodes, the inodes form a chain of versions through the log. The different versions can be indexed by time using the modification time field in the inodes. Systems employing a circular queue for a log would need to be modified to support unbounded logs. However, these systems typically write data in place, which violates an Elephant axiom. Data blocks would need to be protected by copy-on-write mechanisms. Policies would be implemented by a user-level process replaying the log and deciding which versions of files to destroy. Files that are not to be versioned would have to have their inodes stored outside of the log. Keeping the non-versioned inodes in the log would result in fragmentation of the log as they were deleted.

7.3 Write in Place File Systems

File systems that write meta-data and data in place are called write-in-place file systems. The most famous example of this sort of a system is Berkeley's Fast File System (FFS) [16]. Fast File System variants are as ubiquitous as FFS is influential. FFS is feature rich, robust, and people trust it. Over the years it has been repeatedly tuned, keeping its performance competitive. However, FFS is not well suited to implementing Elephant. None of the preconditions to Canonical Elephant, described in Section 4.1.4, exist in FFS. Data blocks are updated in place, as is meta-data. The clustering code can on occasion change the disk address of a data block to build a contiguous cluster, but it is not guaranteed. Even when the clustering code does reallocate data blocks, the previous data is lost, as inodes are updated in place. With the addition of copy-on-write to FFS, multiple versions of files would be produced. Still, FFS has no facility for tracking these versions. What is needed is a journal of inodes that references the different versions and is temporally ordered. Elephant requires more information than FFS does to manage files. This additional information includes the file's policy and file group membership. This information must be stored efficiently somewhere in the file system. Finally, some entity must scan the journals to implement Elephant's policies.
To understand the solution, a brief explanation of FFS inode bookkeeping is in order. FFS inode numbers are used to calculate the disk address of the inode. This is done by hashing on the inode number and parameters set at file system creation time. Every time a file is modified, its inode is overwritten at the exact same spot on the disk. This overwriting of inodes makes it difficult to build the journal of inodes that is required to track the multiple versions that the copy-on-write semantics will produce.

A journal of inodes can be effected by encapsulating the journal in a meta-file. The meta-file is a list of inodes ordered on their modification times. New inodes are appended to the end of the meta-file as they are created, ensuring a temporal ordering. The default inode, that is, the youngest inode or current inode, is always found at the end of the file. Like all files, the meta-file will need an inode to encapsulate it. This meta-file inode can be stored where the file's ordinary inode used to live. The ordinary inodes now live in the meta-file, so they no longer need the space. FFS inodes have unused fields in them; these fields were intended for future use. Elephant's file parameters, such as retention policy and file group membership, can be stored in these fields. This scheme would make finding a file's inode slower, as two disk accesses are now required: one to find the inode for the meta-file and a second to read the meta-file to find the inode.

The last piece missing to make FFS an Elephant system is policy implementation. The retention policies could be implemented as a user-level process. User-level processes need access to the meta-file. Meta-files could be provided to processes through file descriptors. When a process opens a file it is provided with a file descriptor for the file. UNIX provides an interface, the fcntl(2) system call, for processes to alter the semantics of file descriptors. fcntl could be used to change the file descriptor to point to the meta-file instead of the file, thus allowing processes to use the standard file-descriptor-based system calls to access meta-files. Processes could then examine the meta-file, decide which versions to discard, and write the meta-file back. Synchronization between the kernel and the cleaner process could be achieved through standard file locking mechanisms.

This design has some advantages over the Elephant prototype we implemented. Writing file systems is hard. The interface between kernel subsystems and a file system is undocumented and asynchronous. Applications expect many behaviours that can only be found by trial and error. Many weeks were spent during the building of the Elephant prototype not writing code but reading kernel code and making applications happy. FFS already 'works' and it is fast. Modifying FFS instead of writing a new file system would have given us a full featured file system. The end result would have been much more usable. The fcntl interface would also obviate a number of ideas in the prototype and would have made it simpler. The current prototype required special kernel interfaces for the cleaner for it to read the imap, read/write logs, and free disk blocks; fcntl would have reduced the number of system calls required.
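A rough sketch of how an application-level cleaner might use this fcntl-based meta-file access is shown below. It is purely hypothetical: the fcntl command F_ELEPHANT_METAFILE and the meta-file record layout are invented for illustration and do not exist in any FFS.

```c
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

#define F_ELEPHANT_METAFILE  100   /* hypothetical fcntl(2) command          */

struct metarec {                   /* one journaled inode (assumed layout)   */
    time_t mtime;
    /* remaining inode fields would follow in a real layout */
};

int
scan_metafile(const char *path)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    /* Re-point the descriptor at the file's meta-file (inode journal). */
    if (fcntl(fd, F_ELEPHANT_METAFILE, 0) < 0) {
        close(fd);
        return -1;
    }

    /* Standard read(2) now walks the journal of inodes, oldest first. */
    struct metarec r;
    while (read(fd, &r, sizeof r) == sizeof r) {
        /* a retention policy would inspect r.mtime here, decide which
         * versions to discard, and later write the journal back       */
    }
    close(fd);
    return 0;
}
```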
File systems do not know what the underlying device is 56 doing. This software can do processing on the disk requests before passing them on the physical device including changing the addresses in the requests. Building an Elephant logical disk would support, theoretically, running FFS or any file system on top of it unaltered. This has the enormous advantage of writing Elephant once and getting many Elephant file systems in return. The greatest challenge in writing Elephant as a logical disk is that logical disks are very low-level. Logical disks only know about raw disk blocks. They are viewed as devices by their client file systems. Files and other abstractions are totally foreign to them. This requires a different approach to the design. In the absence of access to the file system's abstractions the only objects a logical disk can version are disk blocks, or, moreover their disk addresses. There are a number of problems with this approach. The semantics of Elephant are such that blocks are copied only once between one open/close pair even if the block is written to multiple times. Disks unfortu-nately know nothing of opens and closes so the logical disk does not know when to perform a copy on write. Choosing versions to deallocate is also problematic as the logical disk does not know about files nor about versions of files. The only information the logical disk has access to is ages of disk blocks. Clearly the logical disk it not suited for this sort of an application. Elephant requires intimate knowledge of file system objects and operations. The logical disk is too low level to have access to this information. 57 Chapter 8 Conclusions and Future Work 8.1 Future Work and Issues This section briefly discusses several remaining issues related to our design and prototype implementation that are worth considering. 8.2 Version History Export and Import Currently, Elephant stores files as version histories, but users can only access files one version at a time. This interface makes sense, because it matches the way that people access files in standard file systems. It is also useful, however, to allow users and applications to manipulate files at the granularity of their complete history. This facility is needed, for example, to move a file from one Elephant file system to another or to backup a file and its history. We envision adding two new file-system operations to export and import file histories, either as kernel operations or as user-mode tools. Export would generate a single intermediate file containing the exported file's complete history. Import would 58 reconstitute a versioned file from an intermediate file. Copies between Elephant file systems could be handled using a combined export-import operation, optimized to avoid creating the intermediate file. A n important open question is how to handle files that are managed by an application-defined policy. Ideally, the intermediate file would form a closure of both the exported file and its policy module. A file imported from an other Elephant file system could then really be identical to the exported file. 8.3 Disconnected Operation Export and import operations allow users to use multiple Elephant systems to access a file, by copying it to the appropriate system before accessing it. It would be useful to extend this modest support to allow for full-fledged disconnected operation. In any file system, the key issue for disconnected operation is handling updates to a file that occur concurrently in multiple file systems [11]. 
For Elephant, the issue is somewhat more complicated, because it requires that an import operation be able to merge the version histories of multiple copies of a file. The key problem is that the merged history may have intervals in which multiple versions co-existed.

8.4 Version History Merge and Branch

Fundamental to Elephant's naming scheme is the assumption that multiple versions of a file never co-exist. We have just seen, however, that disconnected operation can violate this assumption. The same is true for software revision control systems. A revision control system typically views file history as a rooted acyclic graph, allowing for branch and merge points. A branch occurs when a single version splits into two or more new versions of the file that can be modified independently. A merge combines multiple concurrent versions of a file into a single version. Merging thus allows branched versions to be reconciled periodically, before a software release, for example.

To support disconnected operation and applications such as RCS, we are exploring the idea of adding branch and merge points to Elephant file histories. The key issue is naming. Our current approach that uses time to name files must be augmented or replaced in order to resolve ambiguities that occur when multiple versions co-exist. One strategy is to add a tag to each concurrent history created by a branch. Versions could then be named by the pathname-time-tag triple when necessary to disambiguate among concurrent versions of a file.

8.5 Application Defined Retention Policies

The policies described in this thesis are not the only policies possible. While finding alternative policies is an active research area, it would be convenient for policies to be added without changing the base operating system. For instance, applications may wish to implement their own cleaning policies. These policies should only be installed on the Elephant system if the applications are installed on the system.

There are a number of models possible to support application-specific cleaning. The major classes of designs are system cleaner extensions and combining a trusted cleaner process with untrusted application cleaner processes. The design we are exploring falls into the system cleaner extension taxonomy. Application-specific policies are cleaning policies provided by applications for use when cleaning their files. This involves an application writing the policies and getting Elephant to use them. Our approach involves two steps, sketched below. The first involves the application registering with the kernel. The kernel provides a policy ID. This policy ID is then used as a parameter by the application with the setPolicy system call to set policies on its files. Whenever the cleaner comes across a file bearing the policy ID, it must clean it with the application's policy. The policies are provided by the application by way of callbacks. When the application registers with the kernel, it provides the callbacks for use by the cleaner.
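The following sketch illustrates this two-step registration model. Every name here, including registerPolicy, the callback structure, and the C binding assumed for setPolicy, is an assumption made for illustration; the thesis does not define this interface.

```c
#include <time.h>

struct version_info {          /* digest of one version, as from getHistory */
    time_t mtime;
    int    landmark;
};

/* Callback supplied by the application: nonzero means version i of the
 * file's history may be reclaimed by the cleaner. */
struct retention_ops {
    int (*reclaim)(const struct version_info *hist, int nvers, int i);
};

extern int registerPolicy(const struct retention_ops *ops);  /* returns policy ID (assumed) */
extern int setPolicy(const char *file, int policyID);        /* Table 5.1 call, binding assumed */

/* Example policy: keep the newest three versions and all landmarks. */
static int
keep_three(const struct version_info *hist, int nvers, int i)
{
    return !hist[i].landmark && i < nvers - 3;
}

int
install_policy(const char *file)
{
    static const struct retention_ops ops = { keep_three };
    int id = registerPolicy(&ops);   /* step 1: register, obtain policy ID */
    if (id < 0)
        return -1;
    return setPolicy(file, id);      /* step 2: attach the policy to a file */
}
```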
8.6 Conclusions

Since their inception, file systems have contracted with their users to reliably store the most recent version of each file until that file is deleted. File systems have evolved excellent solutions to address a wide variety of network, system, and media failures. We believe that it is time to offer a richer contract in which the file system also protects the user from her own mistakes. This has become feasible due to the recent arrival of very large cheap disk storage. Our analysis of file system trace data indicates that the amount of space required to provide this level of protection is moderate, on the order of 2.6 MBytes per user per day. The amount of data that is read/write in a file system has remained fairly constant while the space available on disks is growing enormously.

Providing protection from user mistakes requires the separation of file system modification operations and file system storage reclamation. All operations in the file system that modify data must be revocable, meaning that copy-on-write techniques must be used to maintain all file system data, and no regular file system operation can free storage. Since file system modification has been separated from storage reclamation, we must define mechanisms and policies for storage reclamation. We have argued that the system must support the specification of storage reclamation policies at the granularity of individual files or groups of files. We have also described four storage reclamation policies that we believe will be valuable to users, and also how both these system-defined policies and application-defined policies can be implemented using a simple interface to the file versioning information maintained by the file system.

The use of Elephant in production environments raises some interesting issues. The work presented in this thesis did not, and cannot, take all file system usage patterns into account. The work was based on research and development environments. Clearly there are a number of other environments.

Academic environments exhibit very different usage patterns from research usage patterns. In addition to the different usage pattern, schools frequently employ disk quota schemes to ensure resource availability for all students. At first glance a quota system is counter to Elephant's primary premise, that disk space is abundant for users. In a quota system a user typically has a very tight quota of disk space, hence storage is at a premium. In fact students, and other users of quota systems, need the protection that Elephant provides just as badly as other users. We would expect them to compensate for lack of space by judicious use of policies. For instance, only before assignments are handed in are they versioned; afterwards they are set to be non-versioned. This would be safe, as the assignment is no longer active after being handed in and hence does not evolve.

Installations concerned about security may use Elephant very differently than other sites. Elephant without a cleaner running and with no way to change policies on files would record all versions of files forever. This would allow the security conscious to track the use of their systems very closely. Internet web sites could detect intrusion. Employees carrying out illegal activities with sensitive data could be detected and prosecuted far more easily. This sort of system, depending on usage, would consume disk space far more rapidly than our numbers suggest. The true cost of Elephant under these scenarios is unknown.

This thesis has presented our arguments that this new contract between the file system and the user is desirable and feasible, and has described our initial attempt to build a file system, Elephant, which implements this new contract. Our experience with the system to date indicates that there is no substantial performance penalty for providing this additional level of protection.

Bibliography

[1] Clearcase - product information. Technical report.

[2] The Veritas file system. Technical report.
[3] M. J. Bach. The Design of the UNIX Operating System. Prentice-Hall, Englewood Cliffs, New Jersey.

[4] Sailesh Chutani, Owen T. Anderson, Michael L. Kazar, Bruce W. Leverett, W. Anthony Mason, and Robert N. Sidebotham. The Episode file system. In Proceedings of the Usenix Winter 1992 Technical Conference, pages 43-60, Berkeley, CA, USA, January 1992. Usenix Association.

[5] Wiebren de Jonge, M. Frans Kaashoek, and Wilson C. Hsieh. The logical disk: A new approach to improving file systems. In Proceedings of the 14th Symposium on Operating Systems Principles, pages 15-28, December 1993.

[6] Digital. Vax/VMS System Software Handbook. Bedford, 1985.

[7] James Griffioen and Randy Appleton. Reducing file system latency using a predictive approach. In Proceedings of the Usenix Summer Conference, pages 197-208, Boston, MA, June 1994. Usenix.

[8] R. Hagmann. Reimplementing the Cedar file system using logging and group commit. Proceedings of the 11th ACM Symposium on Operating Systems Principles, 21(5):155-162, November 1987.

[9] Dave Hitz, James Lau, and Michael Malcolm. File system design for a file server appliance. In Proceedings of the 1994 Winter USENIX Technical Conference, pages 235-245, San Francisco, CA, January 1994. Usenix.

[10] John H. Howard, Michael L. Kazar, Sherri G. Menees, David A. Nichols, M. Satyanarayanan, Robert N. Sidebotham, and Michael J. West. Scale and performance in a distributed file system. ACM Transactions on Computer Systems, 6(1):51-81, February 1988.

[11] James J. Kistler and M. Satyanarayanan. Disconnected operation in the Coda file system. In Proceedings of the Thirteenth ACM Symposium on Operating System Principles, pages 213-225, Pacific Grove, CA, October 1991. ACM.

[12] S. R. Kleiman. Vnodes: An architecture for multiple file system types in Sun UNIX. In Proceedings of the Summer 1986 USENIX Technical Conference, pages 238-247, June 1986.

[13] K. Smith and M. Seltzer. A comparison of FFS disk allocation policies. In Proc. 1996 USENIX Conference, pages 15-26, January 1996.

[14] Jeanna Neefe Matthews, Drew Roselli, Adam M. Costello, Randolph Y. Wang, and Thomas E. Anderson. Improving the performance of log-structured file systems with adaptive methods. In Proceedings of the 16th Symposium on Operating Systems Principles, pages 238-252, 1997.

[15] M. McDonald and R. Bunt. Improving file system performance by dynamically restructuring disk space. In Proc. of Phoenix Conference on Computers and Communication, pages 264-269, March 1989.

[16] M. K. McKusick, W. N. Joy, S. J. Leffler, and R. S. Fabry. A fast file system for UNIX. ACM Transactions on Computer Systems, 2(3):181-197, August 1984.

[17] L. McVoy and S. Kleiman. Extent-like performance from a UNIX file system. In Proceedings of the 1990 Summer Usenix, pages 137-144, June 1990.

[18] Lisa Moses. TOPS-20 User's Manual. USC/Information Sciences Institute, internal manual, Marina del Rey, California.

[19] J. K. Ousterhout, H. Da Costa, D. Harrison, J. A. Kunze, M. Kupfer, and J. G. Thompson. A trace-driven analysis of the UNIX 4.2BSD file system. In Proceedings of the 10th Symposium on Operating Systems Principles, pages 15-24, Orcas Island, WA, December 1985.

[20] David Presotto. Plan 9. In Proceedings of the Workshop on Micro-kernels and Other Kernel Architectures, pages 31-38, Seattle, WA, USA, April 1992. USENIX Association.

[21] Sean Quinlan. A cached WORM file system. Software—Practice and Experience, 21(12):1289-1299, December 1991.
[22] Marc J. Rochkind. The source code control system. IEEE Transactions on Software Engineering, 1(4):364-370, December 1975.

[23] Drew Roselli. Characteristics of file system workloads. Technical Report UCB//CSD-98-1029, 1998.

[24] Mendel Rosenblum and John K. Ousterhout. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems, 10(1):26-52, February 1992.

[25] Michael D. Schroeder, David K. Gifford, and Roger M. Needham. A caching file system for a programmer's workstation. In Proceedings of the 10th ACM Symposium on Operating Systems Principles, pages 25-34, Orcas Island, WA (USA), December 1985. ACM.

[26] M. Seltzer, K. Smith, H. Balakrishnan, J. Chang, S. McMains, and V. Padmanabhan. File system logging versus clustering. In Proc. 1995 Winter USENIX Conference, pages 249-264, January 1995.

[27] Margo Seltzer, Keith Bostic, Marshall Kirk McKusick, and Carl Staelin. An implementation of a log-structured file system for UNIX. In USENIX Association, editor, Proceedings of the Winter 1993 USENIX Conference: January 25-29, 1993, San Diego, California, USA, pages 307-326, Berkeley, CA, USA, Winter 1993. USENIX.

[28] Adam Sweeney, Don Doucette, Wei Hu, Curtis Anderson, Mike Nishimoto, and Geoff Peck. Scalability in the XFS file system. In Proceedings of the USENIX 1996 Annual Technical Conference, pages 1-14, January 1996.

[29] Walter F. Tichy. RCS: A system for version control. Software—Practice and Experience, 15(7):637-654, July 1985.
