UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Inter-server communication in the Mammoth file system Pomkoski, Jody James. 2002

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


831-ubc_2002-0536.pdf [ 2.52MB ]
JSON: 831-1.0051495.json
JSON-LD: 831-1.0051495-ld.json
RDF/XML (Pretty): 831-1.0051495-rdf.xml
RDF/JSON: 831-1.0051495-rdf.json
Turtle: 831-1.0051495-turtle.txt
N-Triples: 831-1.0051495-rdf-ntriples.txt
Original Record: 831-1.0051495-source.json
Full Text

Full Text

Inter-Server Communication in the Mammoth File System by  Jody James Pomkoski B . S c , Bishop's University, 1999 A THESIS SUBMITTED I N PARTIAL F U L F I L M E N T OF THE REQUIREMENTS FOR T H E D E G R E E OF M A S T E R  O F  SCIENCE  in T H E F A C U L T Y OF G R A D U A T E STUDIES (Department of Computer Science)  We accept this thesis as conforming to the Q u i r e d standard  THE UNIVERSITY OF BRITISH C O L U M B I A August 22, 2002 © Jody James Pomkoski, 2002  In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.  Department of Computer Science The University of British Columbia Vancouver, Canada  Abstract  ii  Abstract The Mammoth file system uses a collection of loosely coupled file servers to provide a highly available, widely distributed, scalable file system. Mammoth servers act as peers to cooperatively provide replicated back-up free storage. Mammoth approaches the management of file data at the granularity of whole files. Files in Mammoth are versioned upon any operation that would modify the data. Versions are immutable and retained by the system. Versioning simplifies conflict management and permits relaxing the consistency model to tolerate the latency inherent in propagating replicas. Each file or directory within the system expresses its replication and distribution requirements explicitly in the meta-data by naming cooperating nodes by IP address. The resulting system is thus inherently more scalable because all nodes do not need to monitor the entire Mammoth system. This thesis describes the design and prototype implementation'of a Distribution Manager. This module accesses file and directory meta-data and replication policies to direct its inter-server communication. The distribution manager is composed of two threads which run within a modified userlevel NFS server. These threads provide network communication via the T C P / I P protocol. A socket cache is implemented in order to amortise the relatively expensive set-up of stream sockets. Shared message queues allow asynchronous message processing across threads and nodes. Failures are actively detected and automatically trigger fault recovery.  Contents  iii  Contents Abstract  ii  Contents  . . ,  iii  List of Tables  v  List of Figures  vi  Acknowledgements  '. .  Dedication  vii: viii  1  Introduction  •  2  Design 2.1 System Architecture 2.2 The Communication Protocol . ' 2.2.1 Update Messages . . . . 2.2.2 Request - Reply Messages 2.2.3 System Management Communication 2.3 Meta-Data . 2.4 Replication and Replica Set Management 2.5 Failure Detection and Recovery 2.5.1 Network Communication Failure 2.5.2 Permanent Node Failure  3 Implementation 3.1 The Distribution Manager 3.2 Communication Threads 3.3 Mammoth Queues 3.3.1 Mammoth File Handle  •  1 4 4 . . • 8. 9 • • • • 10 11 12 13 14 14 14 16 16 18 20 21  Contents  3.4  4  Networking 3.4.1 The Mammoth Network Message 3.4.2 Underlying Transport Protocol 3.4.3 Communication Failures  Evaluating the Prototype 4.1 Functionality 4.2 Architecture 4.3 Performance  iv  22 24 24 26 27 27 28 28  5 Related W o r k 5.1 File Systems 5.1.1 Local File Systems 5.1.2 Client-Server Systems 5.1.3 Peer-to-Peer Systems 5.2 Alternate Storage Models  31 31 32 32 34 35  6  Future W o r k  37  7  Conclusion  40  Bibliography  43  List of Tables  v  List of Tables 2.1 2.2  Operations requiring an update message Request—Reply Messages  10 11  4.1 4.2 4.3  Overhead Propagating meta-data Replicating data  30 30 30  List o f Figures  vi  List of Figures 2.1  Node Deployment Model  5  3.1 3.2 3.3  Prototype Architecture The Extended File Handle Basic Message Structure  17 23 24  Acknowledgements  vii  Acknowledgements There are many people I wish to acknowledge and thank: first and foremost, thank you Mom and Dad for always being there. Your love and support sustains me. Thank you Dale — for being you and reminding me of who I am. Thanks also to Dr. Nelly Khouzam for encouraging me to pursue a Masters degree. You planted the seed and convinced me I could do it. This thesis would never have been completed if not for the support of my peers in the D S G and my supervisor. Specifically I would like to acknowledge Alex Brodsky, Yvonne Coady, Bruno Godin, Joon Ong, Matt Pedersen, and Andy Warfield. A very special acknowledgement is made to Dima Brodsky from whom I have learned a great deal. Finally I would like to thank my supervisor, Mike Feeley, for being so patient and understanding, not to mention for placing his signature on this thesis. Thanks also goes to Norm Hutchinson for all his advise and for reading and signing this thesis.  JODY JAMES POMKOSKI University of British Columbia August 2002  For M o m and Dad.  Chapter  1.  Introduction  1  Chapter 1 Introduction File system design is no longer exclusively driven by considerations of hardware capacity, raw speed, or the need to minimise communication. Disks are getting larger, faster, and more reliable and the network is quickly becoming ubiquitous. The way that users think about and use file data is also changing. Users want to store a larger and richer set of file data. They also want to have access to the data from multiple locations with minimal or no management overhead. In short file system design is now also driven by.user behaviour and to an extent the nature of the data the file system is meant to contain. File systems provide one of the basic resources that users of information systems require: persistent stable storage of data. Distributed file systems are in this respect no different. Indeed distribution of the file storage facility is used to enhance and increase its utility. A distributed system has the potential to allow access to a much greater aggregate storage resource and the ability to replicate data in many places. One of the basic reasons we build systems is to efficiently and effectively manage the available raw resource. In this case the raw resource is the total available network accessible disk storage. The goal then is to effectively manage this resource while meeting the user's requirements. A distributed file system should provide high availability and fault tolerance. The basic facility for providing these properties is to replicate the data at several places within the system. Replication, however, introduces complexity into the system. File systems are required to behave deterministically, which requires coordinating and sharing file system metadata across replicas. The additional communication overhead required for the system to effectively reason about its files introduces a scalability concern. In order to design a scalable distributed file system tradeoffs must be made. Traditionally file system replication has been done at the granularity of the whole system or large portions of the system such as volumes. This approach is sound, however, it generally entails grouping files with similar replication needs together and treating them as a whole. Large repositories  Chapter 1 . Introduction  2  or systems that are geographically distributed over many nodes will not scale because of the amount of communication required to manage the system. A, file system that only replicates a given file to a subset of system nodes is inherently more scalable in the number of nodes and the total number of files. A l l files are not created equal; files can have very different replication requirements. We observe, for instance, that an object file can be regenerated by compiling the corresponding source file. Replicating the object file becomes much less of a concern if the source file is available. This is only one example of how a file's intrinsic properties impacts on its replication requirements. There are many issues that impact on any particular file's replication needs. These may include: • The type of file. • The access pattern of the file. • The file's age. • The file's sharing pattern. • Ownership of the file. We propose that each file describe its replication requirements to the system and allow the system to manage those requirements. Essentially this means we must expand our idea of file meta-data. We have also chosen to use a versioning scheme where each version of a file is immutable and all of a files versions, are arranged in a tree representing the file's history. This decision will allow the system to be more tolerant to the inherent communication latencies observed in distributed file systems and allows us to describe the system behaviour as eventual consistency. In order to provide this model of file storage and access we allow single copy availability and use a distributed advisory locking mechanism to reduce the possibility of conflicts. Single copy availability describes Mammoth's ability to serve the version of a file that it believes to be the up-to-date current version even though it cannot verify the fact, perhaps due to a network or node failure. The Mammoth file system is a large ongoing project with many researchers contributing to both its design and prototype implementation. This thesis presents work done on the Distribution Manager. The distribution  Chapter  1.  Introduction  3  manager is the Mammoth subsystem that handles network communication between cooperating Mammoth servers. The remainder of this thesis will be presented as follows: Chapter 2 has a description of the overall design of the Mammoth file system and a detailed discussion of the required file meta-data and the inter-server communication framework. A prototype implementation of Mammoth is described in Chapter 3 along with a description of its functional behaviour. Chapter 4 presents the evaluation of the prototype. Chapter 5 presents a short overview of related work. Chapter 6 provides a discussion of ongoing and future work. Chapter 7 summarises and concludes.  Chapter  2.  Design  4  Chapter 2 Design The Mammoth file system requires multiple loosely coupled peers to cooperate in order to provide replicated storage of user files. Cooperation requires network communication. In this instance the target network across which Mammoth peers will communicate is the internet. The nature of the network connecting the Mammoth nodes and the desired behaviour of the file system are interrelated. The network's inherent latency, susceptibility to failure, and available transport protocols characterise the fundamental communication infrastructure within which Mammoth must operate. The desired distributed semantics of Mammoth dictate the file system data and metadata that needs to be transmitted. One of the basic design considerations in Mammoth is to ensure that any file system operation that requires network communication is tolerant of the inherent limitations of the communication protocol and conversely that the system provide all needed communication functions to ensure that Mammoth can meet its specified replication guarantee. This chapter presents the design of Mammoth's inter-server communication protocol. The overall system architecture is presented first. This shows how the system nodes are deployed and explains in broad strokes the distributed nature of the Mammoth system. The communication protocol design is then described in order to characterise how the individual system nodes are coupled and shows in detail the information that needs to be communicated. The design of the meta-data is also presented and includes a discussion of how it is used to direct the inter-server communication. The chapter concludes with a discussion about fault tolerance.  2.1  System Architecture  The Mammoth file system consists of file servers and clients. The design and implementation of the client-server interface is beyond the scope of this thesis; it is sufficient to understand that a client interacts with the overall  Chapter 2.  Design  5  system by communicating with any one of the Mammoth servers. Figure 2.1 shows the topology of the system. Clients and servers can be connected via a L A N or through the internet. There may be servers with no local clients and there may be clients with no local access to a server. Correct operation requires only that servers have access to one another via the internet and that clients can access at least one of the servers. The Mammoth file system uses geographic distribution of server nodes to increase overall system robustness. The more distant any group of cooperating server peers are from each other the less likely that any individual event could completely cut off system access or destroy all replicas. Unfortunately as geographic distribution increases there are negative impacts ^hat must be taken into account. These include the higher likelihood of individual node failures, network partitions, increased latency, and a higher likelihood of traversing congested network links. The Mammoth file system is designed to manage and reason about file data at the granularity of whole files. The replication requirements of each file are fully expressed within the file's meta-data. Each file's meta-data contains two important lists of Mammoth server-peers that represent a replica group and an interest group. The replica group servers are the system nodes which cooperate to store and maintain the replicated file data. The interest group servers are nodes in the system which have expressed a desire to be actively c  Chapter  2.  Design  6  informed of any changes to the file's meta-data. The replica group can be viewed as the main backing store for the file whereas the interested nodes are an important optimisation that allows for aggressive update notifications to be directed to nodes that can benefit from knowing a file's state without necessarily needing to immediately possess a copy of the data. Embedding the distribution characteristics of each file into the file's meta-: data has many consequences. The system now reasons on a file-by-file basis rather than arbitrarily grouping files into volumes or any other such agglomeration. Selecting the replica and interested group members can be optimised: for file access patterns, sharing relationships, and data safety requirements. : In fact by dealing with replication requirements at the file level Mammoth decouples any notion of replication behaviour from hierarchical file system structure. Systems that mirror directories for instance require a user to tailor his hierarchical organisation of information such that it resides in a particular directory. Mammoth allows the user, and especially groups of users, to use hierarchical organisation strategies for their intended purpose — to organise, data — as opposed to administering it. Explicitly representing the replica and interested sets by lists of named peers endows Mammoth with an easy method of encoding the results of any replica selection algorithm it may choose. A simple choice, for instance, might be to automatically associate all of a users files with a replica group that includes his workstation, his lab's file server, and his home machine. Another approach would be to employ a more complex algorithm that might involve an inter-administrative domain negotiation for storage resource allotment to ensure equity between storage providers. Another important property of these lists is that after even the most catastrophic network and node failures the file meta-data provides a great deal of forensic information that an administrator or automated tool could use to track down cooperating nodes. The key point is that these lists, although simple, provide for a great deal of future flexibility. This feature makes replica node selection modular, configurable and extendible. The replica list in particular should not be viewed as a prototype implementation artifact, it is an important design element. Of course having a separate list for interest nodes provides flexibility in other ways. Interested nodes are intended to be Mammoth nodes that are not responsible for maintaining a file's data but can provide local clients access to the file. They benefit from knowing if their locally cached copy is current or if they need to fetch an update from one of the replica nodes. 1  Chapter 2. Design  7  Mammoth's file-by-file approach is only the first of two major items that impact on inter-server communication. Mammoth also provides Elephantlike semantics for each file in the system. Elephant [17] is a file system which extended the traditional Unix file system semantics by preserving a file's history as a series of versions arranged in time. The birth of the Mammoth file system project was in fact an attempt to extend Elephant, a local file system, and transmogrify it into a distributed system. Mammoth remains true to the original's main goals with respect to a file's history, however, the nature of distributed computing enriched what was once a strictly linear arrangement of versions in time. A Mammoth file is represented as a collection of versions replicated across a set of nodes. Node or network failure, and to a lesser extent, network latency, makes it highly desirable to provide single copy availability. In a single copy availability system a client is able to access and potentially update a file if it has access to at least one replica. Providing this mode of access is a choice, not a necessary property of a distributed file system. It would be possible, and even easier, to devise a strict locking mechanism that would deny access to prevent conflicts. The earliest proposition of the Mammoth system was that conflicts arising out of allowing single copy availability could be effectively managed. The question that needed to be answered was: Is it useful, desirable, or even safe to allow access to a file if the system cannot guarantee that it is the most current version? The Mammoth system's ability to preserve the history of a file in a series of immutable versions visible to users coupled with system behaviour to prevent conflicts when possible supports an affirmative response to this question. The caveat is of course that a user knows in advance that they are using a Mammoth file system and that they do not need strict consistency guarantees for their data. Any collection of versions of an Elephant file can be arranged in a single row, ordered in time, with each version flowing directly from the last (any gaps are created by cleaning unwanted intermediate versions). The same property cannot be guaranteed for a Mammoth file. It is easily demonstrated that separate updates to replicas on disconnected peers would result in a fork, or branch, in a file's evolution of versions. A Mammoth file therefore, is a collection of versions that can be best described as a tree. In order to minimise network communication and avoid conflicts when possible Mammoth will use a distributed advisory lock system. The basic nature of the lock is single writer — multiple reader. The locking mechanism works at the server level as opposed to the user id. In order for a client  Chapter  2.  Design  8  to perform a write on a file the Mammoth server which is performing the operation must first become the owner of the file. File ownership is persistent between file system operations, therefore in the common case where a client accesses a file multiple times through the same server there is no need for any network communication between the Mammoth servers. The server holding the write lock will assume it has the most recent file version. The other interest and replica set nodes are informed of the identity of the current owner. This is however only a hint, if a node was unreachable during an ownership change it can hold stale ownership information. A transfer of ownership involves finding the current owner. The best case scenario occurs when the hint is correct. The server requests that file ownership be transfered from the current owner and the owner either grants or denies the request. Upon transfer of ownership the new owner can proceed with write permission for the file. The previous owner updates its owner hint. The current owner can update the hint at the other interested and replica servers. As long as file ownership is transfered by mutual agreement there is enough information in the system for the owner to be tracked down. Network or node failure can occur at any time which is why the lock must be advisory. In the event that the owner cannot be reached a node can assume ownership and perform updates. In this case there can be multiple owners of a file performing updates and generating new versions along different branches in a file's history. As the failures heal, the servers need to resolve the ownership. Most recent update is a possible choice. Regardless of ownership resolution the file's version history is no longer linear, it has become a tree with branches at all points where ownership was assumed.  2.2  The Communication Protocol  The previous section outlined the basic distributed nature of the Mammoth system. Mammoth is meant to have many server-peers spread over a potentially large geographic area. These peers are required to cooperate to provide replicated storage and client access to Mammoth files. The glue holding the system together is the network. This section outlines the communication messages required to accomplish file replication and meta-data update propagation. The initial steps to design a communication protocol involved identifying the inter-server dependencies for nodes cooperating to replicate files. The  Chapter 2. Design  9  nature of this relationship determines what information needs to communicated. This was an iterative process that included fine tuning.file system behaviour several times. Once the design for behaviour had matured and stabilised the following three communication modes were identified: • Pushing: data or meta-data is pushed to other servers. • Pulling: a request for data or meta-data is made to a remote server. • System management: servers communicate to monitor the state of the system and to recover from failure. Pushing and pulling are modes of communication that allow the Mammoth system to accomplish its replication and data management tasks. These modes are all about distributing the file and are driven by file system operations. System management is a separate mode of communication which acts to actively monitor the healthiness of the system. This can include monitoring node liveliness, new node membership, or fault recovery. The system management mode is not driven by file system operations but is essential for Mammoth to validate its replication guarantees. Subsections 2.2.1, and 2.2.2 discuss pushing and pulling operations, respectively. System management is discussed in Subsection 2.2.3.  2.2.1  Update Messages  When a client operation causes a change to the file system a meta-data update must be done. The operations in Table 2.1 represent file system events that generate update messages. Whenever an update message is generated the originating server must access the appropriate file or directory meta-data and send the meta-data update to all interested and replica nodes. Any, new file versions must be replicated to the appropriate number of replica nodes to ensure the replication policy has been met. The protocol must log any send errors that indicate that system consistency across nodes has been compromised. The receiving Mammoth server node must unmarshal the update message and apply the update to its local copy of the meta-data. Once the update is applied the servers in the union of the replica and interested sets that are reachable should have up-to-date meta-data. Local file system operations that only require access to meta-data do not need to initiate any more communication.  Chapter 2. Design  10  Create new file Create new directory Remove file Remove directory Rename file Rename directory Modify file data Modify meta-data Table 2.1: Operations requiring an update message  Update messages that propagate changes that modify the replica list, are a special case. The union of replica nodes represents the nodes within Mammoth that must provide storage for the file. Mammoth can authoritatively determine the state of a file as long as all replica nodes have a consistent replica list for that file and they are also fully connected. Replica list management and replication are discussed further in Section 2.4.  2.2.2  Request - Reply Messages  The consequence of propagating meta-data changes more widely than data changes is that servers need a set of pull messages. Each of these request messages will be paired by a reply message. The reply message will invariably contain the required information or a negative acknowledgement. The request—reply pairs are given in Table 2.2. Inter-server request-reply interactions are initiated whenever a local server requires data or meta-data it does not have. The server must request that a remote server send it the required information via a reply message. The following scenario is a typical example: node A is an interest node for file Foo and has received subsequent meta-data updates. Eventually node A is required to serve a current version of Foo. It consults its meta-data and discovers the current owner and that newer versions exist for which it has only meta-data. Node A requests ownership of the file and also the file's data for the current version. If the file is not locked and the owner can grant node A ownership it will send the information via reply messages.  Chapter  2.  Design  11  Request directory meta-data Request file meta-data Request file data Request directory data Request transfer of file ownership Request transfer of directory ownership Owner identification query Register interest in a file Register interest in a directory Unregister interest in a file Unregister interest in a directory  Table 2.2: Request—Reply Messages  2.2.3  System Management Communication  Although the push and pull messages directly support the basic functions of Mammoth with respect to file storage and access there is a great deal of system maintenance that requires inter-server communication. This class of communication was the last to be considered and is the least developed in terms of design. There are several areas which have been identified that will require some form of management communication: • Liveliness monitoring • Heavy weight recovery • New node creation • Orderly node deletion • Resource reclamation (i.e. cleaning) Liveliness monitoring is the only management communication that has been designed. There is a need to actively monitor the liveliness of Mammoth servers because both network and node failures can occur. Failures create the possibility that the replication requirements of a file or directory are no longer being met. When the file or directory in question is frequently updated the regular monitoring of update message delivery provides notification of failure.  Chapter 2. Design  12  Files which are mainly read-only or only rarely updated do not generate enough update messages to guarantee that replica depth is maintained within some given time bound. The liveliness monitor is essentially a small add on to the protocol that generates heart beats that allow servers to monitor the .: liveliness of the other servers with which it cooperates but may not regularly • communicate otherwise. Heavy weight recovery, node creation and deletion are related tasks. They will need to effectively provide the system with enough communication capabilities to manage a node's entire file system. These are not trivial operations given that each file within a system has different needs. The development of efficient methods to accomplish node-based management remains a challenging area of research. Resource reclamation - the cleaner - will be Mammoth's distributed garbage collector. The Elephant file system did have a cleaner but it is unclear at this time how a distributed cleaner should operate. Items that complicate deleting data from the Mammoth system involve historical interfile dependency, remnants of replicas that are missed by the cleaner because of long term failure, and cached meta-data that refers to deleted versions. Elephant does offer some hints to possible approaches such as an algorithmic approach to automatically determine landmark versions based on update frequency. Resource reclamation will be essential to the final Mammoth system and is also an area of future research for project members.  2.3  Meta-Data  The meta-data was designed to support the local, distributed, and replicated requirements of the system. The importance of the replica and interested list were discussed in Section 2.1. It was shown that these elements supply explicit information that can be used to direct messages. Mammoth's main purpose is however to supply the user with historical information about files. The history of a file needs to be encoded in the meta-data in order to provide each server with enough information to reason about a file's history. Mammoth needs to be able to determine how a file's composite versions relate to one another, if here are any missing versions, and how to select the current version. These local server operations lead directly to inter-server communication to obtain the required information. A version naming scheme was designed to express how a version relates to  Chapter  2.  Design  13  a previous version. There are three possible evolutionary relationships that need to be expressed: • the version begins a branch or is the root of the tree • the version is the current version • the version is an intermediate version Versions are named by a tuple that includes the node id and the creation time based on the node's local clock. This is however only enough information to form a linear history at one node. Each version exists on a named branch. The branch is named by another tuple which includes the name of the version which started the branch and the previous version. Versions and named branches are still not enough to establish when a new branch should be created. The distributed lock as discussed in Section 2.1 is used to provide the needed information. If ownership cannot be obtained but a server elects to create a version by assuming ownership then a new branch is created. The possibility of multiple co-existing branches with ownership status requires that the system implement an algorithm to elect and mark a version as current. There is currently no specified algorithm designated as a Mammoth standard. More research and experience with the prototype and a user study would be helpful in defining an intuitively acceptable algorithm that users would find beneficial. There also remains the possibility of implementing a tool level application that could access meta-data and select a current version based on user input.  2.4  Replication and Replica Set Management  The list of replica nodes is a particularly important element of meta-data used to maintain consistent across all replica's. Whenever replicas are added or removed extra effort should be made to update every member of the cooperating set of nodes. The benefit of maintaining a consistent replica set is that whenever all cooperating replicas can interconnect they form the most authoritative body within the Mammoth design. Recall that it is the replica set, not the interested set, that is responsible for supplying the backing store for file data. The replication policy for a file should specify enough replication, given the nature of the deployment network, that interested nodes can contact at least one replica within an acceptable amount of time. The  Chapter  2.  Design  14  case where an interested node is serving client requests while disconnected from the entire replica set does not indicate a broken system, rather this is precisely what is allowed given our assumption that single copy availability is preferable and that any inconsistencies can be identified when connectivity to the replica set is regained.  2.5  Failure Detection and Recovery  The final element in Mammoth's inter-server communication is failure detection and recovery. There are two classes of failure that impact communication: 1. Intermittent short lived network communication failures. 2. Permanent node or equipment failure.  2.5.1  Network Communication Failure  Network communication failures are deemed to occur whenever a message cannot be confirmed as having been received. All such failures are used to indicate that a node has possibly failed. Every Mammoth server must log these failures and take corrective action to maintain Mammoth's replication guarantee. The purpose of the failed message establishes the seriousness of the failure, i.e., as stated in Section 2.4, updates to the replica set are particularly important. The primary approach to message failure is to re-send the message after a short time out and to increment the number of consecutive failures. In the case of a replica failure that persists another replica from the set should be selected for that particular message. If the replica set is not large enough to accomplish the needed level of replication then another replica needs to be added to the set. The failed replica is flagged as a possible permanent failure. Repeated consecutive communication failures to an interest set node is also flagged as possible permanent failure since the node most likely serves as a replica for other files in the system.  2.5.2  Permanent Node Failure  Permanent node failure occurs in two modes:  Chapter 2. Design  15  1. A Mammoth server remains unreachable beyond a system defined threshold and is assumed to be permanently unreachable. 2. The underlying file system fails resulting in the loss of all information but the node remains reachable. Recovery from permanent failure requires that the system determine which replication policies are no longer met by the system. In some ways, each Mammoth file is its own file system and the permanent failure of areplica node is a failure for each of the files the node had maintained. The loosely connected file-grain replication model makes recovery a heavy weight operation. The system will detect these failures either because of failed communication during regular update propagation or because of the liveliness monitor. The remaining nodes are required to identify files requiring replication and must coordinate with the remaining replicas to recruit a new replica node.' It is assumed that the specified number of replicas are enough, and that permanent failures sufficiently rare, that a long recovery period would be acceptable. The actual mechanism for recruiting new replicas and redistributing files is not complete and remains an active area of research. One. possibility is using a database to cross reference replica nodes and the files they- replicate.  Chapter  3.  Implementation  16  Chapter 3 Implementation Once the basic design characteristics for Mammoth were in hand it was possible to begin constructing a prototype. The prototype Mammoth server is a modified user-level NFS server running under Linux. The prototype client is similarly modified from its NFS counterpart. The ext2 local file system served as the storage substrate onto which all files were written. The basic architecture of the prototype is shown in Figure 3.1. The use of an existing server provided a good code base onto which the added Mammoth semantics code be grafted. In order to implement the inter-server communication a Distribution Manager was added. This chapter discusses the process of building that Distribution Manager.  3.1  The Distribution Manager  Mammoth server nodes are required to perform a great deal of inter-server communication. These operations can include performing replication, updating replicated meta-data, or fetching non-local data to service client requests. The distribution manager performs these operations on behalf of the locallyrunning, modified NFS server. Mammoth's design requires that the distribution manager have the ability to access the file meta-data in order to direct the correct messages to cooperating server-peers. It also requires that the distribution manager detect communication failures. When possible, it hides any such failures from the user by taking corrective action transparently. The Mammoth design set forth in Chapter 2 identifies the needed communication messages and also the the meta-data that defines a file's distributed nature. The basic implementation strategy was influenced by the initial work done to modify the user-level NFS server. The existing file system operations had been extended to perform local file versioning and the Mammoth metadata structures had been implemented. Becoming intimate with the existing code base, the NFS server modifications, and the data structures holding the meta-data, had established an initial desire to develop the distribution  Chapter  3.  Modified NFS Server  r  Client  17  Implementation  Distribution Manager  Local Disk  Mammoth Server  Figure 3.1: Prototype Architecture  manager as part of the server rather than as an entirely separate system. A multi-threaded server architecture was seen as an ideal way to proceed. Other factors that promoted the use of threads were: multi-threaded programming techniques were well understood, a stable threading library was available, and a multi-threaded environment simplified data-structure sharing between the extended server and distribution manager. Message processing can be described in its most basic modes as either sending or receiving. The distribution manager's basic structure.observes this classification by dividing its functions between two threads. These threads run within the modified NFS server and are fully described in Section 3.2. Deciding on a multi-threaded architecture was only the first step in developing the prototype distribution manager. The distribution manager needed a basic strategy for organising, performing, and monitoring communication. The basic communication functions of each Mammoth server can be divided into three categories: 1. Pushing locally created information out to other Mammoth servers. 2. Requesting information from remote Mammoth servers. 3. Replying to requests made by remote Mammoth servers. The distribution manager uses one FIFO queue for each of the above categories. The queues are the primary method of organising the different types of communication. A detailed discussion of the queues is given in Section 3.3.  Chapter  3.  Implementation  18  While multi-threading and queueing were decisions that established the basic structure of the distribution manager, the actual network message construction and transport needed to be fleshed out. The requirements promoted the implementation of a basic, generic, reusable message structure, and socket-level communication. The guiding principle for developing the networking code was to keep it simple. What was needed was a system to test the overall design rather than outperform existing systems. Since the prototype was being used as a proof-of-concept and as a tool to fine-tune the design, it was highly desirable to preserve within the system a high degree of traceability. In terms of debugging, it was essential that whatever was happening via network communication be easily examined. The design and structure of the networking code is discussed in Section 3.4.  3.2  Communication Threads  The Mammoth Distribution Manager is divided into two threads. They are the mam-DMthread that manages outgoing messages on behalf of the locally running server and the mam^DMlistener thread that listens for requests made by remote Mammoth nodes. The Mammoth server is started in the same fashion as the original NFS server except for the additional spawning of the distribution manager's threads. Upon system startup the mam_DMthread begins monitoring the message queues for outgoing communication messages. The mamJDMlistener initialises by binding to port 9090 and then monitors the port for incoming requests.; All communication in the prototype is carried out over T C P / I P stream sockets. The motivation and consequences of using T C P / I P are discussed in Section 3.4. The distribution manager's communication activities are driven by Mammoth file operations. The flow of events- for the creation of a new file at one node replicated on one other node illustrates the operation of the distribution manager's threads. The first steps in the process occur at the server creating the file: 1. A Mammoth client creates a new file. 2. The local Mammoth server creates and stores the new file and creates the Mammoth meta-data file populating it with a default list of replica nodes. 3. During the operation an update message is placed on the update queue.  Chapter 3. Implementation  19  4. When mam_DMthread wakes up it detects that there is a pending operation on the update queue. It dequeues and reads the update to determine the operation and the nodes to which the message needs to •be sent. 5. The mam_DMthread now creates the network message that contains the operation performed and the file name. 6. For each Mammoth replica and interested node identified in the file meta-data the mamJDMthread opens a connection and sends the message verifying that the send is successful. At this point the remote server node is engaged and continues the process by performing the following steps: 1. The remote Mammoth server's mam_DMlistener thread detects and accepts the connection. 2. Upon receiving the message the remote node's mamJDMlistener parses it and detects that a new file has been created. 3. The remote mamJDMlistener then creates and populates the MetaDatalnfo for the new file and writes the meta-data file to the local disk. 4. The message contained a replicate flag so the remote mamJDMlistener enqueues a request message to get the file data on the request queue. 5. When the remote mamJDMthread runs again it detects that a request has been made. The request is dequeued and a network message is formatted and sent just as the original update message was. The focus of activity shifts back to the original server. It needs to locate and transmit the file data to the requesting node: 1. The local Mammoth server that first generated the update now receives the request message through its listener. 2. The local mam JDMlistener processes the message by getting a file handle to the required file and enqueues a reply message on the reply queue.  Chapter  3.  Implementation  20  3. The reply message is detected when the local mam_DMthread wakes up and is processed by opening a stream socket and streaming the file data to the requesting node. The file and file meta-data for the newly created file has now been replicated. Section 3.3 discusses how file system operations generate messages and how those messages are managed within the system. Separating the incoming and outgoing network communication is beneficial for both the design and implementation phases for the distribution manager. There is an inherent structure within the system that mirrored the logical structure of the communication requirements. Send operations in one module are naturally matched with receive operations in the other. Essentially one module becomes "the server" and the other becomes "the client" in a client-server system. The process of actually getting the prototype to operate correctly is eased because each operation is implemented and debugged incrementally. The debugging environment consists of starting two Mammoth servers but having each run only one of the distribution manager's threads. A framework listener that accepts and parses messages is easy to code and helps verify that the manager is encoding and sending proper messages. A basic implementation cycle evolves: 1. first the send message is implemented, 2. then the send message is verified at the listener, 3. then the listener is extended to parse and process the message, 4. followed by verifying the listeners behaviour, 5. finally implementation of the next send message... The ability of the system to be incrementally implemented, one message at a time, illustrates the inherent extensibility of the prototype.  3.3  Mammoth Queues  The Mammoth file system uses three FIFO message queues to organise and manage message processing. The update, request, and reply queues. Each is protected by a mutex lock to prevent concurrent access by multiple threads. File system operations that occur at the local Mammoth server can generate  Chapter  3.  Implementation  21  update messages that need to be sent out to other Mammoth servers. The operations that generate update messages were shown in Table 2.1. Essentially update messages arise whenever there is a need to inform another Mammoth server about a change to a local file. These changes can be a result of client operations executed on a file or by the Mammoth server changing the replica or interest group membership. These operations represent changes that have been made persistent on the local disk. During the course of normal operation of Mammoth there are file system operations that cannot be performed immediately because the local server does not have up-to-date information or ownership of a file. For example a server receives a client request to open an existing file for writing but a remote server currently owns the file. Another example is a node that wants to open a version of a file that it knows to exist because it possesses version meta-data but not file data. When a local Mammoth server needs to contact a remote server for information about a file or directory it generates a request message and places it on the request queue. Each request message is paired with a reply. These message pairs are listed in Table 2.2. The reply queue holds messages that the local server has created in response to requests it has received from remote Mammoth servers. The reply will hold the required information of a failure message. Failure to reply will cause the file system operation to fail. The queues for updates and requests hold pointers to Mammoth file handles. File handles are used because they have already been created and populated with the required meta-data by the server thread that has been activated by a client request. The actual network message is constructed within the mam_DManager thread when the request or update is dequeued. The queue for outgoing replies holds a pointer to the prepared message. The iocal server does not necessarily have the the file opened in the server thread thus there is no file handle. The distribution manager opens the required meta-data file creates, the network message, and places it on the reply queue.  3.3.1  Mammoth File Handle  The Mammoth file handle is an extension of the file handle that the original NFS server used. The original motivation for this approach was to provide the main server with access to all the extended Mammoth meta-data. During the course of operation the server uses the file handle to access and update  Chapter  3.  Implementation  22  the Mammoth file and the meta-data in the same fashion that a normal NFS server uses the file handle to service client requests. The benefits of using the file handle occur when the distribution manager dequeues the file handle and has immediate access to the replica and interest sets as well as pointers to all file descriptors for the file data and meta-data files. Since Mammoth's communication is driven primarily by file system operations and directed by the contents of meta-data, the file handle represented a method to encapsulate directions to the distribution manager. Additionally the multithreaded architecture made sharing the file handle relatively easy. The distribution manager safely ignores most of the contents of the file. handle since it is only concerned with access to the in-memory Mammoth ' meta-data. The structure of particular interest to the distribution manager is the MetaDatalnfo. The distribution manager uses the information in the MetaDatalnfo to construct and address the appropriate network messages. Figure 3.2 shows the relevant contents of the extended file handle and other: data structures. The MetaDatalnfo contains four items of primary interest to the distribution manager: • The FileHistory or DirHistory is a struct that contains all the necessary information to identify the operation performed. • The ig is an array of Mammoth nodes that represent the interest group. • The rg is an array of Mammoth nodes that represent the replica group. • The FileMD or DirMD is a struct that identifies the history branch of the operation, the number of interested and replica nodes (the igc and rgc), and the number of entries in a file's history (the fhc). When the distribution manager's mamJDMthread dequeues a file handle it parses the FileHistory or DirHistory to identify the operation being propagated. The FileMD indicates the number of interested and replica nodes in the ig and rg, respectively. A network message can now be constructed as outlined in Section 3.4.  3.4  Networking  The networking functions of the distribution manager can be described in terms of creation, transition, reception, and parsing of network messages.  Chapter  3.  Implementation  23  mam info  FileHandle  pointer to mam_info — pointer to MetaDatalnfo  unsigned char pending unsigned char version unsigned char moved struct stat oldst unsigned char open unsigned char created time_t opent  MetaDatalnfo FileMD  ServNode BrCrNode time_t BrCrTime time_t move_t unsigned int igc unsigned int rgc unsigned int fhc  file descriptor fd MetaDataType FileMD DirMD file history fd pointer to FileHistory MamQ fhl dir history fd pointer to DirHistory MamQ dhl pointer to ig pointer to rg unsigned int refc  FileHistory  MamFileOp op ServNode VerCrNode time_t VerCrTime ServNode BrCrNode time_t BrCrTime time_t move_t unsigned int plen char *str  Array of ServNode Array of ServNode Figure 3.2: T h e Extended File Handle  Chapter  3.  24  Implementation  Message Type Replicate Flag Message Length Message Start  Message End  VARIABLE LENGTH  Figure 3.3: Basic Message Structure  This section outlines these functions by defining the message structure and then characterising the transport protocols used. A supplementary section discusses network faults.  3.4.1  The Mammoth Network Message  The primary function of the distribution manager is to perform network communication on behalf of the Mammoth file system. This task involves marshaling, transporting, and unmarshaling messages in order to replicate data and meta-data. In order to accomplish the task a basic message format was implemented. The Mammoth network message is illustrated in Figure 3.3. The message format is designed to be very flexible. There are three initial fields which indicate the type of message, whether the information needs to be replicated, and the length of the remainder of the message. The message data is of variable length and is simply an array of bytes. The length field is used to verify that the message was unmarshaled correctly upon reception. The message type field is used to select the appropriate message parsing and processing routine. The replicate flag is used to indicate that the receiving node has been selected to replicate any data that the message refers to (see the example in Section 3.2).  3.4.2  Underlying Transport Protocol  The use of T C P / I P is particularly important to the implementation of the prototype distribution manager because it provides a protocol that meets the underlying communication needs set out in Mammoth's design. The Mammoth file system communication protocol presupposes reliable in-order delivery of messages. The distribution manager uses one send call for each message and verifies its success. T C P / I P provides reliable stream-based communication. Each of these key properties simplify and enhance the prototype. The reliable nature of  Chapter  3.  Implementation  25  the protocol obviates the need for an explicit acknowledgement, thus reducing the total number of messages required to implement the distribution manager. Reinventing reliable message delivery can be done but at the cost of increased implementation complexity. T C P / I P effectively hides all the machinery required to implement this reliability while providing a well documented, robust, widely available protocol. The only added complexity comes from implementing Mammoth specific message management. Stream-based, or connection oriented communication, allows the implementation and use of variable length messages. T C P / I P effectively handles transporting these messages without any extra need for Mammoth to segment long messages, detect and repair out-of-sequence messages, or deal with duplicate messages. Since there is no bound on message length the prototype remains highly extensible. This is highly desirable given that new messages are much easier to add. The main drawback of using T C P is its increased overhead. The overhead is a direct result of the required bookkeeping that the T C P implementation must do as well as the setup of the connection-oriented socket. In order to mitigate the expensive operations of setting up and disposing of socket connections a socket cache was implemented. The cache maintains socket connections between cooperating nodes after inter-server operations. The cache uses a L R U replacement policy when it becomes full. Maintaining connections rather than throwing them away allows Mammoth to amortise the setup cost over several inter-server communication messages. Experience using the cache with the prototype is limited because the testbed for Mammoth has remained small, thus even a small cache size, such as 20 connections, never requires connections to be ejected. The socket cache has functioned well within this limited setting. Whether or not the socket cache retains its performance enhancing properties in a large scale deployment depends on how often connections are ejected out of the cache. The following scenario provides an example of a deployment where the cache continues to function well (a counter example follows): When fully deployed there are potentially hundreds or thousands of Mammoth server nodes. These nodes are primarily user workstations that act as both client and server. Any given peer in the system tends to communicate with only a subset of the entire population. The socket cache performs well in this scenario if the bulk of communication of any given server is to a small subset of peers, perhaps on the order of tens. The cache is particularly well suited to this scenario when inter-communication with this small subset is  Chapter 3. Implementation  26  frequent. This scenario appropriately fits an environment where the set of actively changing files within a given subset of cooperating peers remains small or the union of the distribution sets of active files contains mostly members of the subset. The socket cache degrades the system performance in any situation that leads to frequent cache ejection before a given connection is reused several times. This occurs in a fully deployed system where there is no well defined subset of cooperating peers. Each file in the system tends to have its own unique, unrelated distribution requirements. The number of active files on the system is high requiring frequent connections to uncached peers. ;  3.4.3  Communication Failures  The distribution manager monitors the communication connections for failure. The likelihood of a failure occurring increases as the number of nodes increases. Failures can be temporary or permanent and arise either because of network or node failure. When a communication failure occurs the type and duration of the failure is not known. The basic assumption Mammoth makes is that transient failures are much more common than permanent failures. The distribution manager is well designed to prevent the need for heavy weight recovery operations when failures are short lived and related only to communication. The first step the distribution manager takes when a message fails is to enqueue a copy of the message on a special failed queue. The failure can now be handled out-of-band of the normal distribution manager function. A separate communication thread can periodically wakeup and process the queued messages. When a failure is short-lived the message is delivered as normal with the only difference being the observed latency. After a threshold amount of time the failure is considered to be long-lived and is logged for failure recovery. When communication failures persist beyond a given amount of time Mammoth must use a more heavy weight mechanism to verify that all file replication policies have been maintained. If the remote node is classified as a permanent failure there are potentially many files which need to be rereplicated. The failed node must also be flagged so that if it does become available it is notified of files from which it has been removed as a replica or interested node. Currently heavy weight recovery of this type is not implemented and will be done as an add-on module in the future.  Chapter  4 . E v a l u a t i n g the P r o t o t y p e  27  Chapter 4 Evaluating the Prototype The Distribution Manager can be evaluated with respect to its functionality, performance and architecture. A n evaluation of functionality analyses the completeness of implementing the design and achieving the goals set forth at the beginning of the project. Evaluating the architecture of the prototype is a more subjective process which presents observations made during and after building the system. The performance of the distribution manager was evaluated within the context of the overall Mammoth file system.  4.1  Functionality  The distribution manager was able to implement all the required message processing to allow the system to execute all of its file system operations. Files and directories were created, accessed, and modified as if using a traditional Unix file system. The system was able to parse the meta-data, construct the required messages, transmit them across the network, and process them at a remote node. Network and node failures were actively detected whenever message sending failed. The undelivered messages were queued for re-send and used to recover from short term failure. The behaviour of the failure recovery however cannot be described as robust. Much work remains to improve fault tolerance and to develop reasonable time lines for establishing failures as permanent. There remains a few areas of the file system that have not yet been developed. There are currently no algorithms implemented that automatically elect new replica nodes from a group of available servers in the event of a permanent node failure. File system recovery for nodes which have returned after having been deemed permanently failed are also not yet designed nor implemented. Finally, no security measures are taken that are considered adequate in an internet scale file system.  C h a p t e r 4.  4.2  E v a l u a t i n g the P r o t o t y p e  28  Architecture  The architecture of the distribution manager was heavily influenced by using an existing NFS server as a code base. A multi-threaded implementation proved a good approach because it allowed development of common data structures and allowed sharing of those data structures between threads. The distributed nature of the files is encoded in the file meta-data. Coupling the communication subsystem and the file system together helped to provide insights into how each influenced the other. Extending a complicated user level NFS server also had many drawbacks. The code base is an implementation of an entirely different system and is designed to be a stateless as possible and deployed in a L A N environment. NFS overhead comes prepackaged with our prototype and makes evaluation more difficult. It also promoted working with the client server interface as a first step rather than concentrating on inter-server communication. A better approach may have been to concentrate on modelling and implementing the desired distributed behaviour of Mammoth files, perhaps in a very high level prototyping language. The model is then be used to extend an existing file system, the final step is providing client access.  4.3  Performance  The performance evaluation presented herein shows that the system is functional and can achieve the needed inter-server communication. The experimental environment is a 100Mb ethernet L A N populated with 4 P C workstations each equipped with a single 266MHz Pentium II processor and 128MB of R A M . Each result is the median of 1000 trials. Table 4.1 shows the overhead associated with performing remote operations. The time for the operation to complete locally is compared to the time for the remote node. Meta-data updates for file create are higher than for other updates because the meta-data files need to be created, replication and interest policies also need to be created and transmitted to the remote node. Other updates need only modify the existing meta-data. The cost of replication is given with respect to each additional 4KB of file data that needs to be sent. The Andrew file system benchmark was run with a single primary server and 0, 1, 2, and 3 remote nodes. Table 4.2 shows the cost of propagating  Chapter  4.  E v a l u a t i n g the P r o t o t y p e  29  meta-data to remote nodes while Table 4.3 shows the cost of performing data replication. Propagating meta-data shows the expected linear growth for adding additional nodes. The replication results show a decrease in time for 3 nodes as compared to 1 or 2 nodes. The measurements capture the time it takes the main thread on the primary server to complete the asynchronous operation. In the case of the 3 replica node result, the background replication thread continues to run after the trial is complete. The micro-benchmarks presented are unfortunately not the best measurements to fully evaluate the prototype. A much more meaningful measure is gained by using Mammoth with widely dispersed nodes connected via long haul links. Additionally actual user interaction with the system is required given that Mammoth presupposes the additional overhead of replication and maintaining a file's history is worthwhile. The micro-benchmarks do give an indication of the upper limit of any one node but fail to evaluate whether the basic premise of Mammoth is useful to the end user. Unfortunately the prototype system is not ready for a large scale user study. A large scale study has the potential to generate rich sharing and placement relationships and also to exercise the Elephant semantics of the system. Evaluating how the system is used and the number of times Mammoth succeeds and fails at providing access to a given file in the face of failure are really the best way to evaluate the prototype.  Chapter  4.  Operation  Evaluating the Prototype  Local  Node  30  Remote  Node  (Ms)  (fis)  M e t a - d a t a updates - file create  123 + 6 9 per remote node  9660  M e t a - d a t a updates - file u p d a t e  10 + 6 9 per remote node  3010  145 + 102 per 4 - K B  417 + 2 0 3 per 4 - K B  Data Replication  Table 4 . 1 : Overhead  Number of Servers Time (in seconds) 0  18.6  1  18.7  2  18.8  3  18.8  Table 4.2: Propagating meta-data.  Number of Servers Time (in seconds) 0  18.6  1  19.2  2  19.8  3  19.0  Table 4.3: Replicating data.  Chapter  5.  Related Work  31  Chapter 5 Related Work Providing distributed replicated storage is the common design goal of many systems. The motivations for this goal are often common as well: high availability, fault tolerance, and resource management. What changes from system to system is the environments in which they are developed and must operate. Environmental factors include: hardware and infrastructure, usage patterns, and user expectations. Environmental factors have a large impact on how a system implements its design goals, what assumptions about a system's behaviour are valid, and what compromises need to be made to obtain a particular goal. This interplay between environment and design goals has generated many inter-related systems. Mammoth benefits, and is inspired by, previous and related work. This chapter provides a discussion of how Mammoth fits in with other systems that either provide distributed replicated storage or have similar file system semantics to Mammoth. Additionally, all distributed storage systems need a mechanism to inter-connect their nodes. Mammoth's distribution manager is compared with the networking approaches used in related systems. Related file systems will be presented first, followed by examples of alternate storage models.  5.1  File Systems  Files are one of the oldest abstractions computer systems have used to implement stable persistent storage. It remains a very powerful abstraction because it is a very natural model for humans. It is also a good model for computers because it is a generic data containment model. File systems are appealing also because of their well defined interfaces and the relative ease of deploying new systems. A l l of the file systems that will be presented follow the traditional file and directory metaphor whether they are local, networked, or distributed.  Chapter  5.1.1  5.  Related W o r k  32  Local File Systems  Mammoth file semantics are inherited from the Elephant file system[17]. Santry et al. posited that a file system should by default preserve the previous versions of a file. Essentially they argued that given the current capacities of local disks the current default behaviour of most file systems, over-writing or abandoning the old data in favour of the new, was an anachronism. Elephant is a local file system whose prototype was implemented for FreeBSD using the vnode interface. A n attempt at distributing the system across many nodes gave rise to the Mammoth file system. In order to achieve the desired semantics and offer single copy availability Mammoth expresses file evolution as a tree of versions rather than as a linear succession. There is another local file system which is relevant to Elephant and to Mammoth. The C E D A R file system [4] effectively used versioning to simplify cache consistency management. Mammoth uses versioning to accomplish a related task - tolerating network latency in propagating file updates. The key property being exploited is that versions are immutable once created. Mammoth names versions by creating node, creation time, and direct ancestor version so that there can be no conflicts when arranging the historical versions in a tree representing the file's evolution.  5.1.2  Client-Server Systems  The LOCUS[20] distributed operating system employed a distributed file system that used replication to improve availability. The entire system shares the exact same view of the file hierarchy. Portions of the hierarchy are organised into groups and those groups are then associated with a number of possible storage sites. The replica group in Mammoth serves much the same purpose for file replication. A single node in LOCUS is responsible for synchronising access to the files within a given group to ensure that only the most current version of a file is visible. When there are conflicts the system can use automatic or user-level interaction to recover. The communication framework is directly supported by the operating system and is dependent on an ethernet network deployment. The Andrew file system [11] partitions the file system into many logical volumes spread over a group of centralised servers. The servers are tightly coupled via an ethernet and management of the server data store occurs at the granularity of logical volumes. A T C P based protocol exists between clients  Chapter  5.  Related Work  33  and the server. The cache manager uses T C P to transfer files to and from the servers for local access and modification. Mammoth's distribution manager provides similar file data transfers but supports inter-server communication as well. Andrew files are transfered and cached at clients before they can be read or modified. The server uses callbacks to invalidate a client's cache. Files are written back to the server after modification when the file is closed. Read-only volumes are replicated at many servers to increase performance. Mammoth has no notion of a volume of file group. Each file is individually maintained by its own group of replica servers. The CODA file system[18] is one of the best known distributed file systems. It, like Mammoth, uses a client-server model to provide user access to files but a separate inter-server communication channel to provide the cooperative storage. CODA's servers are however tightly coupled. The system also only preserves old versions of data when update conflicts have occurred due to disconnected operation. The conflicting versions of the file are preserved in a tar file and must be manually resolved. Conflicts in Mammoth are exposed to the user as branches in the file's history. C O D A differs from Mammoth in how it distributes replicas. Venus, the client cache manager in CODA, is responsible for transferring new file data to all replica volumes. It places much more responsibility on clients than the Mammoth system. Mammoth's distribution manager handles all distribution within the system on behalf of the client. The H A R P [10] file system also uses a replica group but at the granularity of an entire Unix partition. One server in the group acts as a primary and the others as secondary. A 2 phase commit protocol that ensures that operations are only committed after all secondary servers have replied. The ECHO[l] distributed file system used a single distributed file system hierarchy with volumes, primary and secondary replicas and strong .consistency through the use of write tokens. The inter-server communication for servers replicating a file volume is not asynchronous as it is in Mammoth. Clients are blocked until the primary server has been able to contact and store the file data on the replicas. Clients locate the appropriate primary server via a naming service built on top of the internet's DNS system. xFS[21] is intended to operate in the wide area but requires the participating servers to be arranged in a hierarchy. A n invalidation based write back cache coherence protocol is used to maintain strong consistency. Writer reader locks are used. The holder of the write lock is the owner. Ownership is only relinquished upon other client requests. Mammoth servers are similar  Chapter 5. Related Work  34  in that they relinquish ownership only upon request. Mammoth uses a looser consistency model that will allow servers to become the owner of a file by creating a branch in the file's history. Ficus[14] also uses a volume approach where all files within a volume share the same replication location. Single-copy availability is used to ensure high availability. Update conflicts are managed through the use of logs and version vectors. Whole file updates are atomic. Mammoth displays much the same behaviour but at the file as opposed to the volume level. Archipelago[7] distributes a file system at the directory level. It employs a hash on the directory pathname to map to a node id where the the data is to be located. Mammoth uses a list of replica nodes encoded within the file's meta-data to achieve distribution and replication.  5.1.3  Peer-to-Peer Systems  Napster[12] was a well known peer-to-peer file sharing application that was geared to allowing users to freely exchange mp3 files. The service used a centralised searchable database that indexed the location and content of other clients. Clients then were able to make peer-to-peer connections and copy selected files. Mammoth has no central searchable database and is geared towards use as a read-write file system. Gnutella[5] is a user-level peer-to-peer application intended to make file sharing across the internet anonymous. It behaves much like Napster but has no centralised database and protects the anonymity of users from each other. Mammoth is less paranoid, it actually exposes its list of cooperating peers in its meta-data. Freenet[2] is also primarily concerned with protecting the anonymity of authors and readers. The system is a peer-to-peer network of nodes that allow authors to publish files by hashing a string describing the file and making the description public. Freenet's communication framework is constructed so that it is infeasible to directly discern the actual storage location of a file. In order to accomplish this each peer in the Freenet system needs to agree to pass on messages on behalf of other peers and also to agree to serve a a return conduit for responses. Mammoth's distribution manager directly communicates with cooperating storage servers. Jetfile[6] uses a reliable IP multi-cast to inter-connect peers. Optimistic replication is used to improve performance. Jetfile uses immutable versions and version numbers to aid in conflict reconciliation. A versioning server is  Chapter  5.  Related Work  35  used to increment version numbers so that they are globally unique. Replication and update propagation is achieved by hashing the file id to a multi-cast channel. Mammoth uses pair-wise sequential communication and relies on its meta-data to locate cooperating peers. The IP multi-cast infrastructure, of the network serves much the same purpose as the distribution manager in Mammoth. The distribution manager's IP network requirements are lower, than those of Jetfile. PAST[16] is a peer-to-peer storage system built on top of PASTRY[15]. P A S T R Y is an overlay network that interconnects nodes with IP addresses. Each node has a unique numerical id. Each node tracks a small number of its neighbouring node ids in a table. These tables serve as routing tables. PAST uses a hash to produce a numerical file id for each file it wishes to store. The file id's most significant bits are used to route the file to a specified number of replicas whose bits most closely match. The use of P A S T R Y is a completely alternative method of managing distribution compared to Mammoth's replica and interest groups. The use of PASTRY completely obviates the need for a separate distribution manager. Individual nodes in PAST can choose the number of nodes a file is replicated to but have no control over which nodes PASTRY chooses. CFS[3] is a read-only file system. The system distributes files at the block level. A hashing system is used to distribute and locate the blocks. Mammoth distributes data at the granularity of files. The distribution of a file is dependent on the explicitly listed nodes within the file meta-data.  5.2  Alternate Storage Models  File systems are perhaps the most popular abstraction when dealing with data storage but there are other models. This section discusses a few of the alternate approaches. Mammoth has much less in common with these systems than with those mentioned in the previous section. Petal[9] uses a distributed disk metaphor. Rather that exposing a file system interface, it provides storage through a disk driver interface. Petal supports system backup through the use of an efficient snapshot technique. A snapshot causes the system to behave with copy-on-write semantics for subsequent writes. Frangiapanni[19] is a file system that uses a petal disk as its backing store. It is meant to run on a tightly coupled set of nodes. The BAYOU[13] system is an example of another approach to distributed  Chapter 5. Related Work  36  data storage. It uses a relational database and requires the application to supply the reconciliation semantics for conflicting update operations. Updates can occur at any replica but are logged. Conflicts are detected when logs are merged. Mammoth does not merge updates, rather it preserves a series of immutable versions. This difference is expressed in the very different approaches to communication. B A Y O U must log all updates and globally perform these updates in the same order. A n anti-entropy protocol is used to ensure that all updates are eventually propagated across all replicas. The communication model used in Mammoth includes inter-server messages that allow a server to become the owner of a file before updates occur and then obligate it to send update messages to all other interested nodes. Failures are logged and resent aggressively. Oceanstore[8] is a very ambitious one-size-fits-all approach to global storage. It uses a typed object metaphor as its storage model. Objects are given globally unique identifiers and are located though a system-wide data location and routing infrastructure. Oceanstore integrates security, resource management, and untrusted nodes into an internet scale system. Mammoth infrastructure for locating cooperating peers is limited to the interest and replica lists and the internet's ability to route messages correctly. Conceptually the Mammoth approach is much easier to implement but also more limited.  Chapter 6. Future Work  37  Chapter 6 Future W o r k The Mammoth file system is an ongoing research project. There have already been extensions and modification to the design and implementation of components described in this thesis. There has been some additional communication added to the system. The most significant, addition has been the design and implementation of a two phase commit protocol used to monitor system state with respect to group membership. The most important system behaviour that remains to be designed and implemented is a robust, well defined fault tolerance protocol. Currently only a very naive logging and re-transmission scheme exists. While this allows the system to recover from short term network failures it does not meet the required specifications set out in Mammoth's design. System failures that need to be addressed are: • Permanent catastrophic node failure: upon such a failure there is no mechanism in place to efficiently determine all the files which need to be re-replicated. • Long-term node disconnection: a node which remains unreachable for a long period of time is considered a permanent failure. When this node becomes connected there is no policy or mechanism for it to follow such that it orderly re-integrates into the system. Problem areas include relinquishing ownership, interest, and replica status. • Replica selection: when a replica node is identified as having failed there needs to be a mechanism to elect a new replica and update the replica group. There needs to be more experimentation to determine what default replication policies are be most useful. There is only one policy in the system and it is hardwired to replicate quite aggressively. Possible improvements are: • A mechanism to select replicas for geographic location to enhance data survivability;  Chapter 6.  Future Work  38  • an algorithm to place data on servers close to the users who access the data to improve availability and responsiveness; • a policy to prevent data or meta-data being placed where physical security of the machine compromises data security; • a method of accomplishing load balancing by distributing data across replicas. Allowing the replication algorithms to be administrable is also highly desirable. There is no method of distributed resource reclamation. A cleaner is needed to remove old, unused, or otherwise unimportant files. A possible direction to take is to establish a retention policy as part of a file's metadata that could be run locally at each node that would incrementally remove files as they age or reach a threshold number of versions. A n obvious problem is having to deal with client accesses for data that has been removed because the cleaner has run. A possible alternative strategy would be a spider that would run through the replica and interest nodes in a mark and sweep fashion. Dealing with unreachable nodes is the hurdle that needs to be cleared in this scenario. Mammoth has no mechanism to track and impose inter-file dependencies when replicating. Files can often be inter-related, i.e., a large source tree of a software project; granting access to a replicated file when others upon which it depends are unavailable is most likely not beneficial. A n efficient method of identifying these dependencies is needed to increase Mammoth's usefulness. There are also many system improvements that can be accomplished at the client-server interface. These include: • Currently whenever a local server cannot service a client request from a local replica the server fetches the data then services the request. A possible improvement to the system involves enhancing the client and server so that a remote server can service the client request. This can be accomplished by either informing the client of the identity of a replica server or by having the remote server send the data directly to the client. • The prototype client is a modified NFS client accessing the server via traditional R P C and does not cache any files. Given that typical work-  Chapter 6. Future Work  39  stations are equipped with vast amounts of available disk space and excellent connectivity ( L A N , W A N , wireless) improving the client to act as a replica server for the workstation owner's files is a potential future development. The client then benefits by using local file system calls but the likelihood of operating in a disconnected state increases. Mammoth is intended to operate over a wide geographic area and be interconnected via internet protocols. Neither the client-server nor the interserver communication facilities are appropriately secure. Questions of user authentication and privacy need to be addressed. Since the prototype uses connected sockets, a transport layer security approach such as SSL is available. Secure R P C is a possible alternative for the local client-server interface. Mammoth's communication framework continues to evolve. There is a current effort being undertaken to implement a reliable UDP transport mechanism to take the place of the current T C P / I P implementation. Although there is little question that U D P will provide inherently quicker sending of small messages, the added complexity does make working with and extending the prototype more difficult. There is a significant amount of bookkeeping required to re-implement reliable message transport and five extra threads have been added to the system. Given that Mammoth remains a prototype yet to achieve its main goal as proof-of-concept, work to optimise communication may very well be premature. Performance is not the only motivation for replacing T C P with UDP. The current implementation consumes up to 20 file descriptors for cached socket connections. The proposed UDP approach presupposes that the use of T C P will cause the file descriptor cache to be flushed often and that this will result in an unacceptable degrading of performance. This may certainly be true for a large node operating as a file server for many clients but is less an issue for a peer-to-peer deployment of cooperating personal workstations.  Chapter  7.  Conclusion  40  Chapter 7 Conclusion Users want highly available file systems that provide them with access to an ever increasing set of data files. Those files must be reliably stored and easily shared between different computers and possibly different users. A d ditionally, they want to be protected from all possible losses of data, be they human error or system failure. The Mammoth file system was developed in order to experiment with a system providing these properties. The Mammoth file system's main goals are to extend the Elephant file system semantics to a distributed file system. A hybrid of peer-to-peer and client-server models is used. The servers grant access to clients much as in an NFS file system but the servers are inter-connected in a peer-to-peer network. The servers cooperate on a file-by-file basis to provide replicated storage of a file and all of its historical versions. The servers can be interconnected over a wide geographic area to improve survivability and availability in the face of catastrophic disaster. In order to improve scalability there is no notion of a file grouping or volume. A file's replication and distribution requirements are encoded in its meta-data. This thesis presented the design and implementation of the Mammoth file system's Distribution Manager, the module responsible for inter-server communication. The Distribution Manager is able to access and parse a file's meta-data to determine which remote nodes in the system cooperate with respect to a given file. There are two lists of Mammoth server nodes encapsulated in a file's metadata. These represent a list of replica nodes and a list of nodes interested in knowing the state of the file. The replica group cooperates to provide the backing store for replicated data while all members of the interest group are kept informed of meta-data changes to the file. Interest nodes only fetch new versions of data when they are required. Replica groups, on the other hand, need to actively cooperate to ensure a file is replicated widely enough to maintain high availability. Replica groups introduce an inter-node coupling, or dependency. This coupling does not present a scalability concern as long as replica groups remain small and the union of all replica lists represents  Chapter  7.  Conclusion  41  only a portion of all Mammoth servers. File data consistency is managed by making every version immutable and naming the versions by their creation node and time and direct ancestor version. This scheme preserves enough information such that all available versions can be arranged in a tree representing how they are related to one another. The system uses a distributed advisory single reader multiple writer lock to prevent branches whenever possible. The distribution manager partitions its message processing between two threads. The mam_DMmanager thread performs all outgoing inter-server communication while a mamJDMlistener thread monitors the network for messages from remote Mammoth nodes. Messages are classified as being either updates,requests, or replies. There are three queues used to manage message processing across thread boundaries. Each queue is used to manage precisely one message class. The use of queues allows the distribution manager's threads to process messages asynchronously. This behaviour is particularly evident in the processing of the request-reply message pairs. A request does not block the distribution manager; the mamJDMlistener thread will place the reply on the reply queue whenever it arrives. Using multi-threading and message queues to accomplish asynchronous message processing resulted in an implementation that is easy to work with. This approach allowed the distribution manager to be incrementally implemented. Partitioning the message processing between two threads provided an implementation environment where each thread could be separately tested more easily. Using a single thread is an alternative strategy for implementing Mammoth's distribution manager. The update message processing would not change but the processing of request-reply pairs could be achieved by using a data structure to indicate whether or not a request had been satisfied by a reply. Asynchronous message processing can be maintained by using signals-based programming techniques. The remote procedure call interface represents yet another approach that can be taken. R P C hides the network interface from the developer, and provides non-blocking calls thus providing a method to preserve asynchronous messgae processing. Mammoth's current implementation actually benefits from using the socket interface. The node lists encoded in the file metadata promote an implementation that can use the distribution information directly. Experience with the prototype has illustrated that, although design and  C h a p t e r 7.  Conclusion  42  implementation of the main operations is straightforward in the absence of failure, most techniques used to increase scalability, i.e. file-by-file replication and loose coupling, result in more complicated fault recovery operations. Efficient methods of enumerating the files which need to be replicated due to a permanent node failure remain undefined. Further development of the prototype and a larger user study are required to examine optimal parameters for the number of replicas, replica node selection policy, actual user behaviour given the added file system semantics. The prototype distribution manager uses T C P / I P to perform inter-server communication. The stream-based, connection oriented protocol provides an excellent off-the-shelf mechanism to build the inter-server communication protocol. In order to amortise the expensive operation of connection setup a socket cache is used to keep connections alive between communication operations. The file system communicates with the distribution manager by enqueueing operations that require inter-server communication. Three queues used for update, request, and reply messages allows the distribution manager to perform the communication asynchronously. The prototype is proving to be a useful tool for experimenting with Mammoth files. The framework for accomplishing update propagation and data replication is complete. The appropriate policies for file replication to guarantee availability are being developed.  Bibliography  43  Bibliography [1] A . D. Birrell, A . Hisgen, C. Jerian, T. Mann, and G. Swart. The Echo distributed file system. Technical Report 111, Digital Equipment Corporation, Palo Alto, C A , USA, October 1993. [2] Ian Clarke, Oskar Sandberg, Brandon Wiley, and Theodore W. Hong. Freenet: A distributed anonymous information storage and retrieval system. In Workshop on Design Issues in Anonymity and Unobservability, pages 46-66, 2000. [3] Frank Dubek, M . Frans Kaashoek, David Karger, Robert Morris, and Ion Stoica. Wide-area cooperative storage with cfs. In Proceedings of the 18th Symposium on Operating Systems Principles, Operating Systems Review, pages 202-215, October 2001. [4] D. K . Gifford, R. M . Needham, and M . D. Schroeder. The Cedar file system. Communications of the ACM, 31(3):288-298, March 1988. [5] Gnutella, http://www.gnutella.com. [6] Bjorn Gronvall, Assar Westerlund, and Stephen Pink. The design of a multicast-based distributed file system. In Operating Systems Design and Implementation, pages 251-264, 1999. [7] M . Ji, E. W. Felten, R. Wang, and J. Pal Singh. Archipelago: A n islandbased file system for highly available and scalable internet services. In Proceedings of 4th USENIX Windows Systems Symposium, August 2000. [8] John Kubiatowicz, David Bindel, Yan Chen, Patrick Eaton, Dennis Geels, Ramakrishna Gummadi, Sean Rhea, Hakim Weatherspoon, Westly Weimer, Christopher Wells, and Ben Zhao. Oceanstore: A n architecture for global-scale persistent storage. In Proceedings of ACM ASPLOS. A C M , November 2000.  Bibliography  44  [9] E. K . Lee and C. A . Thekkath. Petal: Distributed virtual disks. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages .and Operating Systems (ASPLOS VII), Computer Architecture News, pages 84-93, October 1996. [10] B. Liskov, S. Ghemawat, R. Gruber, P. Johnson, L. Shrira, and M . Williams. Replication in the Harp file system. In Proceedings of 13th ACM Symposium on Operating Systems Principles, pages 226-38, October 1991. [11] J. H . Morris, M . Satyanarayanan, M . H . Conner, J. H . Howard, D. S. Rosenthal, and F. D. Smith. Andrew: A distributed personal computing environment. Communications of the ACM, 29(3):184-201, March 1986. [12] Napster, http://www.napster.com. [13] Karin Petersen, Mike J. Spreitzer, Douglas B. Terry, Marv in M . Theimer, and Alan J. Demers. Flexible update propagation for weakly consistent replication. In Sixteenth ACM Symposium on Operating Systems Principles, Saint Malo, France, 1997. [14] Peter L. Reiher, John S. Heidemann, David Ratner, Gregory Skinner, and Gerald J. Popek. Resolving file conflicts in the ficus file system. In USENIX Summer, pages 183-195, 1994. [15] Antony Rowstron and Peter Druschel. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In IFIP/ACM Middleware 2001, pages 329-350, November 2001. [16] Antony Rowstron and Peter Druschel. Storage management and caching in past, a large-scale, persistent peer-to-peer storage utility. In Proceedings of the 18th Symposium on Operating Systems Principles, Operating Systems Review, pages 188-201, October 2001. [17] D. S. Santry, M . J. Feeley, N . C. H . , A . C. Veitch, R. W. Carton, and J. Ofir. Deciding when to forget in the Elephant file system. In Proceedings of the 17th A CM Symposium on Operating Systems Principles, pages 110-123, December 1999.  Bibliography  45  [18] M . Satyanarayanan. Coda: A highly available file system for a distributed workstation environment. In Second IEEE Workshop on Workstation Operating Systems, September 1989. [19] C. A . Thekkath, T. Mann, and E. K . Lee. Frangipani: A scalable distributed file system. In Proceedings of 16th ACM Symposium on Operating Systems Principles, pages 224-237, October 1997. [20] Bruce Walker, Gerald Popek, Robert English, Charles Kline, and Greg Thiel. The LOCUS distributed operating system. In Proceedings of the 9th Symposium on Operating Systems Principles, Operating Systems Review, pages 49-69, October 1983. [21] R. Y . Wang and T. E. Anderson. xFS: A wide area mass storage file system. In Proceedings of the Fourth Workshop on Workstation Operating Systems, pages 71-78, October 1993.  


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items