@prefix vivo: <http://vivoweb.org/ontology/core#> .
@prefix edm: <http://www.europeana.eu/schemas/edm/> .
@prefix ns0: .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

vivo:departmentOrSchool "Science, Faculty of"@en, "Computer Science, Department of"@en ;
  edm:dataProvider "DSpace"@en ;
  ns0:degreeCampus "UBCV"@en ;
  dcterms:creator "Duska, Bradley M."@en ;
  dcterms:issued "2009-05-28T22:52:33Z"@en, "1998"@en ;
  vivo:relatedDegree "Master of Science - MSc"@en ;
  ns0:degreeGrantor "University of British Columbia"@en ;
  dcterms:description """Migration is a powerful technique in distributed systems providing many benefits. The granularity of migration ranges from the coarse-grained movement of whole processes to the fine-grained mobility of individual objects, which provides more flexibility and control. One of the costs of fine-grained mobility is an increase in the complexity of programming with respect to failures. Classic fault-tolerance techniques for distributed systems cannot be applied in systems with fine-grained object mobility due to the unacceptable overhead of applying these techniques to many small objects. We discuss a group service that allows programmers to apply classic distributed system fault-tolerance techniques to systems with fine-grained object mobility. This service enforces the condition that all objects in a group are either all available or all failed, and has been implemented in the Emerald language and runtime environment. Examples using the group service include a fault-tolerant name server and a fault-tolerant distributed system monitor."""@en ;
  edm:aggregatedCHO "https://circle.library.ubc.ca/rest/handle/2429/8394?expand=metadata"@en ;
  dcterms:extent "5666940 bytes"@en ;
  dc:format "application/pdf"@en ;
  skos:note """Enforcing Crash Failure Semantics in Distributed Systems with Fine-Grained Object Mobility

by Bradley M. Duska
B.Sc., University of Calgary, 1989

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Department of Computer Science)

We accept this thesis as conforming to the required standard

The University of British Columbia
August 1998
© Bradley M. Duska, 1998

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Computer Science
The University of British Columbia
Vancouver, Canada

class export operation create → [Group]
Create an empty group object.

3.2.2 Group Object

An object whose type is Group has the following interface:

Management:

operation addMember [o : Any] → [r : Boolean]
Adds o to this group. An object may be in at most one group. Returns true on success, false on failure. The operation will fail if the object is already a member of another group, if the object has type Group, or if this group has failed. (Note that immutable objects are never added to groups, as immutable objects are never unavailable. This is transparent to the programmer: the addMember operation will return true if the group has not failed.)
operation removeMember [o : Any] → [r : Boolean]
Removes o from this group. Returns true on success, false on failure. The operation will fail if the object is not a member of this group, or if the group has failed.

operation listMembers → [l : Vector.of[Any]]
Returns a vector listing all current members in this group. Returns nil if this group has failed.

operation listMemberNodes → [l : Vector.of[Node]]
Returns a vector listing all current nodes that have members of this group. Returns nil if this group has failed.

operation failed → [r : Boolean]
Returns true if the group has failed, false otherwise.

Notification:

operation addGListener [l : GListener] → [r : Boolean]
Adds l as a listener to be notified upon failure of this group, using a callback as defined by the GListener type. Returns true on success, false on failure. The operation will fail if this group has failed.

operation removeGListener [l : GListener] → [r : Boolean]
Removes l as a listener for this group. Returns true on success, false on failure. The operation will fail if l is not a listener for this group, or if this group has failed.

operation listGListeners → [l : Vector.of[GListener]]
Returns a vector listing all current listeners for this group. Returns nil if this group has failed.

GListener represents the type of objects that are informed via a callback when a group becomes unavailable. Such objects must conform to the type GListener, which is:

  immutable typeobject GListener
    operation groupUnavailable
  end GListener

operation addNListener [l : NListener] → [r : Boolean]
Adds l as a listener to be notified upon the start or failure of a node, using a callback as defined by the NListener type. Returns true on success, false on failure. The operation will fail if this group has failed.

operation removeNListener [l : NListener] → [r : Boolean]
Removes l as a node listener for this group. Returns true on success, false on failure. The operation will fail if l is not a node listener for this group, or if this group has failed.

operation listNListeners → [l : Vector.of[NListener]]
Returns a vector listing all current node listeners for this group. Returns nil if this group has failed.

NListener represents the type of objects that are informed via a callback upon the start or failure of a node. Such objects must conform to the type NListener, which is:

  immutable typeobject NListener
    operation nodeUpDown[n : Node, up : Boolean]
  end NListener

operation listListenerNodes → [l : Vector.of[Node]]
Returns a vector listing all current nodes where listeners may be notified. By default, this is all nodes that have had or currently do have group members, or that have invoked the Group object. Returns nil if this group has failed.

operation addListenerNode [n : Node] → [r : Boolean]
Adds n as a node where listeners may be notified. Returns true on success, false on failure. The operation will fail if this group has failed.

Location:

operation moveAll [n : Node] → [r : Boolean]
Applies the move to n operation to all members of this group. Returns true on success, false on failure. This operation will fail if the group has failed, or if some of the members of the group cannot be moved (for example, if one of the members is fixed at another location; in this case, as many members as possible will be moved).

operation fixAll [n : Node] → [r : Boolean]
Applies the fix at n operation to all members of this group. Returns true on success, false on failure. This operation will fail if the group has failed, or if some of the members of the group cannot be fixed (for example, if one of the members is fixed at another location; in this case, as many members as possible will be fixed).

operation refixAll [n : Node] → [r : Boolean]
Applies the refix at n operation to all members of this group. Returns true on success, false on failure. This operation will fail if the group has failed, or if some of the members of the group cannot be refixed (for example, if one of the members is not fixed; in this case, as many members as possible will be refixed).

operation unfixAll → [r : Boolean]
Applies the unfix operation to all members of this group. Returns true on success, false on failure. This operation will fail if the group has failed.
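To make the shape of this interface concrete for readers without Emerald, the following Python sketch mirrors the operations above. It is an illustrative analogue only, not the thesis implementation: the method names transliterate the Emerald operations, and the Node class is a placeholder.

  from abc import ABC, abstractmethod
  from typing import Any, List, Optional

  class Node:
      """Placeholder for an Emerald node descriptor."""

  class GListener(ABC):
      """Notified via callback when a group becomes unavailable."""
      @abstractmethod
      def group_unavailable(self) -> None: ...

  class NListener(ABC):
      """Notified via callback when a node starts or fails."""
      @abstractmethod
      def node_up_down(self, n: Node, up: bool) -> None: ...

  class Group(ABC):
      # Management: booleans report success; list operations return None
      # (Emerald nil) once the group has failed.
      @abstractmethod
      def add_member(self, o: Any) -> bool: ...
      @abstractmethod
      def remove_member(self, o: Any) -> bool: ...
      @abstractmethod
      def list_members(self) -> Optional[List[Any]]: ...
      @abstractmethod
      def list_member_nodes(self) -> Optional[List[Node]]: ...
      @abstractmethod
      def failed(self) -> bool: ...

      # Notification
      @abstractmethod
      def add_g_listener(self, l: GListener) -> bool: ...
      @abstractmethod
      def add_n_listener(self, l: NListener) -> bool: ...

      # Location: apply a move/fix/refix/unfix to every member at once.
      @abstractmethod
      def move_all(self, n: Node) -> bool: ...
      @abstractmethod
      def fix_all(self, n: Node) -> bool: ...
      @abstractmethod
      def unfix_all(self) -> bool: ...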
3.3 Implementation

The application's interface to the group service is through the immutable group object given above in Section 3.2. The group service is composed of two primary objects: the group manager (GManager) and the group representative (GRep). A GManager exists on each Emerald node where the group service is in use, and is responsible for creating, locating, and notifying all GReps that exist on a node. A GRep exists on each node for each group active on that node, and is responsible for maintaining group state information and performing operations on that group. Each group is uniquely identified by a group identifier (gid). Several other objects are used, including a GLocks object providing locking services and a GElection object providing for the election of a coordinator.

Most of the group service is implemented in Emerald, with a small number of changes to the Emerald runtime implemented in C. The changes to the Emerald runtime are: the addition of Group as a new builtin type; initialization of the group manager on node creation; upcalls to the group manager on node failures; upcalls to the group manager on moves of group members; a new call that allows the group object to locate the local GManager; and a new call that allows an Emerald program to mark an object as unavailable.

[Figure 3.1: Group Service. Four nodes A, B, C, and D, each running a GManager; GReps for gid 1 reside on nodes B, C, and D, and GReps for gid 2 on nodes B and D.]

Figure 3.1 shows an example of the group service with two active groups identified by gids 1 and 2. Gid 1 is active on nodes B, C, and D. Gid 2 is active on nodes B and D. No group is active on node A. A challenge in implementing the group service is to remain internally consistent in spite of failures in a multi-threaded environment.

3.3.1 Group Object

The immutable group object is returned from a Group.create operation. The object is created as immutable to guarantee that it is always available. The application may freely move references to the group object between nodes; because it is immutable, copies of the object are created on each node where it is referenced. Upon creation of the group object, the initially section attempts to find the local GManager for this node. If one is not found, one is created. Next, the GManager is invoked to create a GRep for this group on the local node. This GRep will exist on the node until the group fails. Finally, the initially section queries the newly created GRep for the gid and stores this in the group object. The group object is the application's interface to the group service. Upon invocation, the interface locates the appropriate GRep for the group and dispatches the call.
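A minimal Python sketch of this creation-and-dispatch path, assuming one GManager per node and a coordinator-issued gid counter (both stubbed here; all names other than GManager, GRep, and gid are illustrative, not thesis code):

  import itertools

  _gid_counter = itertools.count(1)   # stand-in for the coordinator's gid service

  class GRep:
      """Per-node, per-group representative holding group state."""
      def __init__(self, gid):
          self.gid = gid
          self.members = []

  class GManager:
      """Per-node manager that creates and locates the node's GReps."""
      def __init__(self):
          self.reps = {}

      def create_grep(self, existing=None):
          # reuse the gid of an existing group, or get a fresh one
          gid = existing.gid if existing is not None else next(_gid_counter)
          rep = self.reps[gid] = GRep(gid)
          return rep

      def find_grep(self, gid):
          return self.reps.get(gid)

  _local_gmanager = None              # one per node; module-level here for brevity

  def find_or_create_gmanager():
      global _local_gmanager
      if _local_gmanager is None:
          _local_gmanager = GManager()
      return _local_gmanager

  class GroupHandle:
      """Analogue of the immutable group object: holds only the gid."""
      def __init__(self):
          gm = find_or_create_gmanager()      # find or create the local GManager
          self.gid = gm.create_grep().gid     # create the first GRep, keep its gid

      def dispatch(self, op, *args):
          rep = find_or_create_gmanager().find_grep(self.gid)
          if rep is None:
              return None                     # no local GRep: search, or group failed
          return getattr(rep, op)(*args)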
As the group object may be invoked on different nodes, the group object must search for the appropriate GRep upon each invocation. When searching a node for a GRep, the group object first finds the GManager, if any, for that node. If a GManager exists on that node, the GManager is queried for the GRep by specifying the gid. If a GManager is not found, one is created. If a GRep is not found on the specified node, all nodes with active GManagers are searched. If no GRep is found on any of these nodes, the group is assumed to have failed. If a GRep is found, a GRep is created on the specified node, passing a reference to the found GRep.

For all operations, either a read or a write lock is acquired for the gid at the beginning of the call. The node where the invocation was performed is searched for a GRep for this gid. If one is found, the Group object invokes it to acquire the appropriate lock as described in Section 3.3.3. All operations except unfixAll, failed, and the various list operations update distributed GRep state and therefore require write locks. For example, the removeMember operation notifies all other GReps of the removed member and whether the local GRep has members. The unfixAll, failed, and the various list operations simply read local GRep state and therefore use read locks. For example, the listMembers operation returns a list of members from the local GRep.

For the addMember and removeMember operations, after acquiring the lock the node where the group member is located is searched for a GRep for this gid, and the call is dispatched to this GRep. For all other operations, after acquiring the lock the call is dispatched to the GRep on the node where the invocation was performed.

To improve performance on call dispatch, the group object caches a hint for the last known location of both an initialized GManager and a GRep for this group. Both hints tend towards the node where the group object is currently located. By caching these references, global searches can be avoided on most invocations, occurring only when the group object is invoked on some node where it has not been invoked before.
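Continuing the sketch above, the hint-caching lookup might look like the following; the rep_hint and failed attributes and the all_managers list are assumptions of the sketch, not thesis names.

  def locate_grep(handle, local_manager, all_managers):
      """Resolve the GRep for handle.gid: cached hint first, then the local
      GManager, then a global search; replicate locally if found remotely."""
      rep = getattr(handle, "rep_hint", None)
      if rep is not None and not getattr(rep, "failed", False):
          return rep                                  # fast path: hint still valid
      rep = local_manager.find_grep(handle.gid)
      if rep is None:
          for gm in all_managers:                     # slow path: global search
              found = gm.find_grep(handle.gid)
              if found is not None:
                  # create a local GRep initialized from the one we found
                  rep = local_manager.create_grep(existing=found)
                  break
      if rep is None:
          return None                                 # no GRep anywhere: group failed
      handle.rep_hint = rep                           # cache for later invocations
      return rep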
3.3.2 GManager Object

The GManager is responsible for creating, locating, and notifying all GReps that exist on a node. On node creation, the group service is initialized with a small monitored object that synchronizes startup access to the GManager. The monitored object can be queried for the GManager, with an option to create one. Once the GManager is created, access to each GManager is synchronized through local locks using the GLocks object (see Section 3.3.4).

Each GManager has a reference to all other GManagers that exist on the current nodes with group service activity. Each GManager also maintains a current coordinator GManager. The coordinator is responsible for generating unique gids when creating GReps, and is maintained using a GElection object (see Section 3.3.4). On creation, the GManager searches all nodes attempting to find an existing GManager. If one is found, the new GManager queries the found GManager for the current coordinator. Otherwise, the new GManager assumes that it is the coordinator. To improve performance on startup, the GManager create operation can be passed a currently existing GManager as a hint. This hint is tried first, and only if it is unavailable is a global search performed.

On the creation of a new GRep, the current GManager queries the coordinator for a new gid and creates a local GRep. If this is a local representative for an already existing group, the GManager creates a GRep, passing it a reference to an already existing GRep for the group. A reference to the new GRep is stored to perform future queries, and is removed if the group fails.

The GManager is notified of any node failures through an Emerald runtime upcall. The Emerald runtime detects failures of other nodes through TCP/IP connection failure. On a failure, the GManager first notifies the GElection object, if necessary choosing and initializing a new coordinator. The GManager then informs each GRep of the node failure.

Each GManager also maintains a hash table mapping local group members to GReps. The GReps are responsible for the addition and deletion of members in the hash table. The Emerald runtime detects the move of a group member object by checking a bit in the object's flags upon moves. If a group member is detected, the Emerald runtime performs an upcall to the GManager with the object being moved, the destination of the move, and the type of move (move, fix, or refix). The GManager uses the hash table to determine the GRep affected by the move, and dispatches the move to the GRep. Details of moving a group member, including handling failures during moves, are discussed in Section 3.3.3.

3.3.3 GRep Object

A GRep exists on each node for each group active on that node, with each group uniquely identified by a gid. As shown in Figure 3.1, the set of GReps for each group can be logically viewed as a single service supporting that group. The GRep is the most complex object in the group service: it enforces crash failure semantics for the group on node failures, maintains the state of the group, and implements the Group object interface. The group state each GRep maintains includes all members of this group, the number of members of this group on this node, all group listeners for this group, all node listeners for this group, all other GReps for this group, and whether the other GReps currently have local members. Each group also maintains a current coordinator GRep, maintained using a GElection object (see Section 3.3.4). The coordinator synchronizes access to the gid, and is responsible for notifying all GReps when the group is deemed to have failed. Dealing with the failure of the coordinator is discussed below.

On creation, a GRep is passed a reference to another GRep. If this reference is nil, the GRep is the first GRep for a new group. Otherwise, the GRep is an additional GRep for an already existing group. In this case, the GRep contacts the coordinator for the group and initializes group state from the coordinator. If the coordinator fails while the GRep is initializing, the initialization for the GRep fails.

When a member is added to the group using the addMember operation, the GRep first performs a test and set of a flag in the header of the new member. The flag will already be set only if the object is already a group member. If the object is not already a group member, the flag is set and the GRep informs all GReps for this gid of the new member, updating state information as necessary. Recall that the appropriate lock is acquired and freed by the Group object.
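The header-flag test-and-set can be modelled as below; ManagedObject and note_new_member are illustrative stand-ins for the Emerald object header and the inter-GRep notification, not thesis names.

  import threading

  class ManagedObject:
      """Stand-in for an Emerald object whose header carries a 'grouped' flag."""
      def __init__(self):
          self._grouped = False
          self._lock = threading.Lock()

      def test_and_set_grouped(self) -> bool:
          """Atomically set the flag; False means it was already set."""
          with self._lock:
              if self._grouped:
                  return False
              self._grouped = True
              return True

  def add_member(local_rep, peer_reps, obj: ManagedObject) -> bool:
      """Add obj to the group, refusing objects that already belong to a group."""
      if not obj.test_and_set_grouped():
          return False                      # already a member of some group
      local_rep.members.append(obj)
      for peer in peer_reps:                # inform every other GRep for this gid,
          # piggybacking whether this GRep now has local members
          peer.note_new_member(obj, bool(local_rep.members))
      return True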
The coordinator uses the GLocks object (see Section 3.3.4) to synchronize access to the group. Before performing any read or write operation on the group, the Group object acquires a read or write lock from the coordinator. The use of a distributed lock adds two additional considerations. First, the locking service must correctly handle the failure of a lock requester before the requester has acquired the lock. This is handled using the ping capability of the GLocks object: when a lock is about to be granted to a lock requester, the requester is first pinged to ensure the requester is still available. Second, the locking service must correctly handle the failure of a lock holder. This leads to the more general issue of failure handling in the GRep.

Each GRep is informed of a node failure by the local GManager. Upon receiving a failure notification, a GRep informs the GElection object. The GElection object determines if there is a GRep on the node that has failed. If there is, the GElection elects a new coordinator (if necessary) and returns the identity of the failed node. The coordinator uses the failed node to determine if the failed GRep had local members. If it did, the group has failed. The coordinator ensures that any locks held by the failed node are freed, and acquires a write lock for the gid. This write lock is given priority over all other invocations currently waiting for read or write locks by using the GLocks steal mechanism. If the group was not already concluded to have failed, the coordinator next performs a state consistency check among all surviving GReps. This checks that all GReps agree on the number of members, listeners, and nodes currently in the group, and that all local members of the group are correctly marked as such. If these values are inconsistent, the group has failed. If the coordinator concludes the group has failed, the coordinator informs all GReps to make all members of the group unavailable, notifies all group listeners of the failure of the group, and informs all GReps to remove their local GManager reference to the GRep. The coordinator is also responsible for notifying node listeners on the start or failure of a node. The notification on the failure of a node occurs after all other group failure handling has occurred.

Upon the move of a group member using Emerald keywords such as move or attach, the GManager informs the appropriate GRep for the member as described above. The GRep is responsible for acquiring a write lock for this gid and updating the state of both the source and destination GReps to correctly reflect the current location of the group member. Failures during the move are handled by having both the source and destination GReps include the member until the move successfully completes. The failure of the source or destination during the move will result in the group failing.

The moveAll, fixAll, and refixAll operations share the same set of invocations, differing only in the actual move, fix, or refix operation. The same operations are also used to implement the moves of individual members, and the same technique of having both the source and destination GReps include members until the move completes is used to handle failures during moves. To reduce message transmission, the addMember and removeMember operations piggyback whether the GRep has members with the new or removed member notification. Other notifications indicating whether a GRep has members occur only when the node changes from having no members to having some, or from having some to having none. All operations ignore failures during invocations of remote GReps, and perform the operation on as many GReps as possible. Any node failure is handled as described above.
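The coordinator's failure-handling path, sketched in Python; every method name here (note_failure, release_held_by, steal_write, and the GRep accessors) is an assumption standing in for the behaviour the text describes.

  def handle_node_failure(election, locks, reps, failed_node):
      """Coordinator-side handling of one node failure."""
      election.note_failure(failed_node)          # may elect a new coordinator
      failed_reps = [r for r in reps if r.node == failed_node]
      if not failed_reps:
          return                                  # no GRep on the failed node
      locks.release_held_by(failed_node)          # free locks held by the failed node
      locks.steal_write()                         # jump the write-lock queue (steal)
      group_failed = any(r.member_count > 0 for r in failed_reps)
      survivors = [r for r in reps if r.node != failed_node]
      if not group_failed:
          # state consistency check: all survivors must agree on the counts
          counts = {(r.member_count, r.listener_count, r.node_count)
                    for r in survivors}
          group_failed = len(counts) > 1
      if group_failed:
          for r in survivors:
              r.mark_members_unavailable()        # enforce crash failure semantics
              r.notify_group_listeners()
              r.detach_from_gmanager()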
3.3.4 Consistency and Data Structure Objects

Many other objects support the group service. The GLocks and GElection objects provide consistency services. Data structures are provided by the GList, GArray, and GHashTable objects.

The GLocks object supports a multiple-readers single-writer style lock. For handling failures of lock requesters, the GLocks object provides a ping mechanism. When a lock is about to be granted to a lock requester, the requester is first pinged to ensure the requester is still available. If the ping is successful, the lock is granted. Otherwise, the requester is deemed to have failed and the next lock request is serviced. The GLocks object also supports a write lock steal mechanism. This allows one write lock request to move to the front of the write lock queue. It is used by the group service to give failure notifications priority, while still maintaining the correct use of locks.

The GElection object supports elections among a group of nodes, with one GElection object per node. To operate correctly, each GElection object must be notified upon node failure, and a new GElection object joining a group of GElection objects must be given a reference to an existing GElection object in the group. Each GElection object contains a unique identifier and knows the identifier of all other GElection objects. Upon a failure of the coordinator, each GElection object independently chooses the highest identifier as the new coordinator.

The GList and GArray objects use parametric polymorphism to provide generic implementations of lists and arrays. The GList object uses type parameters to indicate the type of the list elements and the comparison type used to compare list entries (usually a specific object in the list element). The GArray object uses a type parameter to indicate the type of the array elements. Both the GList and the GArray provide "safe" add and remove operations: it is an error to add an element that is already in the list or array, and it is an error to remove an element that is not in the list or array. The GList uses a singly linked list, while the GArray uses the builtin array type.

The GHashTable provides a hash table mapping type Any to type Any. The hash table uses simple linear probing, increasing in size when it is over 80% full to avoid clustering. The GHashTable, like the GList and GArray objects, provides safe insert and remove operations.
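A single-threaded Python model of the GLocks grant policy (queueing, ping-before-grant, and the write-lock steal); a real implementation would add blocking and distribution, and the ping callable is an assumed hook, not thesis code.

  from collections import deque

  class GLocksModel:
      """Models the grant queue only: multiple readers, one writer,
      ping before granting, and a 'steal' that jumps the queue."""
      def __init__(self, ping):
          self.ping = ping          # callable: is this requester still alive?
          self.queue = deque()      # pending (requester, kind) requests
          self.readers = set()
          self.writer = None

      def request(self, who, kind, steal=False):
          if steal:
              self.queue.appendleft((who, kind))   # the write-lock steal
          else:
              self.queue.append((who, kind))
          self._grant()

      def release(self, who):
          self.readers.discard(who)
          if self.writer == who:
              self.writer = None
          self._grant()

      def _grant(self):
          while self.queue and self.writer is None:
              who, kind = self.queue[0]
              if not self.ping(who):               # requester died while queued
                  self.queue.popleft()
                  continue
              if kind == "read":
                  self.queue.popleft()
                  self.readers.add(who)            # readers may share the lock
              elif not self.readers:
                  self.queue.popleft()
                  self.writer = who                # writer waits for readers to drain
              else:
                  break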
3.4 Performance

Several tests evaluate the performance of the group service on fundamental operations and on the scaling of these operations to larger groups located on more nodes. For the purposes of these tests, a larger group is considered to be one with more members. These tests are generally run on from one to ten Emerald nodes, with the number of members ranging from 1 to 100. The tests are performed on five PCs running FreeBSD 2.2.x with 200 MHz Pentium processors, 64 MB of memory, and at least a 10 Mbps Ethernet connection.

[Figure 3.2: New GRep (time in milliseconds to create a new GRep versus number of nodes, for groups of 1, 5, 10, 25, 50, and 100 members).]

The results in Figure 3.2 show the time it takes in milliseconds to create a new GRep for a group on n nodes when a GRep for the group already exists on n - 1 nodes with from 1 to 100 members. For example, to create a GRep on a third node for a group with 25 members on two nodes takes approximately 300 milliseconds. Not shown on the graph is the 2 milliseconds it takes to create a new GRep on a single node. The smaller values for 10 nodes are an anomaly caused by reduced traffic on the network.

These results are as expected. The increase in time as the number of nodes increases is caused by the condition that each GRep is informed of every other GRep, and therefore all GReps must be contacted on the creation of a new GRep. The increase in time as the number of members increases is caused by each GRep knowing the entire group state: more information must be passed to the new GRep on creation as the number of members increases.

[Figure 3.3: Add Member (time in milliseconds to add a member to an empty group versus number of nodes).]

The results in Figure 3.3 show the time it takes in milliseconds to add a member to an empty group where the group exists on from 1 to 10 nodes. For example, to add a new member to an empty group on 6 nodes takes approximately 20 milliseconds. As discussed above, each GRep knows all group state, so the overhead increases as the number of nodes increases.

The results in Figure 3.4 show the time it takes in milliseconds to add a member to an existing group for a group on from 1 to 10 nodes with 1 to 100 current members. For example, to add a member to a group with 100 members on 4 nodes takes approximately 25 milliseconds.

[Figure 3.4: Add Member Scaling (time in milliseconds to add a member versus number of nodes, for groups of 1, 5, 10, 25, 50, and 100 members).]

The increases in time with both nodes and members are caused partially by the same reasons as discussed above. In addition, the use of a "safe" linked list to store the members causes additional overhead. Before every insert the list is searched for the item about to be inserted, to ensure that it is not inserted twice. The overhead caused by this search increases as the number of members (the length of the list) increases.

[Figure 3.5: Move One (time in milliseconds to move a group member versus a non-group member, by number of nodes).]

[Figure 3.6: Move All (time in milliseconds to move all members of a group, by number of nodes and members).]

Figure 3.5 and Figure 3.6 show the overhead imposed on moves of group members by the group service. Figure 3.5 shows the time it takes in milliseconds to move a member of a group using the Emerald move keyword where the group exists on from 1 to 10 nodes. Also shown for comparison is the time it takes to move an object that is not a member of a group. For example, to move a group member on 9 nodes takes approximately 50 milliseconds; to move a non-group member takes approximately 1 millisecond. Figure 3.6 shows the time it takes in milliseconds to move all members of a group using the group service moveAll operation where the group exists on from 1 to 10 nodes and has from 1 to 100 members. For example, to move all members of a group with 25 members on 2 nodes takes approximately 100 milliseconds. The increases in time with both nodes and members are caused by the same reasons as discussed above. Using the Emerald move keyword incurs additional overhead from the switches between Emerald and C code.

[Figure 3.7: Failure Message Activity (messages sent and received, by number of nodes, for the failure, no-failure, and normal cases).]

The results in Figure 3.7 show the number of messages transmitted to and from the node executing the following program. This program creates a group and ensures that every active node contains one member of the group.
  const failact <- object failact
    process
      const nl: ImmutableVector.of[NodeListElement] <- (locate self).getActiveNodes
      const g: Group <- Group.create
      var i: Integer
      for ( i <- 0 : i <= nl.upperBound : i <- i + 1 )
        const mObj <- mObjClass.create[i]
        move mObj to nl.getElement[i].getTheNode
        const b <- g.addMember[mObj]
      end for
    end process
  end failact

The number of messages sent and received is given in both the case of no failures and the case of a group failure, on from 2 to 10 nodes. For comparison, the number of messages sent and received in the case of no executing program is also shown. For example, for a group on 9 nodes in the case of a group failure, approximately 800 messages are sent and received. In the normal case approximately 20 messages are sent and received. Only slightly more messages are transmitted in the case of a group failure than in the case of no group failure. This shows that the bulk of the overhead of the group service occurs in the case of normal operation.

[Figure 3.8: Regular Message Activity (messages sent and received, by number of nodes, for groups with one and two members per node).]

The results in Figure 3.8 show the number of messages transmitted to and from the node executing the above program, and the additional message overhead when a second member is added to the group at every node (achieved by repeating the for loop in the above program). Only slightly more messages are transmitted to add the second member. This shows most messages in this case are from the initial set-up of the GManagers and the GReps.

The results in this section show that substantial room exists for performance improvements of the group service, particularly with respect to the scaling of the service. The two primary areas for improving performance are the replication of group state on every GRep, and the use of the linked list as a fundamental data structure. By maintaining group state on some fixed number of GReps, k-way fault tolerance can be provided with a substantial reduction in overhead, providing much better scaling of the group service as the number of nodes increases. Replacing the use of a linked list with, for example, a hash table would provide much better scaling of the group service as the number of members increases. Even if the linked list is maintained, its "safe" operations should be enabled only in a debug version.
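As a sketch of that suggestion, a linear-probing hash set with the same "safe" semantics and the 80% resize threshold described for GHashTable might look as follows. It is illustrative only, not thesis code.

  class LinearProbeSet:
      """Open-addressing hash set with 'safe' insert/remove: duplicates and
      missing keys are errors. Grows past 80% load to limit clustering."""
      _TOMB = object()   # tombstone marking a removed slot

      def __init__(self, capacity=8):
          self.slots = [None] * capacity
          # count is conservative: removals leave tombstones counted,
          # so growth triggers early and probes always terminate
          self.count = 0

      def _find(self, key):
          i = hash(key) % len(self.slots)
          while self.slots[i] is not None:
              if self.slots[i] is not self._TOMB and self.slots[i] == key:
                  return i
              i = (i + 1) % len(self.slots)
          return None

      def insert(self, key):
          if self._find(key) is not None:
              raise ValueError("already a member")        # 'safe' insert
          if (self.count + 1) / len(self.slots) > 0.8:    # resize at 80% full
              self._grow()
          i = hash(key) % len(self.slots)
          while self.slots[i] is not None and self.slots[i] is not self._TOMB:
              i = (i + 1) % len(self.slots)
          self.slots[i] = key
          self.count += 1

      def remove(self, key):
          i = self._find(key)
          if i is None:
              raise ValueError("not a member")            # 'safe' remove
          self.slots[i] = self._TOMB

      def _grow(self):
          live = [s for s in self.slots if s is not None and s is not self._TOMB]
          self.slots = [None] * (len(self.slots) * 2)
          self.count = 0
          for k in live:                                  # rehash live keys only
              i = hash(k) % len(self.slots)
              while self.slots[i] is not None:
                  i = (i + 1) % len(self.slots)
              self.slots[i] = k
              self.count += 1

Membership tests and removals then cost expected constant time instead of a scan whose cost grows with the member list.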
3.5 Limitations

Although the group service correctly handles network partitions when group members are split between partitions, the behaviour of the group service when a group listener is partitioned from the members of the group is not as clean. Ideally, a group listener is not notified unless the group fails. With a partition as described, a group listener could be notified that a group has failed when in fact the group is still active in another partition. A possible solution is to fail the group whenever a group listener node fails. This would avoid the situation described, but result in a two-way dependency between the group and the listener.

Emerald is a garbage-collected language, and as such objects are collected when they are not referenced. In order to maintain the standard Emerald semantics, the group service does not have a close or free operation for a group. This means that the GRep for a group will never be deleted unless it has failed. The GRep is a relatively small object, but for a long-running system with many groups this could result in a large amount of wasted memory. In addition, the GRep maintains references to the members and listeners of the group, so the members and listeners of the group will also never be garbage collected. A solution is for the user to ensure all members and listeners are removed from a group when it is no longer required. A better solution is for the garbage collector to collect the GReps when there are no more immutable group objects for that GRep (meaning no reference to the GRep outside of the group service can occur). However, this requires a tight coupling between the garbage collector and the group service.

The current implementation of the group service will correctly notice and synchronously handle the move of an Emerald object. However, the move of an object that has an attached group member results in the group service asynchronously moving the group member. This slightly changes the semantics of fix and refix, as the object will not in fact have been fixed or refixed when the call to fix or refix returns. More importantly, repeated moves of a group member via attach will result in an error in the group service. The current Emerald implementation provides no mechanism for a process to be blocked on an upcall. In order to correctly handle moves of group members via attach, the group service requires a blocking upcall mechanism from Emerald. Using this, the process requesting the move would not be resumed until the group service has properly handled the moves of the attached group member(s).

As implemented, the group service can only handle a single failure at a time. Multiple failures may result in a failure in the group service due to race conditions. This can be corrected by a more careful implementation of the node-down handling code. Because concurrent failures are rare, the extra complexity required for this was not felt to add any value to the implementation.

The group service currently assumes failures occur on the granularity of nodes. It should be relatively simple to extend the group service to notice the failures of objects on active nodes. This situation is typically the result of programming error, and may result in extra overhead in the Emerald runtime.

Chapter 4

Examples

4.1 Introduction

This chapter gives examples of using the group service to create fault-tolerant distributed applications in Emerald. The first example is a simple token-passing application, demonstrating how to use the group service. The second example is a reliable name server. The name server provides operations to bind, lookup, and unbind names to values. The third example is a distributed system monitor providing graphical information on currently available nodes, load on nodes, bandwidth and latency between nodes, and distributed garbage collection.

4.2 Token Passer

This example is a simple token-passing application, demonstrating how to use the group service. The code below shows the important sections of a non-fault-tolerant version of the token passer. In total, this program contains 229 lines of Emerald code (the bulk of these being the GList data structure). This program creates an object on every active node, and then moves itself as a token between the nodes, invoking the object on each node it visits. This could be used as the basis for a shared resource management program.
Note that this program is not fault tolerant: any failed node means the program will fail. Enhancing this program to be fault tolerant without using the group service would require unavailable handlers on the invocation of the list object, the list nodes, and the objects referenced from the list nodes. Other possible requirements would be timers to ensure the token is not lost, and a coordinator to ensure that multiple tokens are not introduced. In addition, the program is not informed of any new nodes that become active after it is started.

  const main <- object main
    initially
      const t <- tokenPasser.create
    end initially
  end main

  const tokenPasser <- class tokenPasser
    process
      const n: Node <- locate self
      const nl: ImmutableVector.of[NodeListElement] <- n.getActiveNodes
      const mObjList: GList.of[mObjNode, Node] <- GList.of[mObjNode, Node].create
      var i: Integer
      for ( i <- 0 : i <= nl.upperBound : i <- i + 1 )
        const mObj: mObjClass <- mObjClass.create[i]
        const mon: mObjNode <- mObjNode.create[mObj, nl.getElement[i].getTheNode]
        mObjList.add[mon]
        fix mObj at nl.getElement[i].getTheNode
      end for
      const mol: Vector.of[mObjNode] <- mObjList.list
      const delayTime: Time <- Time.create[1, 0]
      loop
        for ( i <- 0 : i <= mol.upperBound : i <- i + 1 )
          move self to mol.getElement[i].getNode
          mol.getElement[i].getMObj.run
          (locate self).delay[delayTime]
        end for
      end loop
    end process
  end tokenPasser

  const mObjClass <- class mObjClass [id: Integer]
    export operation run
      (locate self).getStdOut.putString["Greetings from " || id.asString || "\n"]
    end run
  end mObjClass

The code below shows the same token-passing application, but now enhanced for fault tolerance using the group service. In total, this program contains 240 lines of Emerald code. The additional 11 lines of code are marked with a %%. These lines create a group, add the objects in the token passer to the group, create a group listener object (tokenPasserL), and add this object as a group listener. Upon the failure of any node, the group service fails the current instance of the token-passing application, and the group listener starts a new instance. In addition, the program could easily be enhanced to be informed of newly active nodes using the node listener portion of the group service.
  const main <- object main
    initially
      const t <- tokenPasser.create
    end initially
  end main

  const tokenPasser <- class tokenPasser
    const g: Group <- Group.create                              %%
    const tokenPasserL <- immutable object tokenPasserL         %%
      export operation groupUnavailable                         %%
        const t <- tokenPasser.create                           %%
      end groupUnavailable                                      %%
    end tokenPasserL                                            %%
    process
      const n: Node <- locate self
      const nl: ImmutableVector.of[NodeListElement] <- n.getActiveNodes
      const mObjList: GList.of[mObjNode, Node] <- GList.of[mObjNode, Node].create
      var i: Integer
      var b: Boolean <- g.addMember[self]                       %%
      b <- g.addMember[mObjList]                                %%
      for ( i <- 0 : i <= nl.upperBound : i <- i + 1 )
        const mObj: mObjClass <- mObjClass.create[i]
        b <- g.addMember[mObj]                                  %%
        const mon: mObjNode <- mObjNode.create[mObj, nl.getElement[i].getTheNode]
        b <- g.addMember[mon]                                   %%
        mObjList.add[mon]
        fix mObj at nl.getElement[i].getTheNode
      end for
      b <- g.addGListener[tokenPasserL]                         %%
      const mol: Vector.of[mObjNode] <- mObjList.list
      const delayTime: Time <- Time.create[1, 0]
      loop
        for ( i <- 0 : i <= mol.upperBound : i <- i + 1 )
          move self to mol.getElement[i].getNode
          mol.getElement[i].getMObj.run
          (locate self).delay[delayTime]
        end for
      end loop
    end process
  end tokenPasser

  const mObjClass <- class mObjClass [id: Integer]
    export operation run
      (locate self).getStdOut.putString["Greetings from " || id.asString || "\n"]
    end run
  end mObjClass

4.3 Name Server

This example is a reliable name server. The name server provides operations to bind, lookup, and unbind names to values. The interface to the name server is given by an immutable object with these three operations. Names are given as Emerald strings, and values are any type of Emerald object. The name server is replicated on two nodes using a primary/backup model of replication. Upon the failure of one of the replica nodes, a new replica is created and moved to a functional node. New nodes joining the distributed environment can use the services of an already existing name server. The fault tolerance portion of the name server is implemented using the group service, primarily for notification and location. Using only one backup means the name server will fail if the nodes containing the primary replica and the backup replica fail simultaneously. The name server does not use permanent storage, and therefore will also fail if there is only one node in the distributed system and it fails.

4.3.1 Interface

NameServer has the following interface:

class export operation init → [ns : NameServer]
Joins an existing name server, if one exists on an active node, or creates a new one. Must be invoked on any node where the name server is used. Returns an object of type NameServer that is invoked to perform bind, lookup, and unbind operations.

operation bind [n : String, v : Any] → [r : Boolean]
Binds value v to name n. Returns true on success, false on failure.

operation lookup [n : String] → [v : Any]
Looks up name n and returns the associated value v, or nil if there is no associated value.

operation unbind [n : String] → [r : Boolean]
Unbinds the value from name n. Returns true on success, false on failure.

4.3.2 Implementation

The core of the name service is written in approximately 400 lines of Emerald code, excluding comments and blank lines. Using the group service provides a clean separation between failure handling code and regular operation, with approximately 50 lines of Emerald code providing failure handling.

The interface to the name service is given as an immutable object. This shields users of the service from failures within the service, taking advantage of the property that Emerald immutable objects are never unavailable. On each bind, lookup, or unbind operation, the immutable object attempts to perform the operation on the primary replica. If that fails, the immutable object attempts to perform the operation on the backup replica. If that also fails, the name server is considered to have failed.
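A sketch of this try-primary-then-backup dispatch in Python; the replica objects and the use of ConnectionError as the unavailability signal are assumptions of the sketch, not the thesis implementation.

  class NameServerHandle:
      """Immutable-style client handle: try the primary, then the backup."""
      def __init__(self, primary, backup):
          self.replicas = (primary, backup)

      def _call(self, op, *args):
          for replica in self.replicas:        # primary first, then backup
              try:
                  return getattr(replica, op)(*args)
              except ConnectionError:          # stand-in for Emerald 'unavailable'
                  continue
          raise RuntimeError("name server has failed")   # both replicas gone

      def bind(self, name, value):
          return self._call("bind", name, value)

      def lookup(self, name):
          return self._call("lookup", name)

      def unbind(self, name):
          return self._call("unbind", name)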
In order to locate the current primary and secondary replicas, the name server uses the Gaggle mechanism [32]. Both the primary and backup replicas create groups and place all of their objects into these groups. The primary becomes a listener for the backup group, and the backup becomes a listener for the primary group. In this (simplified) code fragment from the name server, the primary creates the backup, adds a listener object to the backup group, ensures the listener nodes include the node where the other replica resides, and moves the primary and backup to separate nodes. The creation operation for the backup includes the addition of a listener for the primary group.

  export operation buildOther
    const all <- (locate self).getActiveNodes
    % create backup
    other <- nameServerData.create[gagg, false, self, thisGroup]
    otherGroup <- other.getGroup
    % add listener to backup group
    otherGroup.addGListener[nameServerListener.create[self, gagg, (locate self).getName]]
    % ensure listener nodes include other node
    thisGroup.addListenerNode[all[1]$theNode]
    otherGroup.addListenerNode[all[0]$theNode]
    % move to different nodes
    thisGroup.fixAll[all[0]$theNode]
    otherGroup.fixAll[all[1]$theNode]
  end buildOther

The group service informs the primary on failure of the backup, and the backup on failure of the primary. Upon failure, the surviving replica becomes the primary replica and creates a new backup. In this code, nsd refers to the name server that created this listener. Notice that if nsd is unavailable, both replicas must have failed simultaneously.

  export operation groupUnavailable
    begin
      % if not primary, become primary
      if !nsd.getPrimary then
        nsd.setPrimary[true]
      end if
      % build other
      nsd.buildOther
      failure
        stdout.putString["Both replicas have failed\n"]
      end failure
    end
  end groupUnavailable

4.4 Distributed Monitor

The Distributed Monitor is an X Windows application providing graphical information about the distributed Emerald environment (Figure 4.1). The four components of the Distributed Monitor are: the Node Status Monitor, the CPU Utilization Monitor, the Link Load Monitor, and the Garbage Collection Monitor. The Node Status Monitor provides a list of all active and failed nodes. The CPU Utilization Monitor shows the CPU activity for each node. The Link Load Monitor provides latency and bandwidth information for links between nodes. The Garbage Collection Monitor provides information on garbage collection at each node.

[Figure 4.1: Distributed Monitor (main window with buttons for the Node Status, CPU Utilization, Link Load, and Garbage Collection Monitors).]

Each of the four monitors is logically split into two components: a view and an application. The view is responsible for displaying the data and responding to user input using X Windows. The application is responsible for creating the data and passing it to the view. The monitors use the group service to make the Distributed Monitor fault tolerant. Uses of the group service that are common to all the monitors are described here, while techniques specific to each monitor are described below. Each monitor typically creates two groups: one for the view and one for the application. The application is a listener for the view group, and the view is a listener for the application group. Upon notification of the failure of the application, the view creates another instance of the application. Upon notification of the failure of the view, the application exits.
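This paired view/application arrangement can be sketched as follows; add_g_listener and the two callbacks are illustrative names following the Group interface sketched earlier, not thesis code.

  class PartnerListener:
      """Group listener (mirrors the GListener callback) that runs a
      callback when the watched group fails."""
      def __init__(self, on_failure):
          self.on_failure = on_failure

      def group_unavailable(self):
          self.on_failure()

  def wire_monitor(view_group, app_group, create_app, exit_app):
      # The view listens on the application's group and restarts it on failure;
      # the application listens on the view's group and exits when the view dies.
      app_group.add_g_listener(PartnerListener(create_app))
      view_group.add_g_listener(PartnerListener(exit_app))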
Uses of the group service that are common to all the monitors are described here, while techniques specific to each monitor are described below. Each monitor typically creates two groups: one for the view and one for the application. The application is a listener for the view group, and the view is a listener for the application group. Upon notification of the failure of the application, the view creates another instance of the application. Upon notification of the failure of the view, the application exits. The application portion of each monitor uses the node listener service to maintain the currently active nodes. This information is used to determine which nodes data should be collected from, and is passed to the view for display purposes. 4.4.1 Node Status Monitor Figure 4.2 shows the Node Status Monitor. Node blackcomb.cs.ubc.ca: 17586 is active, and node mako.cs.ubc.ca: 17586 has failed. Nodes are uniquely identified by a name and port number pair. When mako.cs .ubc.ca: 17586 becomes active again it will be placed in the live nodes column. A new instance of the Node Status Monitor will show all active hosts, and any hosts that fail after the instance is created. The Node Status Monitor is the simplest of the four monitors, using only the node listener service and residing on one node with no mobility. This simple behaviour means that the view and the application fail together if the node they are on fails, and therefore no groups are required. The node listener service is used by 67 —j Node Status Monitor 1 J1 J Live Modes . Deadf&Kfea tte&m&&sfaml?$3& speS#its.c3.tfcc.ca.i ? ^ | mamwg«.ybC'ca.!7S^ rf^o«:iafae.ca;17586 speb^ita c*>.«fac ca:! 7842 m^tt5£sabcx*:1:7642 : • _£LJ Figure 4.2: Node Status Monitor creating an empty group (a group with no members), and adding a node listener. 4.4.2 CPU Utilization Monitor Figure 4.3 shows the CPU Utilization Monitor. Node manning.cs.ubc .ca: 17586 is busy 59% of the time, mako.cs.ubc.ca: 17842 is busy 15% of the time. The utilization for each CPU is updated at a regular interval, currently set at one second. Any failed host is removed from the monitor, and any newly active node is added. The application is composed of a base, residing on the same node as the view, and CPU monitors, one residing on each node. Groups are created for the base and for each of the CPU monitors. The base creates listeners for each of the CPU monitor groups, and the CPU monitors are listeners for the base group. Upon notification from the group service that the base has failed a CPU monitor exits. Upon notification from the group service that a CPU monitor has failed, the listener informs the view that a node has failed. Upon notification from the group service that a node is up, the base creates a new CPU monitor, and creates a listener for the CPU monitor group as shown in the following (simplified) code fragment. 68 CPU Utilization Monitor 34 Figure 4.3: CPU Utilization Monitor export opera t ion nodeUp[n: Node] const gcpum: gcpuMonitor <- gcpuMonitor.create[thisGroup, x, n] gcpuMonitors.add[gcpum] const name: S t r ing <-n.getName | | \" : \" | | (n.getLNN/OxlOOOO).asString const r : Boolean <-gcpum.getGroup.addGListener[gcpuMListener.create[name]] end nodeUp Upon creation each CPU monitor creates a group, adds all objects it contains to the group, and fixes the group at the node passed as a constructor parameter. The CPU monitor then enters a loop measuring CPU activity. The measurement is performed through a call to C code and is platform dependent. 
4.4.3 Link Load Monitor

Figure 4.4 shows the Link Load Monitor. The latency and bandwidth information is displayed in a table format, the rows indicating the from nodes and the columns indicating the to nodes. For example, the latency from manning.cs.ubc.ca:17586 to blackcomb.cs.ubc.ca:17586 is 944 microseconds, with a bandwidth of 29 Mbps. On the diagonal of the table the to and from nodes are the same and the measurements show theoretical extremes.

[Figure 4.4: Link Load Monitor (a table of pairwise latency and bandwidth among blackcomb.cs.ubc.ca:17586, manning.cs.ubc.ca:17586, and mako.cs.ubc.ca:17586; diagonal entries show theoretical extremes such as 48 us, 7281 Mbps).]

The two sliders are used to highlight links with unacceptably high latencies or unacceptably low bandwidths. Links with unacceptable latencies are highlighted in one colour (magenta), links with unacceptable bandwidths are highlighted in another colour (blue), and links with both unacceptable latency and bandwidth are highlighted in a third colour (red). The application component of the Link Load Monitor is structured as a single group roaming between nodes. The group first moves all members to a single node, and then moves a single group member that performs the measurement to each of the other nodes in turn. The group service ensures that if a node containing part of the application fails, the rest of the application also fails. In this case the view will recreate the application as described above.

4.4.4 Garbage Collection Monitor

Figure 4.5 shows the Garbage Collection Monitor. The Garbage Collection Monitor is intended primarily as an empirical tool to help understand the characteristics of distributed garbage (garbage that cannot be collected through a local collection). The garbage collection monitor measures the percentage of the old generation space that is free, the number of objects that could be remotely referenced on each node (objects with OIDs), and the number of references this node holds for objects on other nodes (stubs). For example, manning.cs.ubc.ca:17586 has 62% of the old generation space available, 4622 objects on the node could be remotely referenced, and 103 stubs exist for objects on other nodes.

[Figure 4.5: Garbage Collection Monitor (per-node free old-generation space, objects with OIDs, and stub counts).]

The application component of the Garbage Collection Monitor is structured as a single group roaming between nodes, measuring garbage collection information for each node when the application is resident on that node.

Chapter 5

Related Work

SOS [57] [58] is an object-oriented distributed operating system that provides object persistence and migration, built using C++ with extensions. A typical SOS object is on the order of 100 or more bytes, an object being some arbitrary collection of data with code attached. Distributed services are built using the concept of a fragmented object. In order to use a distributed service, a client must acquire a local proxy of the fragmented object. Migration is used to acquire a proxy for a fragmented object. In order to maintain consistency within groups of objects in spite of unexpected events such as failures, SOS uses the concept of dependencies.
This allows certain objects to be declared dependent on other objects, allowing the dependent object to be informed of significant changes in the state of the depended-on object. These significant changes are both system-detectable and user-defined events. This is similar to the listener mechanism in the group service, although it is unclear if the dependency mechanism was ever implemented.

COOL [2] is the Chorus Object Oriented Layer, an object-oriented environment built on top of the Chorus microkernel supporting object migration and persistence. It is meant to address the mismatch between the services and abstractions offered by the operating system and those offered by the language environment, giving programmers single address space semantics in a distributed environment. COOL is composed of three layers: the base, the generic runtime, and the language-specific runtime. The base provides an image of a single microkernel built over several distributed Chorus microkernels, and provides memory services, message passing, distributed shared memory, and persistent storage. The base supports three kinds of migration: migration inside a context group (a group of related memory regions), between context groups, and from disk into a context. The generic runtime implements the COOL notion of objects: a combination of state and code. COOL objects are grouped in modules, which are application-defined clusters of related objects. A cluster is the basic unit of distribution, and provides an abstraction of distributed shared virtual memory. COOL uses the notion of activities for a single thread of control, while a job is a collection of (possibly) distributed and related activities. A job is usually a single distributed application. COOL uses upcalls into the language-specific runtime in order to specialize the generic runtime. Upcalls are used for pointer swizzling upon migration and also for invoking proxies on RPC-like calls. The language-specific runtime is essentially a pre-processor that generates the appropriate calls to the generic runtime. For example, COOL++ is an extended version of C++ that is pre-processed into standard C++. Programmers must define objects using an Interface Definition Language in order to support persistence and migration. COOL is not built to tolerate failures. For example, if an object is mapped in memory at one machine and physically stored at another, the failure of the machine storing the object will cause unpredictable behaviour.

BirliX [45] provides an object migration mechanism that is adaptable by separating policy and mechanism. In BirliX, objects have both functional and non-functional properties, where functional properties are ones defined by the type of the object, and non-functional properties are the object's infrastructure, the runtime and inherited primary type. Each instance exists in an object management system. Migration of an instance involves the suspension of the instance, the generation of a checkpoint on secondary store, the transfer of the checkpoint to the destination object management system, the regeneration of the state from the checkpoint, and the resumption of activities. BirliX instances are persistent as long as at least one reference to them or at least one thread in them exists. BirliX instances are implemented by a structure called a team, essentially an instance and the required storage and computing resources (such as segments) for that instance.
Migration of an object is the migration of the team, making object migration in BirliX a relatively heavyweight mechanism. Each object has a default migration mechanism inherited from the primary type. This can be changed by attaching a meta object specifying a different migration mechanism. Type-specific migration mechanisms can also be applied. An example is given of a meta implementation of migration that provides for more reliable object migration, at the cost of extra overhead. Applications can be made fault-tolerant in BirliX by checkpointing the state of an instance in persistent store. This is only feasible in situations where the frequency and number of checkpoints does not produce unacceptable run-time overhead. A system with fine-grained object mobility would result in many checkpoints, possibly occurring with great frequency.

Shadows is a distributed object system influenced by the work of Arjuna. The first version of Shadows [11] takes the approach of providing a minimal set of services for a distributed object system, with other services being built on top of these. The basic services are object servers, location-transparent invocation, and object migration. Object servers provide clients access to objects running on that node, with each server managing multiple objects of a single class. Location-transparent invocation is achieved by using a local version of the object, known as a shadow, that essentially acts as a stub or proxy. Object migration is achieved by object servers having the functionality to send and receive object state from other object servers or clients. Only object state is migrated, and objects must provide the methods to marshal and unmarshal their state. In addition, only quiescent objects may be migrated. These basic properties are given as a set of C++ classes and used to create Shareable, Durable, Recoverable, and Lockable classes, which may be flexibly used in any combination through multiple inheritance. The first version of Shadows was implemented on a network of transputers running the Helios operating system.

The second version of Shadows [12] was implemented on top of a UNIX system and follows the same basic architecture, with some enhancements. The facilities provided in that case are for object naming, location, invocation, persistence, and garbage collection. In addition, node failures and network partitions are tolerated. In the second version of Shadows, the object server is replaced by the object manager, and the proxy by references. References use stub-scion pair (SSP) chains. Fault tolerance is added to the reference chains. Names are context dependent, with a name defined by an object manager name combined with a local name. Multiple names are achieved by the use of multiple references acting as aliases. As with BirliX, objects can be made fault-tolerant through the use of checkpoints.

FT-SR [55] provides fault-tolerance mechanisms in the SR distributed programming language. FT-SR attempts to provide a balance between a low-level language with no support for fault tolerance or distributed programming, such as C, and a high-level language with a very specific fault-tolerance and distributed programming model, such as Argus. This is achieved by providing general mechanisms that can be used to build specific fault-tolerant policies. The mechanisms include replication, recovery, and failure notification. Failure notification can be either synchronous or asynchronous.
The second version of Shadows [12] was implemented on top of UNIX and follows the same basic architecture, with some enhancements. The facilities provided in that case are object naming, location, invocation, persistence, and garbage collection. In addition, node failures and network partitions are tolerated. In the second version of Shadows, the object server is replaced by the object manager, and the proxy by references. References use stub-scion pair (SSP) chains, and fault tolerance is added to the reference chains. Names are context-dependent, with a name defined by an object manager name combined with a local name; multiple names are achieved through multiple references acting as aliases. As with BirliX, objects can be made fault-tolerant through the use of checkpoints.

FT-SR [55] provides fault-tolerance mechanisms in the SR distributed programming language. FT-SR attempts to strike a balance between a low-level language with no support for fault tolerance or distributed programming, such as C, and a high-level language with a very specific fault-tolerance and distributed-programming model, such as Argus. It does so by providing general mechanisms that can be used to build specific fault-tolerance policies. The mechanisms include replication, recovery, and failure notification. Failure notification can be either synchronous or asynchronous: synchronous notification is synchronous with respect to a call, while asynchronous notification occurs when a resource is monitored and an operation is invoked if the resource fails. FT-SR assumes fail-stop modules with fail-silent failure semantics. Fail-stop modules are modules whose operations either execute to completion or fail; fail-silent semantics means that a failure results in a complete cessation of execution. Fail-stop modules provide stronger failure semantics than group service groups, in that groups may fail in the middle of an execution; to provide fail-stop modules with the group service, a transaction layer built using the group service would need to be introduced. The assumption of fail-silent semantics does not hold in distributed systems with fine-grained object mobility, where portions of a server may fail, which is exactly the environment the group service addresses. Synchronous notification is analogous to the Emerald unavailable mechanism, and asynchronous notification is analogous to the group service notification mechanism; the asynchronous style is sketched below. The FT-SR mechanism is stylistically different in that the FT-SR procedures to be notified can be called with various arguments and are nested in the code using an exception-like mechanism. The FT-SR notification mechanism is also tightly integrated with replication: the failure of one replica of a replicated resource does not cause notification unless it is the failure of the last replica.
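For comparison, here is a hedged Java sketch of the asynchronous style: a resource is monitored, and registered operations are invoked if it fails. This parallels the listener callbacks of the group service, but all names here are invented, and neither FT-SR nor the group service is a Java system.

    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;

    // The operation to invoke upon failure, in the spirit of the group
    // service's listener callbacks.
    interface FailureListener {
        void failed(String resource);
    }

    final class ResourceMonitor {
        private final List<FailureListener> listeners = new CopyOnWriteArrayList<>();

        void addListener(FailureListener l) { listeners.add(l); }

        // Invoked by a failure detector once the monitored resource is
        // declared failed; every registered listener is called back.
        void reportFailure(String resource) {
            for (FailureListener l : listeners) l.failed(resource);
        }
    }

Synchronous notification, by contrast, would surface at the call site itself, in the way Emerald's unavailable mechanism does.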
Obliq [10] is a language that uses distributed lexical (static) scoping to support distributed object-oriented programming. This ensures that computations can roam over the network while maintaining their network connections. For example, when a procedure is accepted for execution at a compute server, the variables in the procedure keep the bindings determined by lexical scoping at the client; moving a variable to another site results in a network reference. Objects are the basic structure in an Obliq program; there are neither classes nor inheritance. Object migration is achieved by cloning the object's state at the destination site and leaving an alias to the clone at the source site. Obliq provides exceptions upon communication failures that can be trapped; a failure may mean that a machine has crashed or that a process has terminated. No automatic recovery is provided.

Rover [33] provides a toolkit that isolates mobile application developers from the limitations of mobile communication systems, supporting both client and communication failures. An extension to the Rover toolkit [34] adds support for server failures as well, through stable logging of messages, persistent variables, failure recovery procedures, and detection and restart of failed servers. Failures are assumed to be transient and recoverable, such as a power failure or a bug in a rarely used code path. The Rover toolkit is intended for a client/server model of distributed application, while the group service operates in the more general environment of fully distributed applications with object mobility. Conceptually, the group service is intended for a lower layer of abstraction than the Rover toolkit: a toolkit like Rover that assumes crash failure semantics could be built on top of the group service. Detection and restart of failed servers could be implemented using the notification component of the group service, with the recovery procedures invoked as part of the restart. Stable logging of messages and persistent variables would need to be created independently of the group service, perhaps using a transaction or replication mechanism.

Distributed Oz [52], a concurrent object-oriented language, provides mobility of objects in a network of systems. Objects are fine-grained, language-level entities, and may be active or passive. The main contribution of Distributed Oz is that the implementation of objects is network transparent: computations behave correctly independently of how they are partitioned between sites. This is achieved by avoiding forwarding chains through intermediate sites for mobile objects, removing the problem of a mobile object leaving behind a trail of surrogate objects that forward messages to it. The failure model of Distributed Oz is not defined as of [52]; there is only a discussion of failures appearing as exceptions. Finding a high-level abstraction for fault tolerance that separates an application's functionality from its fault-tolerant behaviour is noted as future work. The group service provides exactly this separation by giving applications a well-understood, essentially network-transparent failure model.

Chapter 6

Future Work and Conclusions

6.1 Future Work

Future work can be divided into two broad categories. The first is improving and enhancing the current implementation; this category addresses the issues identified in Sections 3.4 and 3.5 (Performance and Limitations). The second is applying and extending the concepts of the group service to other areas.

The group service provides crash failure semantics. Section 4.3 investigates the use of replication with the group service. Further work is needed to investigate the structuring and use of other distributed-systems fault-tolerance techniques with the group service. Other possibilities include distributed transactions, checkpoint/logging, and process groups. Each of these techniques may impose specific requirements on the group service; for example, checkpoint/logging may require that a checkpoint be taken prior to group failure when possible.

With the current interest in the Java programming language and the likelihood that it will become a de facto standard for programming distributed systems, it is also important to investigate the application of grouping to that environment. Java as it currently exists does not have process or object mobility; however, the Java environment is well suited to adding this capability. In addition, grouping may be of some use even without object mobility, for example when a logical component is located on several different physical hosts. A speculative sketch of what such an interface might look like follows.
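The sketch below is purely illustrative: no such Java library exists, and the thesis's implementation is in Emerald. It carries the flavour of the group service interface (membership, failure status, failure notification) into Java under the all-available-or-all-failed condition.

    import java.util.List;
    import java.util.Set;
    import java.util.concurrent.CopyOnWriteArrayList;
    import java.util.concurrent.CopyOnWriteArraySet;

    final class Group {
        private final Set<Object> members = new CopyOnWriteArraySet<>();
        private final List<Runnable> failureListeners = new CopyOnWriteArrayList<>();
        private volatile boolean failed = false;

        // Membership operations fail (return false) once the group has failed.
        boolean addMember(Object o)    { return !failed && members.add(o); }
        boolean removeMember(Object o) { return !failed && members.remove(o); }
        boolean failed()               { return failed; }

        boolean addListener(Runnable onFailure) {
            return !failed && failureListeners.add(onFailure);
        }

        // Enforce all-or-nothing availability: when any member becomes
        // unavailable, the whole group is declared failed and listeners run.
        void memberUnavailable(Object o) {
            if (!failed && members.contains(o)) {
                failed = true;
                members.clear();
                for (Runnable r : failureListeners) r.run();
            }
        }
    }

Without object mobility there is nothing for a location component to move, so this sketch covers only the reliability and notification services.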
One of the strongest justifications for object mobility comes from the area of dynamic load balancing: the ability to migrate objects to balance CPU and other resource utilization across a group of systems. Further investigation into the interaction between dynamic load balancing and the group service is required. For example, what effect does the overhead of the group service have on load-balancing policies?

6.2 Conclusions

Implementing fault tolerance in distributed systems with fine-grained object mobility imposes some unique constraints. In particular, it is not generally possible for software components to assume crash failure semantics in these systems. Because the objects in a component may reside at, and move among, multiple hosts in the network, at best arbitrary failure semantics can be assumed. Unfortunately, almost all classic techniques for fault tolerance in distributed systems assume crash failure semantics.

We have discussed a technique called grouping that collects the objects of a component into a logical group with the condition that either all objects in the group are available or all have failed. This allows the assumption of crash failure semantics and the application of classic distributed-systems fault-tolerance techniques, such as replication and distributed transactions, in spite of the fact that a component may reside at and move among multiple nodes. The technique has been implemented in the Emerald programming language and environment as a group service. The group service provides three fundamental services: reliability, notification, and location. Reliability enforces the condition that the objects in a group are all available or all failed. Notification allows other objects to be informed of the failure of a group. Location enables convenient movement of all members of a group.

Two example applications built with this service, a name server and a distributed system monitor, show the applicability of the technique for producing practical fault-tolerant applications. The name server uses a primary/backup model of replication, sketched below. The distributed system monitor is composed of four monitor applications: an active/failed node monitor, a CPU load monitor, a latency/bandwidth monitor, and a garbage collection monitor. All of these applications use the group service to achieve fault tolerance.
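As a final illustration, here is an invented Java sketch of the primary/backup pattern the name server follows; it is not the thesis's Emerald implementation. The key point is that because a group fails as a unit, the backup can treat a group-failure notification as evidence that the entire primary is gone and promote itself safely.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    final class NameServerReplica {
        private final Map<String, String> bindings = new ConcurrentHashMap<>();
        private volatile boolean primary;

        NameServerReplica(boolean primary) { this.primary = primary; }

        void bind(String name, String objectRef) {
            if (!primary) throw new IllegalStateException("not the primary");
            bindings.put(name, objectRef);
            // A full implementation would also forward each update to the
            // backup so that its copy of the bindings stays current.
        }

        String lookup(String name) { return bindings.get(name); }

        // Wired to the failure notification of the primary's group: the
        // all-available-or-all-failed condition rules out a half-alive
        // primary, so promotion is safe.
        void primaryGroupFailed() { primary = true; }
    }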
Bibliography

[1] A. El Abbadi, D. Skeen, and F. Cristian. An efficient fault-tolerant protocol for replicated data management. In 4th Annual ACM SIGACT/SIGMOD Symposium on Principles of Database Systems, 1985.

[2] P. Amaral, C. Jacquemot, P. Jensen, R. Lea, and A. Mirowski. Transparent object migration in COOL-2. In Proceedings of the Workshop on Dynamic Object Placement and Load-Balancing in Parallel and Distributed Systems, ECOOP'92, 1992.

[3] J. Bartlett. A NonStop kernel. In 8th ACM Symposium on Operating System Principles, 1981.

[4] P. A. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency Control and Recovery in Database Systems. Addison-Wesley, 1987.

[5] P. A. Bernstein, D. W. Shipman, and J. B. Rothnie. Concurrency control in a system for distributed databases (SDD-1). ACM Transactions on Database Systems, 5(1), 1980.

[6] K. P. Birman. The process group approach to reliable distributed computing. Communications of the ACM, 36(12), 1993.

[7] K. P. Birman and T. A. Joseph. Reliable communication in the presence of failures. ACM Transactions on Computer Systems, 5(1), 1987.

[8] A. P. Black. Supporting distributed applications: Experience with Eden. In 10th ACM Symposium on Operating System Principles, 1985.

[9] T. C. Bressoud and F. B. Schneider. Hypervisor-based fault-tolerance. ACM Transactions on Computer Systems, 14(1), 1996.

[10] L. Cardelli. A language with distributed scope. Computing Systems, 8(1), 1995.

[11] S. J. Caughey, G. D. Parrington, and S. K. Shrivastava. Shadows - a flexible support system for objects in distributed systems. In Proceedings of the 3rd IEEE International Workshop on Object Orientation in Operating Systems (IWOOOS'93), 1993.

[12] S. J. Caughey and S. K. Shrivastava. Architectural support for mobile objects in large scale distributed systems. In Proceedings of the 5th IEEE International Workshop on Object Orientation in Operating Systems (IWOOOS'95), 1995.

[13] S. Ceri and G. Pelagatti. On the use of optimistic methods for concurrency control in distributed databases. In Proceedings of the 6th Berkeley Workshop on Distributed Data Management and Computer Networks, 1982.

[14] S. Ceri and G. Pelagatti. Distributed Databases: Principles and Systems. McGraw-Hill, 1985.

[15] E. G. Chang and R. Roberts. An improved algorithm for decentralized extrema-finding in circular configurations of processes. Communications of the ACM, 22(5), 1979.

[16] P. M. Chen, E. K. Lee, G. A. Gibson, R. H. Katz, and D. A. Patterson. RAID: High-performance, reliable secondary storage. ACM Computing Surveys, 26(2), 1994.

[17] P. M. Chen, W. T. Ng, S. Chandra, C. M. Aycock, G. Rajamani, and D. Lowell. The Rio file cache: Surviving operating system crashes. In Proceedings of the 1996 International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 1996.

[18] A. Choudhary, W. Kohler, J. Stankovic, and D. Towsley. A modified priority based probe algorithm for distributed deadlock detection and resolution. IEEE Transactions on Software Engineering, 15(1), 1989.

[19] G. Coulouris, J. Dollimore, and T. Kindberg. Distributed Systems: Concepts and Design. Addison-Wesley, second edition, 1994.

[20] F. Cristian. Understanding fault-tolerant distributed systems. Communications of the ACM, 34(2), 1991.

[21] S. B. Davidson. Optimism and consistency in partitioned database systems. ACM Transactions on Database Systems, 9(3), 1984.

[22] S. B. Davidson, H. Garcia-Molina, and D. Skeen. Consistency in partitioned networks. ACM Computing Surveys, 17(3), 1985.

[23] A. Fekete, N. Lynch, and A. Shvartsman. Specifying and using a partitionable group communication service. In 16th Annual ACM Symposium on Principles of Distributed Computing, 1997.

[24] M. J. Franklin, M. J. Carey, and M. Livny. Transactional client-server cache consistency: Alternatives and performance. ACM Transactions on Database Systems, 22(3), 1997.

[25] D. K. Gifford. Violet: An experimental decentralized system. ACM Operating Systems Review, 13(5), 1979.

[26] D. K. Gifford. Weighted voting for replicated data. In 7th ACM Symposium on Operating System Principles, 1979.

[27] J. Gray. Notes on data base operating systems. In R. Bayer, R. M. Graham, and G. Seegmüller, editors, Operating Systems: An Advanced Course, volume 60 of Lecture Notes in Computer Science. Springer-Verlag, 1978.

[28] Object Management Group. The Common Object Request Broker: Architecture and Specification, Revision 2.0, 1995.

[29] T. Härder. Observations on optimistic concurrency control. Information Systems, 9(2), 1984.

[30] T. Härder and A. Reuter. Principles of transaction-oriented database recovery. ACM Computing Surveys, 15(4), 1983.

[31] M. Herlihy. A quorum-consensus replication method for abstract data types. ACM Transactions on Computer Systems, 4(1), 1986.

[32] N. Hutchinson and M. Dumont. Building replication and transactions into Emerald. Personal copy, 1997.

[33] A. D. Joseph, A. F. deLespinasse, J. A. Tauber, D. K. Gifford, and M. F. Kaashoek. Rover: A toolkit for mobile information access. In 15th ACM Symposium on Operating System Principles, 1995.

[34] A. D. Joseph and M. F. Kaashoek. Building reliable mobile-aware applications using the Rover toolkit. In Proceedings of the 2nd ACM International Conference on Mobile Computing and Networking, 1996.

[35] E. Jul, N. Hutchinson, and A. Black. Fine-grained mobility in the Emerald system. ACM Transactions on Computer Systems, 6(1), 1988.

[36] F. Kaashoek, A. Tanenbaum, S. Flynn Hummel, and H. Bal. An efficient reliable broadcast protocol. ACM Operating Systems Review, 23(4), 1989.

[37] R. Koo and S. Toueg. Checkpointing and rollback recovery for distributed systems. IEEE Transactions on Software Engineering, 13(1), 1987.
[38] N. Kronenberg, H. Levy, and W. Strecker. VAXclusters: A closely-coupled distributed system. ACM Transactions on Computer Systems, 4(2), 1986.

[39] H. T. Kung and J. T. Robinson. On optimistic methods for concurrency control. ACM Transactions on Database Systems, 6(2), 1981.

[40] R. Ladin, B. Liskov, L. Shrira, and S. Ghemawat. Providing high availability using lazy replication. ACM Transactions on Computer Systems, 10(4), 1992.

[41] L. Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7), 1978.

[42] B. Liskov. Distributed programming in Argus. Communications of the ACM, 31(3), 1988.

[43] B. Liskov, S. Ghemawat, R. Gruber, P. Johnson, and L. Shrira. Replication in the Harp file system. In 13th ACM Symposium on Operating System Principles, 1991.

[44] D. E. Lowell and P. M. Chen. Free transactions with Rio Vista. In 16th ACM Symposium on Operating System Principles, 1997.

[45] W. Lux. Adaptable object migration: Concept and implementation. ACM SIGOPS Operating Systems Review, 29(2), 1995.

[46] S. Maffeis. Adding group communication and fault-tolerance to CORBA. In USENIX Conference on Object-Oriented Technologies, 1995.

[47] L. E. Moser, P. M. Melliar-Smith, and D. Agrawal. Totem: A fault-tolerant multicast group communication system. Communications of the ACM, 39(4), 1996.

[48] P. Narasimhan, L. E. Moser, and P. M. Melliar-Smith. Exploiting the Internet Inter-ORB Protocol interface to provide CORBA with fault tolerance. In 3rd USENIX Conference on Object-Oriented Technologies, 1997.

[49] M. Raynal. About logical clocks for distributed systems. ACM Operating Systems Review, 26(1), 1992.

[50] D. P. Reed. Implementing atomic actions on decentralized data. ACM Transactions on Computer Systems, 1(1), 1983.

[51] G. Ricart and A. K. Agrawala. An optimal algorithm for mutual exclusion in computer networks. Communications of the ACM, 24(1), 1981.

[52] P. Van Roy, S. Haridi, and P. Brand. Mobile objects in Distributed Oz. ACM Transactions on Programming Languages and Systems, 19(5), 1997.

[53] M. Satyanarayanan, J. J. Kistler, P. Kumar, M. E. Okasaki, E. H. Siegel, and D. C. Steere. Coda: A highly available file system for a distributed workstation environment. IEEE Transactions on Computers, 39(4), 1990.

[54] G. Schlageter. Problems of optimistic concurrency control in distributed database systems. SIGMOD Record, 13(3), 1982.

[55] R. D. Schlichting and V. T. Thomas. Programming language support for writing fault-tolerant distributed software. IEEE Transactions on Computers, 44(2), 1995.

[56] F. Schneider. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Computing Surveys, 22(4), 1990.

[57] M. Shapiro, P. Gautron, and L. Mosseri. Persistence and migration for C++ objects. In Proceedings of the European Conference on Object-Oriented Programming (ECOOP'89), 1989.

[58] M. Shapiro, Y. Gourhant, S. Habert, L. Mosseri, M. Ruffin, and C. Valot. SOS: An object-oriented operating system - assessment and perspectives. Computing Systems, 2(4), 1989.

[59] S. Shrivastava, G. N. Dixon, and G. D. Parrington. An overview of the Arjuna distributed programming system. IEEE Software, January 1991.

[60] A. Silberschatz, J. Peterson, and P. Galvin. Operating System Concepts. McGraw-Hill, fourth edition, 1991.

[61] M. Sinha and N. Natarajan. A priority based distributed deadlock detection algorithm. IEEE Transactions on Software Engineering, 11(1), 1985.

[62] R. E. Strom and S. Yemini. Optimistic recovery in distributed systems. ACM Transactions on Computer Systems, 3(3), 1985.
[63] R. van Renesse, K. P. Birman, and S. Maffeis. Horus: A flexible group communication system. Communications of the ACM, 39(4), 1996.

[64] Y.-M. Wang. Consistent global checkpoints that contain a given set of local checkpoints. IEEE Transactions on Computers, 46(4), 1997.

[65] G. Weikum. Principles and realization strategies of multilevel transaction management. ACM Transactions on Database Systems, 16(1), 1991. "@en ; edm:hasType "Thesis/Dissertation"@en ; vivo:dateIssued "1998-11"@en ; edm:isShownAt "10.14288/1.0051666"@en ; dcterms:language "eng"@en ; ns0:degreeDiscipline "Computer Science"@en ; edm:provider "Vancouver : University of British Columbia Library"@en ; dcterms:publisher "University of British Columbia"@en ; dcterms:rights "For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use."@en ; ns0:scholarLevel "Graduate"@en ; dcterms:title "Enforcing crash failure semantics in distributed systems with fine-grained object mobility"@en ; dcterms:type "Text"@en ; ns0:identifierURI "http://hdl.handle.net/2429/8394"@en .