UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Nested group communication for wide-area networks Liang, Luping 1992

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata

Download

Media
831-ubc_1992_spring_liang_luping.pdf [ 7.33MB ]
Metadata
JSON: 831-1.0051179.json
JSON-LD: 831-1.0051179-ld.json
RDF/XML (Pretty): 831-1.0051179-rdf.xml
RDF/JSON: 831-1.0051179-rdf.json
Turtle: 831-1.0051179-turtle.txt
N-Triples: 831-1.0051179-rdf-ntriples.txt
Original Record: 831-1.0051179-source.json
Full Text
831-1.0051179-fulltext.txt
Citation
831-1.0051179.ris

Full Text

Nested Group Communication for Wide-area Networks B y L u p i N G L I A N G B . E n g . , Ts inghua University, C h i n a , 1982 M . M a t h . , Univers i ty of Waterloo, C a n a d a , 1985 A T H E S I S S U B M I T T E D I N P A R T I A L F U L F I L L M E N T O F T H E R E Q U I R E M E N T S F O R T H E D E G R E E O F D O C T O R O F P H I L O S O P H Y in T H E F A C U L T Y O F G R A D U A T E S T U D I E S ( D E P A R T M E N T O F C O M P U T E R S C I E N C E ) We accept this thesis as conforming to the required standard T H E U N I V E R S I T Y O F B R I T I S H C O L U M B I A M a r c h 3, 1992 (c) Lup ing L iang , 1992 In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Department of The University of British Columbia Vancouver, Canada Date ^ ' ' ' - ^ ' ^ ^ ' ' ' ' ' ^ DE-6 (2/88) « To my motherland C h i n a , To those ordinary people i n Bei j ing , To my parents and my wife. Abstract Group communication concerns sending messages to receiver groups i n distributed systems. A process group comprises a set of processes and encapsulates their internal interactions, thereby simplifying the interactions between user programs and groups of receiving processes. A l though the basic idea was proposed a few years ago, few systems or applications take advantage of it due to a lack of a comprehensive understanding of the requirements of group communication wi th respect to different classes of applications, and a lack of operating system support to meet those requirements. This dissertation consists of two parts that address these deficiencies. The first part provides a comprehensive understanding of process groups by examining their potential applications, and the requirements to the system support expected by these applica-tions. Groups are classified based on their structure and behavior. A l s o , a uniform treatment of grouping transparency is presented. The second part of this dissertation focuses on a part icular ly important aspect of group communication — group naming i n an internet. For the purposes of mainta in ing subnet auton-omy and reducing traffic on internet l inks , a nested group model is proposed to allow internet groups to contain other groups as members. B y formalizing this nested group model using a name graph, two problems i n group name resolutions are identified: resolution loops and resolution duplications. A f t e r analyz ing existing methods (centralized vs. distributed dynamic methods) and identifying their deficiencies, we propose a novel distributed static method. In-stead of detecting loops at the t ime of name resolution, the static method transforms the system view of the name graph into a special structure which is updated whenever there is a change in group membership. To guarantee correctness, the name graph transformation preserves the property that name resolutions based on the system's view of the name graph are consistent w i t h respect to the users' view. Based on the assumption that name graph updates occur much less frequently than name resolutions, static method trades a higher overhead of name graph updates for a better performance of name resolution to gain an improved over al l group message transport performance. In this part of the dissertation, a static shadow tree algorithm is designed and analyzed. T h e correctness arguments for the algorithm are provided and as-pects such as concurrency control and failure handling in name graph updates are investigated as well . A prototype implementat ion of the algorithm is conducted as an existence proof to demonstrate implementation feasibility and to test the behavior of the a lgor i thm. Contents Abstract ii Contents iii List of Tables v i List of Figures v i i Acknowledgement ix 1 Introduction 1 1.1 Thesis Mot ivat i on and Goals 3 1.2 Summary of Contr ibut ions 5 1.3 Thesis Organizat ion 6 2 Classifications and Requirements 7 2.1 Process Group M o d e l 8 2.2 Structura l Classif ication 10 2.3 Behavior Classif ication and Requirements 12 2.3.1 Determinist ic Groups 13 2.3.2 Nondeterministic Groups 23 2.4 Discussion 30 2.5 Related Work 32 2.6 Chapter Summary 33 3 Internet G r o u p N a m e Resolutions 34 3.1 Nested Group M o d e l 35 3.2 Group N a m i n g M o d e l 36 3.2.1 Group Name G r a p h 37 3.2.2 Group Name Resolution 38 3.2.3 Resolution Loops 39 3.2.4 Dupl icat ion Loops 40 3.3 A Taxonomy on R L and D L Contro l 41 3.3.1 Central ized Approach 42 3.3.2 Dis t r ibuted Dynamic Approach 43 3.3.3 Dis t r ibuted Stat ic Approach 45 3.4 Related W o r k 47 3.5 Chapter Summary 47 4 Spanning Shadow Tree A l g o r i t h m 49 4.1 Name Resolution Consistency 50 4.2 Handl ing Resolution Duplications 51 4.3 Handl ing Resolution Loops 53 4.4 Representing Topology of Ancestors 55 4.4.1 V i r t u a l Node and Derived Name G r a p h 56 4.4.2 Pathnames 58 4.5 Detecting R L s and D L s 60 4.6 Jo in Operat ion 61 4.6.1 V N Topology Information 61 4.6.2 V N Shadow Tree Construct ion 62 4.6.3 Operat ion Description 63 4.6.4 Hand l ing Para l le l Arcs 65 4.7 Leave Operat ion 66 4.7.1 Determining Retained V N 68 4.7.2 Operat ion Description 70 4.8 Correctness 71 4.9 Related W o r k 73 4.10 Chapter S u m m a r y 73 5 Analysis and Experiments 75 5.1 Communica t i on Complex i ty 75 5.1.1 Worst Case Assumptions 75 5.1.2 Message Complex i ty 77 5.1.3 Discussion 80 5.2 Prototype Implementation 81 5.2.1 Environment 81 5.2.2 Prototype Structure 82 5.2.3 Proto type Limitat ions 83 5.3 Testing Exper iments 83 5.3.1 R a n d o m Testing 83 5.3.2 Special-Case Testing 84 5.4 Chapter Summary 90 6 C o n c u r r e n c y and Resiliency 91 6.1 Concurrency Contro l Issues 91 6.2 Update Ordering Protoco l 97 6.2.1 Description 98 6.2.2 Discussion 100 6.2.3 Correctness 105 6.2.4 Message Complexity 107 6.3 Resiliency Support Issues 108 6.4 Name G r a p h Update Protoco l Resiliency I l l 6.4.1 M a k i n g Ordering Determinat ion Resilient I l l 6.4.2 M a k i n g Execut ion Resilient H I 6.4.3 Dealing wi th Failures during Update Execution 113 6.4.4 Message Complexi ty 114 6.5 Name G r a p h Resiliency 115 6.5.1 Handl ing N o r m a l A r c Failures 115 6.5.2 Handl ing V N P a r t i a l Failures 116 6.6 Name Resolution Resiliency 118 6.6.1 Name Resolution Consistency 119 6.6.2 Name Resolution A t o m i c i t y 120 6.7 Related Work 123 6.8 Chapter Summary 124 7 Conclusions and Future Research 126 7.1 Summary of Results 126 7.2 L imitat ions 129 7.3 Future Research 130 A G r o u p N a m e G r a m m a r M o d e l 145 B P r o t o t y p e Experiments 147 C Glossary 162 List of Tables 2.1 Deterministic vs. nondeterministic 29 2.2 Sample applications based on the classifications 30 6.1 Originator , working arc and working set 92 List of Figures 1.1 One-to-many communication 1 2.1 Inter and intragroup communicat ion 9 2.2 Many- to -many (group-to-group) communication 9 4.1 Examples of duplication loop control 52 4.2 The new name resolution procedure 54 4.3 A n example of the shadow tree scheme 56 4.4 Examples of pathnames 59 4.5 Spanning tree construction 62 4.6 A n example of V N breaking by a leave operation 69 5.1 The worst case name graph topology 76 5.2 Worst case overhead analysis of the join 78 5.3 Worst case overhead analysis of the leave 79 5.4 Test cases for the join operation 86 5.5 Test cases for the leave operation 88 6.1 Examples of interference to join operation 93 6.2 Generation of irrelevant messages after deleting < u,v > 103 B . l J o i n experiment 1 148 B.2 J o i n experiment 2 149 B.3 J o m experiment 3 150 B.4 J o i n experiment 4 151 B.5 Join experiment 5 152 B.6 J o m experiment 6 153 B.7 Join experiment 7 154 B.8 Join experiment 8 154 B.9 Leaue experiment 1 155 B.IO Z/eat;e experiment 2 156 B . l l Leaue experiment 3 157 B.12 Leawe experiment 4 158 B.13 Leat;e experiment 5 159 B.14 Leat;e experiment 6 160 B.15 Leave experiment 7 161 Acknowledgements I am deeply indebted to my co-supervisors D r . Samuel T . Chanson and D r . Gerald W . Neufeld, for their personal friendship and professional support throughout my graduate studies at U B C , their encouragement which helped me in gaining confidence in myself, and their guid-ance which taught me how to conduct scientific research. Without Sam and Gera ld , this thesis would never have existed. 1 am also very grateful to other members in my thesis committee, D r . M a b o Ito and D r . A l a n Wagner, for their insightful comments and suggestions. I appreciate D r . Larry Peterson at the University of A r i z o n a for his serving as the external examiner. M a n y other people at U B C have been very supportive. M a n y thanks to Ian Cavers, Wei Chen , Francois Ja lbert , Hilde Larsen, Hongbing L i , Geng L i n , G a n g L i u , Y i n g L u , Runping Q i , Y u n X i e , Cheng Y a n , Jiansheng Zhao and Y i n g Zhang and many others for their sharing the good time and smile w i th me. Their good wishes and positive encouragements always cheer me up. Thanks to Teresa Przy tycka , Roger C h i n , and Felix Jaggi for lending me their ears and their constructive comments, and to Wenjing Zhu for t ry ing to crash the prototype. Thanks to D r . B a r r y B r a c h m a n for carefully reading this thesis and making corrections, and to Donald A c t o n , M u r r a y Goldberg , Norman Goldstein, A r t Pope and M i k e Sample for helping my Engl ish. Thanks to Peter Phi l l ips for keeping SunDraw working and George Phill ips for helping me to get my L^TgX running at B N R . ChinaNet, which carried the technical discussions on building network connections to C h i n a [Quarterman 90], and China News Digest (CND), which brings the daily news about C h i n a to thousands of oversea Chinese students and scholars all over the wor ld , are electronic mai l distribution lists built and maintained by tens of volunteers in N o r t h Amer i can universities. M u c h experience was gained from my technical involvement in setting up and managing both distribution lists. I wish to extend my thanks to those volunteers who worked together with me in these projects. I would never become what I am without the unconditional love, understanding and support from my parents. Their letters from home have always been a constant source of strength and courage, and their determined faith in me has inspired me every step in my life. Above a l l . it is beyond the power of word to express my debt to the person closest to my heart, my wife Yuel i , for her patience, her encouragement, and her endless love. I cannot imagine what it would have been like without her. The research of this dissertation was supported by grants from the N a t u r a l Sciences and E n -gineering Research Counc i l of C a n a d a , the Graduate Fellowship from the University of Br i t ish Co lumbia and the Fellowship from B C Tel . Chapter 1 Introduction In computer networks, multicast (one-to-many communication) is a message transmission mechanism that delivers a message from a single source to a set of destinations. Special cases of multicast are unicast (one-to-one communication) and broadcast (one-to-aU communication) . A s F igure 1.1 shows, a single multicast transmits a message to a set of destinations in parallel . Figure 1.1: One-to-many communication. allowing the receivers to process the message concurrently. The fact that multicast reduces communication overhead and provides better concurrency than unicast can be seen from the following. The communication overhead of an operation is generally measured in terms of cost of bandwidth — the number of messages the operation triggers, and cost of host loading — the number of packet events that occur at the host(s) involved during execution of the operation. Packet events are defined as message transmissions and receptions [Mockapetris 83, Zwaenepoel 84]. To multicast to N{> 1) receivers and receive reply from each, the sender does one multicast send^ and receives N replies, and each receiver does one receive and one reply after the message has been processed. Assuming no message is lost, the host loading cost is 1 + 3N packet events and the bandwidth cost is 1 + N messages. Sending a message to the same set of receivers using unicast results in 4N packet events in host loading and 2N messages in bandwidth cost. Furthermore, a l l these packet events often occur sequentially, which is not acceptable in many applications [Satyanarayanan and Siegel 90]. Mult icast is a network abstraction; group communication is an operating system abstraction. Generally speaking, a group is a set of objects sharing common application semantics, as well as the same group identifier or multicast address.-^ Therefore, each group can be viewed as a single logical entity, without exposing its internal structure and interactions to users. Generally, objects are grouped for (i) abstracting the common characteristics of group members and the services they provide; (ii) encapsulating internal state and hiding interactions among group members from the clients so as to provide a uniform interface to the external world; and (iii) using groups as bui ld ing blocks to construct larger system objects. The exchange of messages among groups is called intergroup communication ( I G C ) . Group communication offers improved efficiency and convenience. • It provides a high-level communication abstraction to simplify the interaction between user programs and a group of receivers (for example, instead of using multiple one-to-one ' A s s u m e b r o a d c a s t m e d i u m s u c h as E t h e r n e t [ M e t c a l f a n d B o g g s 76]. ^ M u l t i c a s t exists at the m e d i u m access sublayer . T h e m a p p i n g between g r o u p identifiers a n d mult icast addresses is i m p l e m e n t a t i o n d e p e n d e n t a n d not necessari ly o n e - t o - o n e . See S e c t i o n 2.3.2 for e x a m p l e s . send operations to deliver a message to ind iv idual members in a group, only one group-send operation needs to be performed in the user program and only the group id has to be known) . • It hides the organization of a group (e.g., membership of the group) and internal coordi-nation among group members (e.g., synchronization among members) from applications. • It provides a high-level abstraction to take advantage of the multicast capability of a net-work, thereby, reducing the costs of network bandwidth and host loading, and increasing message processing concurrency. Group communicat ion is best supported by network multicast . It can also be emulated by unicast, or s imulated by network broadcast. However, when emulated by unicast, not only is there higher overhead in delivering and processing messages sequentially, but the sender must keep track of group membership; thus, a group is no longer self-contained. W i t h network broadcast, extra host loading overhead is required because al l machines must examine every message regardless of its destination; furthermore, the communication is less secure since group messages can be seen outside a group. 1.1 Thesis Motivation and Goals Although mult icast was introduced some years ago, the advantages of group communication have not been fuUy realized. The reasons for this are: 1. a lack of understanding of the requirements of group communication with respect to different classes of applications; 2. the fact that few systems provide sufficient group communication support at the operating system level to meet these requirements; and 3. unti l recently, a lack of adequate multicast hardware support. This thesis addresses only the first two problems. The first part of the thesis provides a comprehensive discussion of process groups. Potent ia l appUcations o f process groups and the required group communication support are examined. Groups are classified according to their internal structure and external behavior. Furthermore, a uniform treatment of grouping transparency is presented. The second part of the thesis investigates internet group naming support. In an internet environment, an internet group G may contain members in more than one subnet. The sub-nets are interconnected v ia relatively expensive internet Unks. In order to save internet link bandwidth, reduce message transmission delay, and support subnet /subdomain autonomy, we propose to organize internet groups using a nested naming structure so that members in each subnet /subdomain of a group G constitutes a subgroup. A message sent to G results in a single message being sent to each subgroup, where the message is delivered to the local mem-bers. Natura l ly , a group may contain other groups. The membership in a subgroup normally need not be known outside its subnet/subdomain for reasons of autonomy. Distr ibut ion lists in messaging systems are examples of nested grouping. Unfortunately, there are two undesirable problems w i t h the nested group structure: 1. A nested group may include itself directly or indirectly, forming a recursive group. This often occurs in distribution lists in message handling systems. For example, a mail ing list 6 ' i at the University of Br i t i sh Co lumbia may contain a maiUng list G2 at the University of Waterloo (as its member), which may in turn contain a mai l ing list G3 at the University of Toronto. W i t h o u t knowing that G2 is a subgroup of G i , the list manager at the University of Toronto may include G\ in G3, creating a recursive group — a group that contains itself as a subgroup. When recursive groups are allowed, delivering a message to such a group may become an infinite process; that is, a name resolution request for a recursive group may never complete. 2. Another problem is duplicate message propagations because of duplicate name resolutions. This problem occurs when a group G2 is included in at least two other groups that are themselves included either directly or indirectly in the same supergroup Gi, so that a message to Gi w i l l be sent to G2 more than once through the different subgroups of Gi, and every member of G2 wiU receive multiple copies of the same message. A l though this problem does not prevent the network from functioning correctly (because duplicate messages may be detected by receivers using, for example, a message sequence number) , it does cause unnecessary message propagation that wastes internet bandwidth, especially when G2 contains many subgroups. A n examination shows that existing approaches do not handle these problems correctly. The second part of this thesis proposes a new approach to solving these problems. Issues associated with the new approach, such as concurrency control and failure handl ing , are also discussed. 1.2 Summary of Contributions The major contributions of this thesis are summarized below: 1. Analys is and classification of process groups on the basis of the group internal struc-ture and the external behavior. This provides a comprehensive understanding of process groups w i t h respect to their potential applications and the requirements to group com-munication mechanism. This classification also provides a uni form treatment of grouping transparency. 2. Extension of groups to a nested structure for supporting efficient internet group commu-nication as well as subnet /subdomain autonomy. Based on a formal name graph model, problems related to group name resolution in an internet environment are identified and analyzed. E x i s t i n g approaches are investigated and their shortcomings analyzed. A new static approach to handling these problems is proposed. 3. A thorough study of the design, implementation and analysis of static algorithms, which avoids resolution loops and controls resolution duplications for internet group communi-cation. Th is also includes the design of concurrency control and failure handling protocols in group membersiiip updates, as well as the protocol for atomic and ordered group com-municat ion . 1.3 Thesis Organization T h e remaining six chapters are organized as follows. In Chapter 2, process groups are an-alyzed and classified according to their internal structure and external behavior to provide a better understanding of group appUcations and their system support requirements. In C h a p -ter 3, a nested structure for internet groups is proposed to reduce communication overhead on internet links and support autonomy in subnets/subdomains. Furthermore, problems related to the name resolution of nested groups are investigated, and formal models as well as correctness criteria for internet group naming are proposed. A taxonomy of existing approaches to deaUng wi th resolution loops and message duplications is also provided. In Chapter 4, a shadow tree a lgor i thm is designed. It is a distributed static algorithm which controls resolution loop and resolution dupl icat ion at the time when the name graph is updated. In Chapter 5, the com-municat ion complexity of the shadow tree algorithm is analyzed, and the implementation and testing of a prototype are reported. The aspects of concurrency control and failure handling are investigated in Chapter 6. Chapter 7 contains the conclusions of the thesis and an outline of future work. A n up-to-date bibliography of group communication is included at the end of the thesis. Chapter 2 Classifications and Requirements In Chapter 1, we saw that group communicat ion using multicast provides many advantages over broadcast and unicast communicat ion methods. To bui ld a group communication mecha-nism i n distributed systems, we need a good understanding of the k i n d of applications for which groups wiU be useful and the k i n d of operating system support expected by these applications. This chapter is intended to serve this purpose. In this chapter, process groups are classified based on an analysis of group internal structure and applications.^ The organization of this chapter is as follows. T h e process group model is defined in Section 2.1. In Section 2.2, we classify groups on the basis of the internal homogene-ity among group members. A group is classified into one of four categories: data and operation homogeneous, operation homogeneous only, data homogeneous only and heterogeneous. In Section 2.3, different distributed applications of groups and their group communication re-quirements are examined. Here, groups are classified into deterministic and nondeterministic categories according to their external behavior. Grouping transparency is identified as a de-sirable property commonly expected by group applications. The relationship between the two classifications is discussed in Section 2.4. Section 2.5 contains a brief overview of the related work in the area. ' A m o d i f i e d version of this c h a p t e r has b e e n p u b l i s h e d [ L i a n g et al 90a]. 2.1 Process Group Model A process group is defined as a set of processes ti iat cooperatively provide a service. Each process, serving as an object manager, maintains a set of objects. Generally, an object can be characterized by a set of variables and a set of operations defined on those variables. Object managers control the way that objects are accessed; that is, users have to send their requests to the object managers to execute the operations associated with the objects managed by the managers. Object managers in the same group share the same group id . They may have to coordinate among themselves to mainta in consistency among the objects that they maintain. It is sometimes useful to have an object be accessible to more than one group. Therefore, a process (the manager of the shared object) can be a member of several groups and process groups can overlap. In distributed systems, machine boundaries prevent processes on different hosts from phys-ically sharing memory. In the following discussions, urdess stated otherwise, we assume that interactions wi th process groups are through message passing.'^ The underlying network can be either a local area network ( L A N ) or an internet. We make no assumption regarding the implementation of group communicat ion. A group is closed when only its members are allowed to send requests to i t ; otherwise, it is open. In this chapter, we assume the open model , as it is more general and corresponds to the client-server model commonly used i n distributed operating systems. In the client-server context, the process invoking an operation is called a client and the process receiving and processing the invocation is called a server. A process can play both client and server roles, depending on its communication context. This client-server model can be extended to group communication; that is, client group and server group can be defined similarly. As Figure 2.1 indicates, communicat ion between external clients and a server group is caDed intergroup communication and internal communication among group members is known as intragroup communication. Intergroup communication could also occur between groups in ^ O t h e r c o m m u n i c a t i o n p a r a d i g m s s u c h as r e m o t e o p e r a t i o n i n v o c a t i o n c a n be i m p l e m e n t e d o n t o p of message pass ing . Client Inter-group Communication Intra-groap '^r-Sr*' Communication v—J Server Group Figure 2.1: Inter and intragroup communication. Figure 2.2: Many- to -many (group-to-group) communication. the form of many-to-many communication as shown in Figure 2.2. Sending messages from a single process to a group is called one-to-many communication, and from a group to a single process is called many-to-one communication. Usually many-to-many communication can be decomposed into one-to-many and many-to-one communication [Cooper 85]. 2.2 Structural Classification Viewed as a collection of object managers, a group can be classified on the basis of the liomogeneity of the internal state of the objects maintained and operations supported by each group member. A n object manager can be characterized by: • Application level objects — the set of objects maintained by the process. Their value determines the internal state of this process. • Application level operations — the set of operations that can be executed on the applica-tion level objects. Other processes can modify the value of these objects only by invoking these operations through this manager. Operations on the objects are what define the services a process provides. Because services are accessed through interprocess communication in message passing systems, operation executions and process state transitions are stimulated by the events of message arr ival . A selection rule of a group G is a set of criteria for selecting accessible objects in G and is determined by G"s appl ication. For simplicity, we assume that an object is accessible in group G if and only if the object satisfies G 's selection rule, and that a process is a member of G if and only if the process maintains at least one object satisfying that rule. The services provided by a group are implemented by the group members cooperatively and are accessed by clients through intergroup communicat ion. A group G can be characterized by: • Group objects — the subset of objects maintained by each member process in G'. which satisfies G's selection rule. • Group operations — the set of operations on the above objects. Depending on how each group member implements and maintains objects and operations, a group can be placed in one of four categories: D a t a and operation homogeneous ( D O H ) : Every member in a D O H group maintains a com-plete replica of the set of group objects and implements an identical set of operations on these objects. To guarantee consistent external behavior, a D O H group maintains consis-tency among replicas of the objects and requires every member to execute exactly the same sequence of operations. D O H groups are used mainly to increase service rehabil-ity and availabiUty. Examples are groups in Isis [Birman and Joseph 87a], troupes in Circus [Cooper 85] and fault-tolerant state machine [Schneider 90]. Operat ion homogeneous only ( o H o ) : In an O H O group, the object space is partitioned among group members — each member maintains only a part of the global group state. Object space partitions may overlap. A l s o , every member supports an identical set of op-erations on its port ion of the objects. However, when an operation is invoked on an object, only members with the relevant object need to perform the operation, O H O groups are mainly used to distribute the work load among group members. Each member may main-tain the integrity of its own objects independently of the other members. The distributed name service discussed in Section 2.3.2 is an example of the O H O group [Ahamad et al 88]. D a t a homogeneous only ( D H O ) : Members in a D H O group share a set of objects by sharing the same address space on a single machine, or in some other distributed manner (for example, data replication). Each member supports a set of operations on the same objects. These operations may or may not be identical to those of other members. Upon invocation of a group operation, members may act differently. For example, a coordinating member accepts an operation invocation from a client and accompUshes the task through internal cooperation with other group members. The role of coordinator need not always be played by the same member. Also , members in a D H O group must synchronize themselves to serialize concurrent updates to the objects. This requires an underlying mechanism similar to that for D O H group support. The usual purposes of a D H O group are to provide group services cooperatively v ia a set of worker processes; and to simplify the design, implementation and interface of the service by masking member cooperation from external observation. Examples include teams in V [Cheriton 88b] and the primary-secondary replication scheme [Schroeder et al 84]. Heterogeneous ( H E T ) : A S far as the group application is concerned, the objects and op-erations each member implements and maintains could both be heterogeneous. There may or may not be cooperation among group members, and their internal states may be completely independent of one another. Rather than encapsulating interactions among members to provide a cooperative group service, heterogeneous groups facilitate system control and simplify the interactions between the cUent and server groups. Electronic mai l distribution lists, computer conferencing, news groups and distr ibuted process-control are applications of heterogeneous groups. 2.3 Behavior Classification and Requirements One cannot appreciate the design requirements for operating system group support without understanding the quality of service requirements of the intended applications. This section reviews various group applications, highlighting the service requirements expected from the group system, and classifies process groups according to the external behavior expected by the appUcations. According to their external behavior, distributed process groups can be classified into two major categories: deterministic and nondeterministic. The former groups are mainly used in replicating data and services to enhance reliabil ity; the latter are mainly used in distributing data and workload among multiple servers to improve information availabil ity and resource sharing. More complete definitions of the two categories appear later. Basically, a deterministic group requires high rehability in group communicat ion to maintain strong consistency among members. Such groups are "heavy weight" in the sense that they require the communication layer to mainta in complete group membership information and atomic , consistently ordered group interactions. In contrast, nondeterministic groups are "light weight" since they need only basic datagram mult i -del ivery transport support. Inconsistencies and unreliable group interactions are handled in an application-specific manner, resulting in more flexible and efficient — but more complex — application programming. Whether a group is deterministic or nondeterministic depends solely on its application, not on its structural characteristics. 2.3.1 D e t e r m i n i s t i c G r o u p s A group is deterministic if it requires that every member in the group receive and act on a request."^ In addit ion , the order of the requests being processed is the same at each of the recipients. This requires coordination and synchronization among the group members. In most deterministic groups, member processes are equivalent [Cooper 85]'^ — upon receiving the same request in the same state, the same procedure is invoked, and every member transfers to the same new state and produces the same response and external effects. We now look at some deterministic group applications in terms of their basic characteristics and communicat ion requirements. Replicated Fi le Systems In a fuUy replicated file system, a l l file servers constitute a group. Files are replicated at every file server to enhance file availability and reliabil ity. T w o common methods of supporting replicated file systems are peer-member and primary-secondary [Chang and Maxemchuk 84b]. In the peer-member scheme, al l members in a file server group are identical and group commu-nicat ion and coordination occur between the client and al l members. In the primary-secondary scheme, a pr imary member handles the communicat ion interface between cUents and the group, and group communication and coordination occur between the pr imary and al l secondaries in -^ T h e t e r m d e t e r m i n i s t i c is used to c h a r a c t e r i z e the r e l a t i o n s h i p a m o n g g r o u p m e m b e r s , a n d is less restr ict ive t h a n t h a t used by s o m e a u t h o r s to m e a n o n l y one possible e x e c u t i o n of a single p r o c e d u r e . * T h e r e are cases w h e r e m e m b e r s of a d e t e r m i n i s t i c g r o u p are not equivalent . See Sections 2.3.1 a n d 2.3.1. side the group. A fuUy replicated file server group is a deterministic DOH group i f the peer-member scheme is used, or a deterministic DHO group if the primary-secondary scheme is used . Rephcation transparency is an important characteristic of these systems. Clients of a file system usually prefer a single file image regardless of whether a file is implemented by a single server or many servers. The file abstraction as seen by a client is called a logical file image a n d operations on it are called logical operations. The physical file copies maintained by the file servers are known as file replicas and operations on them are called physical operations. ReUabiUty requires that file replicas be kept consistent at al l servers, so that files are always available to clients as long as at least one file server is functioning. Avai labihty requires that users should be able to read the files with m i n i m um latency, reading a consistent copy of a file from the closest functioning server, for example. Both reliabil ity and availability imply that file replicas must be consistent. Thus, every logical file update must be atomic; that is, either executed by aU servers or by none, and a client must be informed whether its update has completed. Furthermore, because different sequences of the same set of update operations can result in different file states, all servers must execute exactly the same sequence of operations wi th respect to each logical file. Two logical operations opi and op2 are in conflict if they are data dependent; since they manipulate overlapping logical data , the execution order affects the results. Two conflicting operations collide if they are executed concurrently. Consider two clients C'l and C2 that issue a LockFi le request independently to acquire a lock on the logical file F. If the two requests are not consistently ordered at a l l server group members, some servers may allocate the physical locks on their physical rephcas of F to C ' l , others to C 2 , depending on which request is executed first: thus, a deadlock may occur. Clearly, colliding operations must be ordered: the order can be arbitrary as long as it is consistent at all member sites . This ordering of colliding operations is called absolute ordering [Birman a n d Joseph STa]. File servers can be added or deleted dynamically. Clients expect the file server sToup to coordinate internally to hide membership changes from external observation. This type of internal coordination includes keeping consistent name binding between a g r o u p identifier a n d a set of server process identifiers, and bringing the state of a new server up-to-date. Satisfying this requirement not only simplifies the group interface to clients, but also increases the flexibility of file system configuration. Since a host cannot always distinguish between network partit ioning and host failure, file system level consistency requires that both clients and servers be notified when either failure occurs, and that some higher level consistency control algorithms be used to handle the fai lure. Replicated P r o g r a m Execution Replicated program execution is another example of the deterministic DOH group. In some distributed appl icat ions, a major concern is the resiliency of computations. Programs can be decomposed into abstract data type modules, each having a set of internal states and a set of procedures manipu la t ing the states. Modules are repKcated at multiple sites [Cooper 85] and replicas of a module form a group.^ Group members may be different implementations of the same abstract d a t a type writ ten by different programmers (subject to the constraint of equiva-lent deterministic behavior) . Thus , module rel iabi l i ty can be enhanced by both replication and multiversion programming . Mult ivers ion programming tends to reduce program design faults (assuming fail-stop behavior) ; repUcated module execution tends to make the module run time execution robust. A procedure call to a module can proceed as long as at least one group mem-ber is functioning. A n appUcation programmer would expect that the syntax and semantics of a replicated procedure call is the same as for a non-replicated one, making the rephcation transparent. Consistency in replicated procedure calls requires that the group members be deterministic and equivalent and that every member execute the same sequence of calls. It is the responsi-biUty of the appl i cat ion programmer to ensure that modules are deterministic and equivalent. The group system, on the other hand, must guarantee atomicity and ordering (defined in the previous section) to each procedure call , as well as rephcation transparency. For each repli-cated procedure caU from a client group to a server group, each client group member makes " A l s o d e n o t e d as a t r o u p e i n the or ig inal reference [ C o o p e r 85]. a one-to-many call to the server group, each server group member accepts many-to-one iden-t ical procedure invocations (resulting from one call per cUent group member) and makes a one-to-many reply to the cUent group after execution. Eac h cUent group member then handles many-to-one repUes from the server group. These mult ip le calls and receptions are best han-dled in the group communication layer so that chent/server programmers need not be aware of multiple entities in the caUer/caUee group. A n advantage of making replicated procedure calls at the module level is that the degree of rephcation can be adjusted dynamically according to the functions of different modules, thus opt imizing system performance and reUabiUty. In other words, more copies of crit ical modules could be made to enhance their reUabihty and fewer rephcas would have to be made for less important modules. A lso , dynamically changing group membership allows system auto-reconfiguration to be transparent to apphcations as group members fai l and recover at run time. However, this dynamic group membership makes it difficult to bind a group name to a set of modules. A repUcated program execution often happens wi th in a single local area network; therefore, network part i t ion failure is unUkely. W h e n any group member fails, it is expected to reconfigure itself autonomously, making part ial failures transparent to cUent groups. Distributed Industrial Process Control Distr ibuted process control is an example of the deterministic H E T group. Imagine a simple distributed process control environment in which the temperature in a reaction container is to be controUed. A sensor measures the current value of the temperature and a console displays that value to human operators, records the history of the sensor signals and aUows an operator to set the control parameters. A set of controUers manages the How of cooUng fluid into the container by opening or closing valves. The sensor can view the console and the controUers as a group. W h e n a control parameter diverges from a preset value, the sensor multicasts the measured result to the group so that the console informs the operators and the controUers open or close the valves to adjust the flow. This group is obviously heterogeneous because each member, although driven by the same sequence of signal s t imul i , maintains completely different objects and performs different opera-tions. These components are grouped together simply for convenience of communication; they are identified by a single receiver ID i n each group message. If the underlying network supports multicast , only one copy is transmitted for each sensor signal, saving network bandwidth and speeding the signal processing. A l though there is no application level consistency requirement, this group communication has rel iabil ity requirements. Unless the sensor fails, its signals must be reUably delivered to aU active members i n the same order as generated. Also , each signal from the sensor must be delivered by a predefined deadline or the signal wi l l become obsolete. The sensor expects no reply from the group. A n y indiv idual member failure must be detected quickly and operators must be notified to repair the failed component. Before recovery of the failed component, however, the remaining group members are expected to continue fulfi l l ing their duties even though the complete group service may not be available. E m a i l Distr ibution List and C o m p u t e r Conferencing Electronic m a i l d istr ibut ion lists and computer conferencing are two other examples of the deterministic H E T group. People registered in the same distr ibut ion list or conference constitute a group. Group ing is for convenience of communication; for each message, a sender prepares a single copy and performs one send operation. Generally, a sender does not know who participates in the distr ibut ion l ist , because registration is handled by an independent authority. The message is sent to the address of the l ist , which has the same syntactic format as other single-user addresses and which logically includes a l l part ic ipants . The same is true for computer conferencing except that , normally, conferences are closed groups in which only the participants can send messages to the conference. In contrast, a d istr ibut ion list is open to anyone with access to the its address. Assuming registration and network connections are set up properly, messages to a conference are expected to be delivered atomically. If any participant receives a group message, then other members in the same group should also receive the same message within a bounded period. For each message deUvered. however, the sender may receive zero or many repUes from recipients, either in the form of private one-to-one correspondence or as foUow-up discussions to al l conference participants. In a foUow-up message m , the speaker makes his point on the basis of al l the messages related to m that the speaker has received: these messages are called the context associated with m. For other participants to understand m properly, they must be able to reference its context [Peterson 88, Peterson et al 89]. A recipient should not see a message without having also received its context. This dependence relationship is sometimes called a causal relationship [Birman and Joseph 87a] and defines a partial ordering among messages submitted to a conference. In both distr ibution lists and conferencing, absolute ordering is not required. Concurrent messages usually can be deUvered in arbitrary order because they are not context related and because the partic ipants ' states are not message dependent. GeneraUy in a conference, a single person is elected as the conference chair who determines when to start and close the conference and how to arbitrate the order of concurrent messages when necessary. Sending and receiving messages can be concurrent and asynchronous for each participant. Because incoming messages may affect the content of out-going ones, receiving may be given higher prior i ty than transmission. In conferencing, a foUow-up message may become a competitive orphan (explained in Sec-tion 2.3.2) because of the asynchronous communication pattern. In this s i tuat ion, several people may respond identicaUy to the same message before seeing responses from each other. Fast deUvery makes this problem less Ukely but does not eUminate it entirely. Membership changes should not affect the group communication. A new member to a group normaUy becomes up-to-date by reading the conference buUetin board' ' or the discussion archive. The failure of a member is normaUy treated as i f it has left the group. ' ' A b u l l e t i n b o a r d m a y be necessary as a conference p r o c e e d i n g so that speeches de l ivered to a conference c a n be r e a d later . E v e n wi th d i s t r i b u t i o n lists, a message archive for group discussions is very useful. Distributed Databases A deterministic H E X group may be used within a distributed database. A distributed database model has a transaction manager ( T M ) and a data manager ( D M ) at each site. Each T M accepts user requests and translates them into commands for D M s . Each D M maintains part of the database stored at its site and may concurrently execute transactions from multiple T M s . A transaction group consists of a l l D M s part ic ipat ing in the same transaction. This group is heterogeneous because each D M maintains a different part of the database and each may respond differently (accept or reject the pre-commit from the T M ) according to its local status. Messages from the coordinating T M are atomically delivered to the transaction group. For the purposes of concurrency control and failure recovery, a two-phase commit ( 2PC) protocol is normally used at the end of each transaction. W h e n the coordinating T M and al l cohorts ( D M s ) that know the decision (commit or abort) fa i l , the conventional 2 P C wil l be blocked unt i l some processes recover. In [Chang and Maxemchuk 84a], a simple nonblocking 2 P C is designed on the basis of or-dered atomic group communication. In their nonblocking 2 P C , the protocol proceeds as the conventional 2 P C in the absence of a coordinating T M failure. W h e n the coordinating T M fails, a communication-layer-generated failure notification is broadcast to the group and each D M aborts the transaction upon receiving this notice. This protocol requires not only an atomic but also an ordered group communicat ion mechanism to guarantee: 1. that every D M in a transaction group wi l l receive exactly the same sequence of instructions from the coordinating T M ; and 2. that if the coordinating T M fails after broadcasting commit , its failure notification is delivered either before or after its commit message consistently and atomically to all D M s . allowing them to either commit or abort the transaction consistently. A l t h o u g h the membership of a transaction group is governed by the coordinating T M and a D M failure is detected by the T M during the execution of 2 P C . the communication layer stil l needs this membership information for ordered atomic communication to support the nonblocking algor i thm. Inconsistent repUes from D M s are handled by the coordinating T M itself rather than by the group mechanism. S u m m a r y of Requirements The above analysis of deterministic group appUcations makes clear that grouping trans-parency is a desirable property aUowing a cUent to treat a service of a group as if it were provided by a single server. A single cal l is made and a single result, i f any, is expected from the server group. Also , a server mainta in ing group object repUcas need not know that another co-server exists, and thus appUcation programmers need not be aware of group coordination. Grouping transparency for deterministic groups must satisfy the foUowing requirements: Communicat ion transparency. Communicat i on transparency consists of two aspects — atomicity and ordering [Birman and Joseph 87a]. • Atomic message delivery. A n atomic group message is either received and processed exactly once by aU members in the recipient group or by none at aU. Atomic i ty hides a part ia l group communicat ion failure by converting it into a total failure. • Application level absolute ordering/ The deUvery order of messages to group mem-bers needs to be synchronized if the order wiU affect the result. Every member of a group must see the same sequence of requests on dependent data and adjust its inter-nal state accordingly. A l s o , in the case of coUiding requests, the common members of two overlapping groups must see a consistent "combined" sequence of requests to both groups. '^It is w o r t h n o t i n g t h a t some s y s t e m s i n t r o d u c e a n o t h e r t y p e of g r o u p message o r d e r i n g w h i c h respects the c a u s a l r e l a t i o n s h i p a m o n g messages. It is o u r bel ief t h a t m a i n t a i n i n g c a u s a l r e l a t i o n s h i p is a general requirement in process c o m m u n i c a t i o n , even i n the case of o n e - t o - o n e c o m m u n i c a t i o n . O n c e a b s o l u t e o r d e r i n g is s u p p o r t e d , a g r o u p c a n be t r e a t e d as a single ent i ty a n d c a u s a l l y o r d e r e d g r o u p c o m m u n i c a t i o n c a n be bui l t o n top of absolute o r d e r i n g using any causa l s y n c h r o n i z a t i o n m e c h a n i s m designed for one - to -one i n t e r p r o c e s s c o m m u n i c a t i o n . R e p l y handling transparency. Because client and server interactions normaUy foUow the request-response pattern [Cheriton 86b], reUable multicast from cUent to server is not enough. RepUes from a group also have to be coUected and processed properly to achieve grouping transparency [Birman and Joseph 87a, Cooper 85]. For each group request there exists a potential for multiple responses from a server group. These responses may or may not be identical . Reply-handUng transparency guarantees that a cUent need not be aware of multiple repUes to its request. It sees a single reply result without having to be concerned wi th how this result is derived from multiple repUes (for example, by weighted voting). Mul t ip l e t imely repUes may cause congestion (in a short period) at the original message sender i f the speed of reply processing at the latter cannot catch up with that of the reply arrivals. Some reply messages may be lost due to message receiving buifer over run . For a system to be useable, a mechanism for coordinating and reUably deUvering multiple t imely repUes wiU be required. N a m i n g transparency. This involves the problem of dynamicaUy and transparently binding group members to a single name. Group naming consists of two parts: mapping a logical service name into a group of servers and aUowing group membership to change dynam-icaUy. A group-view is defined as a snapshot of the group membership at a particular instant of t ime. It is maintained by each of the concerned parties, be they the group members, the system name servers, or any cUent needing to make decisions on the basis of the group-view. A group-view changes in paraUel wi th other group message activities as members fai l and recover, or are inserted and deleted. Since atomic group interactions rely on a consistent group-view to verify that aU active members have confirmed reception of each atomic message, group-view changes have to be detected in a consistent manner. It is convenient to seriaUze group-view changes con-sistently wi th respect to other group message activities. Isis group members achieve this by having a system generated announcement, issued on behalf of the failed member,^ follow al l its pre-failure messages. This announcement arrives at every member in the same order wi th respect to the other group messages, so that the members wi l l aU see the same sequence of group-view transitions at v irtual ly the same t ime. Therefore, it is guaranteed that no message w i l l arrive from a failed member once its failure is announced [Birman and Joseph 87a]. If a member recovers, the state of the new member must be brought up-to-date w i t h a consistent snapshot of the group internal state, and al l con-cerned parties should perceive the existence of the new member before any message is received from i t . Failure transparency. Depending on the purpose of a group, either the cUents and server group members are notified of the failure to take appUcation level recovery actions, or a member failure is hidden from the cUents. In the latter case, the failure may be pre-sented as a complete group failure, or other group members may take over the role of the failed member. The technique chosen depends on the function of the group. W h e n a deterministic group is used to enhance data reUabiUty, as in repUcated file systems or databases, strong consistency is required among the group object repUcas. Therefore, the group consistency control strategy may exaggerate a part ia l failure as a total failure;^ if any member fails or the network is partitioned during a group operation execution, the operation cannot succeed unt i l the failure is repaired or the group is properly reconfigured. When a group is used to enhance service reUabiUty, however, as i n repUcated procedure caU, maintaining service to the cUents is important . Dur ing recovery of the failed member, the remaining active group members should continue to fulfill their duties to keep damage to a m i n i m u m . Real-time requirement. We can measure multicast message delay in terms of distribution time — the time taken for aU operational members in a group to receive a multicast, or * I n this c o n t e x t , fa i lure also i n c l u d e s d e l e t i o n , a n d recovery also inc ludes i n s e r t i o n . ^ A s imi lar i d e a has been u s e d in a t o m i c message delivery. in terms of completion time — the time talcen for the sender to learn that al l destinations have received its message reliably [Mockapetris 83]. In deterministic group applications, consistency is more important than efficiency. W h e n trade-offs between the two have to be made, system support often gives pr ior i ty to consistency. Usually in real-time systems, on the other hand , messages must be delivered w i t h in a client-specified deadline; otherwise, a message is deemed obsolete and a t iming fault is triggered. In one-to-one interprocess communication, a server can simply ignore a t iming fault mes-sage or respond to the client wi th an operation failure. In group communicat ion, it is possible for some servers to receive the message on time while others see a t iming fault, even though atomicity and ordering are guaranteed. W h e n this occurs, servers in the recipient group must coordinate to act consistently on each t iming fault message to guar-antee consistency. Mult icast distr ibution t ime should be bounded so that group actions can be scheduled to occur atomically and simultaneously in v i r tua l time at aU group members [Crist ian et al 85]. 2.3.2 N o n d e t e r m i n i s t i c G r o u p s Determinist ic groups require strong data and behavior consistency, and synchronization among aU members. Nondeterministic groups assume that their appUcations do not need such strong bui l t - in consistency and they relax it i n various appUcation-specific ways. Nondetermin-istic group members generaUy need not be equivalent [Cooper 85]. Each member may respond differently to a group request, or not respond at aU, depending on the state and function of each indiv idual member. NormaUy, either member states of a nondeterministic group are unaffected by processing user requests or they are not necessarily consistent. Either requests to such a group do not require aU members of the group to act or missing requests can be detected and recovery completed within the appUcation. Because of the relaxed consistency, maintenance overhead for the group is generaUy lower than that for a deterministic group. Whether a group should be implemented as deterministic or not depends on the requirements of the appUcation. Next , we wiU describe some nondeterministic group appUcations in terms of their basic characteristics and communicat ion requirements. Distributed Clock Service The time-of-day service in the V system [Cheriton 88b] is an example of a nondeterministic DOH group, in which a l l group members implement the same set of functions and play the same role in the service. A l t h o u g h al l members are supposed to maintain identical objects, the state of these objects may differ when a part ia l group communication failure results in some members not receiving a group request. In the distributed time-of-day service, every station periodically receives a clock tick from a central clock. Between clock ticks, each clock replica simply caches the latest time stamp from the central clock and extrapolates forward using its local clock. Clock drifting can be corrected b y the next clock tick. T i m e requests are handled using the locaUy extrapolated time values. Should any clock update message be missed, the next clock update corrects i t . A l though the time value stored in the local clock may not be absolutely correct, it is accurate enough for most non-t ime-crit ical appUcations. Similarly, many appUcations that repUcate data do not require absolute consistency. The required level of consistency is obtained by using appUcation semantic knowledge and assuming that the cUent can detect, recover from, or tolerate inconsistencies. This reduces communication complexity and promotes efficiency. The nondeterminism in these appUcations stems from the fact that group members may maintain an inconsistent or inaccurate global group state. AppUcations of this type need only an efficient and "best effort" multicast mechanism. Distributed N a m e Service In a distributed name service, a set of name servers operate at several machines in the net-work. In some designs the global name space is partit ioned and a different name server maintains each partit ion [Ahamad et al 88]. This type of distributed name server forms a nondeterminis-tic O H O group because every member performs the same set of operations. As an example, each object in the operating system Clouds is assigned a logical system name [Ahamad et al 88]. A mapping function LJ maps a system name onto a multicast address A = u){S). Each station maintains a multicast address table for the objects stored at the node. To locate an object O , a user provides its system name S. The client host first calculates O's multicast address A = u){S) and uses A as an index to search its local object directory. If it is not found, a multicast remote procedure call is invoked on the address A. Nodes which have A in their multicast address table accept the remote procedure cal l , search the local object directory for the object wi th the system name 5 , and reply if they found i t . In this way, al l nodes share the overhead of mainta in ing , locat ing, and migrat ing objects. Grouping transparency is an important property expected by both chents and servers. A client locates an object by submitt ing a name L o o k U p request to the name service. It is irrelevant to the client which server responds or whether a single or multiple servers handle the request. Each server manages its assigned port ion of the name space and responds only to requests for objects it knowns. It need not be aware that other servers exist. Nondeterminism in distr ibuted object naming stems from the fact that the global group state is partit ioned among members; therefore, a client does not know which server wi l l perform its request. A L o o k U p need not be multicast to name servers atomically, because it is an idempotent operation ^° and does not alter the name servers' state. Note, however, that the name binding update i n the name servers is a deterministic operation, and all replicated servers must execute the update operations atomically and i n the same order that names are bound. Grapevine [Schroeder et a l 84] adopted another way of supporting a distributed name ser-vice. It exemphfies the nondeterministic D H O group i n which the name space is fuUy repUcated at each name server. A server may play pr imary or secondary roles in name updates, however, even though every server supports the same set of operations. The Grapevine update algorithm does not guarantee a consistent view at all name replicas because name update broadcasts are neither atomic nor ordered. A client deals only wi th its local name server without seeing the whole name server group. Clients can detect and correct inconsistent and stale names in an ' " A n o p e r a t i o n is i d e m p o t e n t i f it c a n be e x e c u t e d zero or m u l t i p l e t imes w i t h o u t c h a n g i n g the state of the server. application-specific manner. B o t h examples should make clear that failure of any name server does not stop the activities at other servers. Contract B i d d i n g Contract b idding, another example of a nondeterministic OHO group, is a technique for resource sharing and load balancing among stations in a server pool . CUents submit job requests for servers to complete. The specific server station and the order of execution do not matter. Server group members do not maintain a global group state. A U worker stations are functionaUy identical and respond to idempotent service contract bids only on the basis of their local state. U p o n completing a task, servers return to the ready state for the next job assignment. The number of available server stations is always changing, as is their current load. To optimize overaU system throughput and mean response t ime, tasks should be scheduled to keep aU servers equaUy busy. ScheduUng is usuaUy done in one of two ways. In a cUent-initiated scheme, a cUent posts its task requirement for server stations in the pool to bid on. A n available server responds wi th its current load condition and the cUent then chooses the proper server to complete the task. In a server-initiated scheme, potential cUents form a group and an available server posts its request for loading to the cUent group. Each cUent responds with its task requests and the server chooses the appropriate one to execute. In both schemes a competitive orphan may be generated. In contrast to a failure orphan, generated when a server continues to execute the request of a dead cUent, a competitive orphan is generated when server group members are not properly coordinated.^^ In the server-initiated scheme, a competitive orphan request could be generated from a cUent that does not know whether its job request has already been carried out. A lso , in the cUent-initiated scheme, multicasting a job request could trigger multiple concurrent executions " An orphan is a request that its sender either has d e a d or lost interest . Fai lure o r p h a n s are a general p r o b l e m in the c l ient - server m o d e l ; c o m p e t i t i v e o r p h a n s are a s p e c i a l p r o b l e m in the group c o m m u n i c a t i o n c o n t e x t . W e c a n view a fai lure o r p h a n as result ing f r o m a lack of c o o r d i n a t i o n between the client a n d the server g r o u p : we c a n view a c o m p e t i t i v e o r p h a n as result ing f r o m a lack of c o o r d i n a t i o n a m o n g g r o u p m e m b e r s . at different servers, even wlien only a single execution is needed. The effect of competitive orphans must be reduced to a m i n i m u m and the competition losers must be notified so that they can participate in subsequent contract-bidding activities. A g a i n , the communication overhead between chents and the members of a server group should be minimized ; multicast should be efficient but need not be perfectly reliable. The failure of any one server should not terminate the whole system. T h e client whose job was being executed by the failed server must be notified to re-submit its request, however. News Propagation News propagation in U S E N E T exemplifies the nondeterministic H E X group. People subscrib-ing to the same news group (a particular topic of discussion) constitute a group, and the state and the operations each human subscriber performs differ. A news poster never knows who the members in a group are. News propagations need be neither atomic nor ordered. Respondents to a news article can use either one-to-one personal correspondence or follow-up articles to the same or a different group. Each person then decides what to do w i t h each news article and the foUow-up comments. A subscriber can come and go at wi l l ; no synchronization is necessary. S u m m a r y of Requirements Nondeterministic groups are intended to improve service performance. Through grouping transparency, a group service strives for the same simple syntax and semantics found in one-to-one interprocess communicat ion . Depending on the intended applications and the nature of the particular group, however, this transparency may be relaxed. Nondeterminist ic groups induce less overhead and generally have the following characteristics and requirements: Communicat ion transparency. The dominant communication pattern in nondeterministic group appUcations is request-response. UsuaUy, appUcations do not require absolutely reUable message deUvery or message ordering. Interactions wi th nondeterministic groups are inherently asynchronous because: (1) it is neither necessary nor reaUstic to e.xpect a client to wait unt i l al l server group members are synchronized and ready to receive a request; (2) server group membership normally is not known to clients; and (3) a server may not receive the request at aU. D a t a consistency is not a problem for various reasons: the application requests may be idempotent, group consistency may not be cr i t ical to the appUcation, or the appUcation is able to detect and recover from inconsistency easily. AppUcat ion programmers are given the flexibiUty — with the attendant complexity — of handUng par t ia l message failures. R e p l y handling transparency. M u l t i p l e repUes from a nondeterministic group may not be consistent. They may have to be handled by the cUents in an appUcation-specific manner rather than by the server group, as in deterministic groups; this sacrifices a certain degree of reply handUng transparency. A l so , an appUcation must provide its own timeout value for each multicast request because the communication layer itself does not know how long to wait for the server group responses. W h e n the timer expires, the appUcation may decide whether to re-multicast or to perform alternative actions. In deterministic groups, these actions are normaUy performed at the communication layer. N a m i n g transparency. As in the case of deterministic groups, appUcations prefer to use a logical name to address a group service rather than having to know each ind iv idua l mem-ber. To achieve naming transparency, both cUents and servers prefer to handle requests and repUes independently of the number of servers. The membership of a nondetermin-istic group can change dynamicaUy, however, and is usuaUy known neither to the active group members nor to the communication layer. Failure transparency. Nondeterministic group member failure has different semantics from that of deterministic groups. W h e n a member fails, other active members may trans-parently take over the uncompleted task to enhance availabiUty, share service load and reduce global communication traffic. Compet i t ive O r p h a n P r o b l e m . In a nondeterministic group, a new type of orphan, the competitive orpl ian . can be generated because of a lack of internal coordination among group members. Except for computer conferencing, most deterministic groups do not experience this problem because a group request must be handled by every member syn-chronously while the client waits for a reply from every server. Table 2.1 summaries the differences between the deterministic and nondeterministic categories. Deterministic Nondeterministic C o m m u n i -cation NormaUy requires atomic i ty and absolute ordering. Some appUcations require causal ordering. Not necessarily reUable, nor ordered. AppUcations handle inconsistency. N a m i n g Complete group-view at communica-tion layer is necessary. Membership changes need be synchronized with al l other group messages. Group-view usuaUy is not known to anyone, even to the communication layer. Reply handling If required, expect a l l members to reply. Group members handle inconsistent repUes without the involvement of the cUent. CUents must handle inconsistent repUes expUcitly and appUcations have to provide the timeout parameter. Failure handling To enhance reUabUity, a part ia l failure is turned into a to ta l failure in most cases. To enhance availabiUty, part ia l failures are usuaUy hidden by active group members. Others In real-time systems, a t iming fault may cause inconsistency. Competit ive orphan may arise due to a lack of proper coordina-t ion among group members. Table 2.1: Determinist ic vs. nondeterministic. It is worth noting that although we classify groups into deterministic and nondeterministic categories, there is a grey area between the two in which many appUcations may fit in . A n appUcation of groups may have to be deterministic in some aspects, but need not be that way in other aspects. The purpose of this classification is not to draw a Une between the two categories, but rather to provide guide-Unes for understanding different aspects of group communication and transparency requirements from appUcations. 2.4 Discussion The two classifications of groups previously discussed are based on different criteria, one on structure (data and operation homogeneous, operation homogeneous only, da ta homogeneous only and heterogeneous), the other on behavior (deterministic and nondeterministic) . B y or-thogonally projecting one over the other, as Table 2.2 shows, we hope to better understand how various group apphcations fit the classifications. Deterministic Nondeterministic D a t a and operation homogeneous ( D O H ) FuUy repUcated file systems (peer member scheme), repUcated procedure caU & fault tolerant state machine AppUcations such as the distributed time-of-day service in V system. Operation homogeneous only ( O H O ) Par t ia l l y repUcated file systems. Clouds ' distr ibuted name-server group & contract b idding. D a t a homogeneous only ( D H O ) Ful ly repUcated file systems (primary and secondary scheme). Grapevine's name-server group. Heterogeneous ( H E T ) Dis tr ibuted process-control , computer-conferencing & E - m a i l d istr ibut ion Ust. News propagation in a news-group of U S E N E T . Table 2.2: Sample appUcations based on the classifications. F i r s t , we can see that external uncertainties in most nondeterministic groups stem from the foUowing facts: • objects are distributed only to a subset of group members and the size and membership of this subset may not be known i n advance; and • even when objects are fuUy distr ibuted at aU group members, appUcations do not re-quire their values to be always consistent or accurate and missing group messages can be tolerated. Second, Table 2.2 shows that most applications of deterministic group require that messages are sent atomically and in order, regardless of the homogeneity of group structure. For instance, the communicat ion support for a deterministic D H O group is the same as for a deterministic D O H group. N o matter how differently each indiv idual D H O group member functions at a high-level, to guarantee consistency among replicas, changes to the objects must be propagated atomically and i n order. Consider part ia l ly replicated file systems as an example of the deterministic O H O group in Table 2.2. F i le system reliabil ity requires that for each update request, those and only those servers having the target file take action and respond. It is difficult to map a logical file onto an unknown number of file servers maintaining the physical replicas of the file. Thus , there exists a level of inherent nondeterminism in the group communication. If we had a separate server group for each replicated logical file object, we would end up wi th a fuUy replicated deterministic D H O server group for each file. This may not be necessary, however, and having many dynamical ly changing groups could be expensive. It would be preferable to insta l l software filters at the client and the servers to eliminate this s tructural nondeterminism. E a c h server filter would discard requests for nonlocal files. T h e client filter uses some mechanism (perhaps by consulting a name server) to determine the membership of the imphed subgroup (those having a copy of the target file) and to guarantee atomic message transactions only w i t h this subgroup. Once the implied subgroup membership is determined, the group transaction can proceed atomically. A n alternative would be to have every file server reply to every request; those not knowing the target file would simply reply wi th a " n u l l " message. The client would work on non-nuU replies using the knowledge of the whole file server group membership to eliminate the nondeterminism. This scheme trades extra host loading cost for structural determinism to gain rel iabil ity. It differs from broadcast in that : 1. only file servers pay host loading costs for each file access, and 2. being a system program, file servers are generally more trustworthy and therefore file transactions can be made secure. T h i r d , Table 2.2 shows that a D O H group can be used for both deterministic and nondeter-ministic apphcations. The same is true for the other three group structures. This suggests that it is the apphcation, rather than the internal structure of groups, that dominate the require-ments to group communication support. Performance of group management, although important to certain applications, is one aspect that has not received enough attention. To support deterministic groups, an I G C protocol, besides dehvering messages to members in the destination groups, must also guarantee ordered and atomic message transactions among the members in each receiving group. In this way, apphcation level programming can be simphfied. This extra effort is not necessary in an I G C protocol supporting nondeterministic groups only. This is because either the appUcations do not care about message reUabiUty or ordering, or the appUcations wiU take care of message reUabiUty and coordinate member actions themselves. When designing a group communication mechanism, trade-offs can be made based on the requirements of the intended appUcations, the message synchronization overhead in the I G C protocol , and the overhead of handUng reUabiUty and ordering issues at the appUcation level. There is yet another possible dimension of classification, based on the way group members cooperate internaUy to provide the expected behavior. For example in a deterministic D O H group, the single copy behavior may be guaranteed by using two-phase commit protocol, ma-jor i ty voting, weighted vot ing or primary-secondary scheme. Discussions along that dimension are beyond the scope of this chapter, however. 2.5 Related Work Several existing systems support group communication. Isis [B i rman and Joseph 87a] and Circus [Cooper 85] systems are pr imari ly intended to deterministic repUcated data objects or procedure module groups. [Birman and Joseph 87a] provides detailed design and analy-sis on protocols for atomic and ordered group communication. The V system [Cheriton 88b, Cher i ton and Zwaenepoel 85] and several other experimental systems [Hughes 88b] are exam-ples of systems support ing nondeterministic groups. Other work in understanding multicast includes [Mockapetris 83], which presents a general analysis of multicast mechanisms at the network level, rather than the application level. A multicast taxonomy is given in [Hughes 89a] on the basis of the number of replies to each multicast request. There is no general examination and classification of group communication requirements from the view point of applications, however. 2.6 Chapter Summary Before designing a general, coherent and integrated group communication system, we must understand how it wiU be used; that is, the basic application requirements. We have ana-lyzed different types of groups, along with their potential applications and classified group applications into two major categories: deterministic and nondeterministic. One important dist inct ion between the two is whether or not the group communication software needs to know the membership of groups to coordinate the actions of members. Orthogonally, according to the structure , process groups can also be classified as data operation homogeneous, operation homogeneous only, data homogeneous only or heterogeneous. A basic conclusion drawn from this analysis is that grouping transparency is important and desirable. W h e n integrated into the underly ing group support, it simphfies the interface between the server groups and their clients by hiding from the clients, as much as possible, the membership of server groups and the interactions among group members. Th is enables the designers of clients and servers to concentrate on the problems to be solved, as they do in the uni -cast environment, without concern for coordinating multiple servers. Grouping transparency is manifested in group communication, group naming , multiple-reply handling, group-view change and par t ia l failure. We hope this classification framework and analysis wi l l enhance the under-standing of process groups, group communication and some applications, thus aiding designers working wi th these mechanisms. Chapter 3 Internet Group Name Resolutions Groups exist natural ly in the internet. Teleconferencing in wide area networks ( W A N ) and electronic mai l d istr ibut ion Usts are two example apphcations. The second part of this thesis, starting from this chapter, wi l l be focused on an important aspect of group communication: group naming support in the internet environment. The naming mechanism discussed in the rest of this thesis is intended to support deterministic groups. The communication support require-ments for deterministic groups are stronger than those for nondeterministic groups. Therefore, more appUcations can be be supported by deterministic group communication. In this chapter, we investigate issues in name resolution for internet groups.^ In Section 3.1. the internet group naming structure is extended from the conventional flat structure to a nested structure to support efficient internet group communication and subdomain autonomy. In Sec-tion 3.2, a graph theoretic model of group naming is introduced. FoUowing this is a brief overview of different name resolution schemes and the formulation of the resolution loop prob-lem and the resolution dupUcation problem. A discussion of the correctness criteria that algo-rithms handUng these problems must meet is included in this section as weU. In Section 3.3. a brief taxonomy of various existing approaches to handUng resolution loops and dupUcations is discussed and the shortcomings of these existing methods are analyzed. Related work in the area is mentioned in Section 3.4. ' A p r e l i m i n a r y vers ion of C h a p t e r 3 a n d C h a p t e r 4 is p u b l i s h e d [ L i a n g et al 90b]. 3.1 Nested Group Model A n internet consists of a set of subnets connected by internet l inks. A n internet group is a group whose members are located in more than one subnet. Internet group communication refers to sending messages to internet groups. A subnet may be any type of local area network, such as Ethernet [Metcalf and Boggs 76] or Token ring [Bux et al 82]. A l though L A N s generally provide efficient message transmission, they may not directly support multicast . In the remainder of this thesis, we do not assume multicast support in subnet. Internet l inks usuaUy do not directly support multicast [Deering 90]. Compared to L.ANs. communication across internet Unks is usuaUy more expensive because of the low bandwidth and long delay in internet Unks. It is desirable for internet group communication to generate as l itt le internet Unk traffic as possible. It is also desirable for subnets (or administrative domains) to be autonomous in the sense that changes made within a subnet (domain) have min imum impact on other subnets (domains). For these reasons, we introduce the nested group model to internet groups. A n nested group is a group whose members are either processes or subgroups. The pro-cess members usuaUy are located in the same subnet. The subgroup members contain either processes or other subgroups. This nested structure need not be restricted to a single level. A group can have more than one subgroup and as many levels of nesting as necessary. When a group X contains another group y, x \s a, parent (or supergroup) of y and y \s a, child (or subgroup) of x.^ A group can have more than one parent groups. Each group is identified by a unique gid and appears in its parent group as a single member. The nested group model minimizes the group communication traffic on internet links as weU as the message deUvery time to internet groups. Instead of one copy of each message per member process, one copy per subnet is sent across internet Unks. The message deUvery time can be further reduced by allowing a subnet to use whatever multicast support available in that " T h e r e is no f u n c t i o n a l difference between a g r o u p , a s u b g r o u p a n d a s u p e r g r o u p . T h e terms s u b g r o u p a n d s u p e r g r o u p are used w h e n we need to emphases the p a r e n t - c h d d r e n re lat ionship between groups . subnet to distribute messages to local members. The nested group model supports subnet (subdomain) autonomy. B y grouping members of the same subnet (subdomain) as a subgroup, only the identifier of the destination subgroup is required to send messages. Furthermore, membership changes in a subgroup can be sheltered wi th in a subnet. In summary, the model of nested group is useful to internet groups not only because it simpHfies the user interface (by addressing each group using a single identifier) and improves system modular i ty (a nested group hides the internal details of a child group from its parents), but is also necessary as the membership of a group in a subnet may not be known to outsiders. 3.2 Group Naming Model Addressing a group by a logical group identifier (gid) (as opposed to ahst of indiv idual mem-ber process identifiers (pid)) allows the group to be managed/accessed v ia its gid without refer-ring to its membership. Since it is the indiv idual member processes that eventually accept and process a message, the membership must be somehow maintained either exphcitly or imphcit ly in a centraUzed or distributed manner [Birman and Joseph 87a, Chang and Maxemchuk 84b, Cher i ton and Zwaenepoel 85]. A t some point during the group communication process, the destination gid must be expanded into the set of member pids to deUver the message. This expansion process is called group name resolution. For example in Is is , group membership is consistently rephcated at every group member and cached by message senders [Birman and Joseph 87a]. A gid is resolved by the communica-tion software at the sending machine to a hst of receivers using a local membership cache. The list of receiver pids is then used in t ransmit t ing the message to the group members. Another example is the V system in which name resolution is performed at the destination machines [Cheriton and Zwaenepoel 85]. From a broadcast channel, a machine receiving a message wi th a recognizable gid expands the gid into a hst of pids of the member processes on that machine and dehvers a copy of the message to everyone in the Ust. 3.2.1 G r o u p N a m e G r a p h To better understand group naming problems in the internet, we develop the following name graph model . A grammar model for group naming is included in A p p e n d i x A . Nested groups in a distr ibuted system can be characterized by a simple directed graph called a name graph? Each node in the name graph is uniquely labeled by either a gid or a p id . A node labeled by a gid is a group node representing a group in the system; a node labeled by a pid is a process node representing a single process. A n arc < a;,y > in a name graph goes from a parent node x to an immediate child node 2/ if 2/ is a member of a;.^ A child can be either a group or a process node. Process nodes are leaf nodes wi th out-degree zero because they do not have children; group nodes are internal nodes w i t h one or more children (assuming groups are not empty) . A node may have multiple parents. In that case, it has multiple in-arcs, one from each parent. A group that is not a subgroup of any other group is a top-level group, and its corresponding node in the name graph is called a root node. Given a name graph, a part ia l ordering from ancestors to descendants is defined by the transit ive closure of the relationship given by the arcs. To ensure that the simple directed name graph model sufficiently represents the naming structure of nested groups in the real world, we assume that : 1. a group never includes itself directly as a subgroup (i.e., no self-loops), and 2. a group does not include the same subgroup more than once (i.e., no multiple edges). Obviously, these two assumptions do not l imit the way that groups may be structured. Intuitively, in a name graph the member processes of a group x are the process nodes reachable from node x. A node y is reachable from a node x , denoted a,s x ^ y, i f there is a directed path from x to y in the name graph. The name graph of a group a; is a subgraph of the global name graph. It is rooted at node x and consists of aU nodes and arcs reachable from x. simple g r a p h is a g r a p h w i t h no m u l t i p l e edges or self - loops. *It is i m p o r t a n t not to confuse arcs i n a n a m e g r a p h w i t h p h y s i c a l l inks of the n e t w o r k . T h e former c o r r e s p o n d to c o m m u n i c a t i o n p a t h s t h a t m a y consist of several p h y s i c a l l inks . Corresponding to the group management operations in real systems, the following operations can be defined on a name graph and executed at group nodes: • create(x) — creates a node x in the name graph when a new group x is set-up; • delete(x) — deletes node x from the name graph when group x is dismissed; • join(x, y) — adds arc < x,y > into the name graph when a group/process y becomes a new member of group x (consequently, al l members of group y become descendants of x when the operation completes); • leave(x, y) — deletes arc < x,y > from the name graph when a group/receiver y rehn-quishes its membership in a group x ; and • name-resolution(x) — obtains the set of identifiers of the member processes in group x. 3.2.2 G r o u p N a m e R e s o l u t i o n E a c h arc in a name graph represents one level of indirection in group naming. G iven a name graph, the name resolution procedure executed at a node x maps gid x onto the set of identifiers of a l l immediate children of x. The name resolution process for a group x can be viewed as recursively traversing the subgraph rooted at x along aU out-going arcs (possibly in parallel) and executing the name resolution procedure at each reachable group node hop-by-hop unt i l all process nodes are reached. The result of name resolution for a group x is defined as NR(x) = {pid I pid of the process reachable from x in the name graph}. Paths traversed in resolving group x are called name resolution paths of x. To send a message to a group x is to dehver the message to every member in x. There-fore, as part of the send operation, the name resolution process for x is invoked to obtain x's membership. The basic requirement that the send operation imposes on the name resolution mechanism is: B l : The result of a name resolution process for group x is correct if it satisfies the condition that NR{y) C NR(x) if and only i f a; A y in the name graph. Assuming the absence of communicat ion failures, B l guarantees that each message to a group X is dehvered to every member of x. There are two approaches to doing name resolution: centralized source name resolution and distributed name resolution, depending on how and where name resolution is performed. In the centrahzed source name resolution, the destination gid is completely expanded into a set of receiver pids (called the distribution list) at a single location (either at the name resolution server or at the sender itself) before the group message is transmitted . Groups in Isis and Circus are examples of this approach [Birman and Joseph 87a, Cooper 85]. Distributed name resolution does not fully expand a gid at a single site before message transmission. Membership of a group is distributed across the network, and the group membership is only known to the name resolution server local to that group. A group is identified by a single id outside the subnet and is included as a single member in its parent groups. Group name resolution occurs incrementally as the message propagates to group members. A gid is expanded only when the message is going to the next hop. [Cheriton and Deering 85, Cheri ton and Zwaenepoel 85] and [Frank et al 85] provide examples of this approach. 3.2.3 R e s o l u t i o n L o o p s A cycle is a single path loop in a name graph. Two cycles are chained together if there is a path from some node in one cycle to some node in the other and visa versa. In a name graph, a resolution loop ( R L ) contains al l cycles chained together. A n R L is a strongly connected component in terms of graph theory. Two groups are deep equal i f their name resolution results are identical . Deep equal groups share exactly the same set of member processes although their name resolution paths might be different. Nodes that are deep equal because they are in the same R L (hence, are mutually reachable through the loop) have the relationship of loop deep equal. We shah denote this relationship by L e m m a 3.2.1 Loop deep equal is an equivalence relationship; it is reflexive, sym-metric and transitive. Proof : This l e m m a follows from the fact that a l l loop deep equal nodes are residing in the same strongly connected component in a name graph [Even 79]. • The name resolution process for a group x is non-terminating if and only if an R L exists in the name graph of x. In this case, name resolution requests circulate around the R L and and saturate the network. It is absolutely necessary to control R L s for group commuidcation to function correctly. Besides requirement B l , a name resolution mechanism capable of handUng R L s must : R l : terminate each name resolution process in a finite number of steps, and R 2 : preserve the loop deep equal relations defined by R L s . Requirement R 2 ensures that users are given the flexibility to take advantage of loop deep equal relation of R L s i f they want. A s far as group name resolution is concerned, name expansion is not performed on a p id and no R L contains a process node. Therefore, we can ignore process nodes and the arcs leading to them to simplify the name graph. Such a simplified graph is called a reduced name graph. In the foUowing sections, the term "name graph" means a reduced name graph. 3.2.4 D u p l i c a t i o n L o o p s In a name graph, a set of paths from a node to another node is caUed a segment i f the last hop in these paths are the same. T w o paraUel segments from a node to another node are distinct i f their last hop are different. A duplication loop DL{s,k) is a subgraph identified by two nodes — a source s and a sink k. It consists of aU the segments from s to k and satisfies the foUowing conditions: • the to ta l number of distinct segments from 5 to is greater than one, and • at least two immediate children of 5 in these segments are different. Note that a segment may have some R L s and D L s embedded in i t . A n R L / D L is embedded in a segment from s to k i f a l l the nodes in that R L / D L are reachable from s and if k is reachable from all these nodes. The existence of a DL(s, k) is a necessary and sufficient condition for resolution dupUcation to occur,^ because if no dupl icat ion suppression is enforced, a message to or routed through s wil l reach k through more than one s —>• k segment, resulting in resolution duplications at k as well as at the descendants of k. O n the other hand, assuming R L s are properly handled and no D L exists, there wil l be at most one path between any pair of nodes in a name graph and each message can be resolved at a node at most once. A l though D L s are the natura l results of subgroup sharing, duplication suppression improves the efficiency by preventing a message from being transmitted more than once in the subgraph rooted at the sink. Besides requirements B l , R l and R 2 , other requirements for a name resolution mechanism capable of suppressing name resolution duphcations are: D l : a message sent to or routed through the source of a D L should be resolved at most once at every node in the dupl icat ion loop (including the sink) , and D 2 : a message sender should not be prevented from retransmitt ing its messages. Requirement D2 comes from the observation that retransmission is one of the methods to guarantee at least once semantics in many protocols. Each retransmitted message should be delivered to all the members of the destination group independent of previous transmissions. 3.3 A Taxonomy on R L and D L Control In this section, we survey existing methods of handhng resolution loops and duplications and point out their shortcomings. ' D u p l i c a t i o n s due to message r e t r a n s m i s s i o n s are not cons idered as reso lut ion d u p l i c a t i o n s a n d d u p l i c a t i o n s due to R L s are c o n s i d e r e d separate ly . Essentially, the methods can be classified into centrahzed and distributed approaches de-pending on what name resolution scheme is used. The distributed methods can be further classified into dynamic and static approaches, depending on when resolution and duphcation loops are detected and controUed. We do not consider disaUowing loops a reasonable solution, because it simply shifts the responsibiUty of loop control from the system to the users. A lso , excluding D L s rules out the possibiUty of sharing groups in a system. It is worth noting that even though dupUcation sup-pression can be achieved at the end receivers using the tradit ional sequence number technique, propagating dupUcates wastes internet Unk bandwidth. 3.3.1 C e n t r a l i z e d A p p r o a c h The centraUzed approach uses centraUzed name resolution. Resolution loops can be easily detected since the site that performs name resolution has complete knowledge of the receiver group membership. A lso , resolution dupUcations can be suppressed by removing aU dupUcated pids from the distr ibution Ust. The problem is that either a separate copy per destination is sent, thus wasting internet Unk bandwidth , or the distribution Ust is included in each message, resulting in a variable message header and complex receiver and forwarder communication software [Waters et al 84]. Furthermore, the overhead of name resolution depends on how the name graph is maintained. If the name graph is stored in a centraUzed site, it is vulnerable to site failure. If the name graph is distributed in the network, it takes time to coUect name bindings. For the purpose of efficiency, techniques Uke name caching or name server rephcation may be used to avoid overloading the central server. Consistency among name caches and repUcated servers has to be maintained. Another problem is related to security. To resolve a gid, the site that performs name resolution must have read permission to aU the subgroup membership Usts of the destination group [Deutsch 84]. This is a stronger requirement than a simple reference permission. ' ' In ' ' T h r e e types of p e r m i s s i o n c a n be def ined for g r o u p m e m b e r s h i p access, (i) reference: the abi l i ty to refer to a a network environment where personal workstations can be completely under the control of individual users, this may e.xpose confidential group membership to bogus users. 3.3.2 D i s t r i b u t e d D y n a m i c A p p r o a c h The distr ibuted approach uses distributed name resolution. Aiany problems in the central-ized approach can be simphfied or avoided. For instance, if the nested group model is adopted, name resolution for a subgroup can be done wi th in a subnet and the indiv idual members of a local subgroup are not exposed to anyone outside the subnet. A lso , it is only necessary to send a single copy of the message per subnet, incurring much less internet link traffic compared with the centraUzed approach. Furthermore, because name bindings are maintained by distributed name servers in the subnets, achieving consistency can be confined to the subnet. On the other hand, since none of the name servers has a complete and up-to-date picture of the global name graph, R L and D L control is more difficult than for the centraUzed approach. In distributed dynamic methods, loops are detected and action taken at the time of message transmission. BasicaUy, there are two types of dynamic methods: message-based and node-based. The former stores the name graph traversing history in the message itself; the later saves this information at each group node that the message has visited. Message-Based Scheme In a message-based scheme, a Ust is carried with each group message. This Ust contains the identifiers of the subgroups that have been expanded so far. Before expanding a group x. this list is examined. If x is found in the Ust, an R L is detected and the message is discarded. This scheme requires extra communication bandwidth to carry the Ust with each message. Because the Ust is dynamicaUy expanded at each node along the name resolution path , a variable length of group message header results. A l so , since a node discards a message only if its gid a r o u p t h r o u g h its gid w i t h o u t f inding out the i n d i v i d u a l m e m b e r s in the g r o u p : l i i) read: the abi l i ty to d e t e r m i n e 'he m e m b e r s h i p of a g r o u p identif ied by a given g i d ; a n d (iii) write: the abi l i ty to m o d i f y the m e m b e r s h i p of a group identi f ied by a g iven g i d . is contained in ttie list in the message, duplicated copies reaching a node v ia different paths cannot be suppressed. Node-Based Scheme Compared to the message-based scheme, a node-eased scheme trades memory cost for shorter and fixed length messages. Messages are uniquely identified and each node remembers the identifiers of the message that have been expanded. A n R L is detected when the same message needs to be expanded again. The duplicated message is then discarded. Because this scheme depends on message identifiers, rather than on the path that a message has traveled, duplicate messages wi th the same message identifier wi l l be suppressed at a sink. The node-based scheme prevents message retransmissions, however, since retransmitted messages bear the same identifiers as their originals. ' A remedy to this problem is to allow a message to be resolved at most k(> 1) times through each node,^ or to discard remembered messages in a regular interval . In both cases, the possibihties of resolution loops and duphca-tions are not completely el iminated. A n alternative is to distinguish a retransmitted copy from the copies generated by the R L or D L . One way of doing this is to associate a retransmission counter with each message. This counter is set to zero in the original copy and is incremented by one at the message sender when the message is retransmitted. Under the assumption that message identifiers are globally distinct, a group member could use the message identifier to detect retransmissions of a received message. Nodes along the name resolution paths treat messages wi th the same identifier but different values of retransmission counter as different messages. A more serious problem with node-based schemes is that it can only be approximated in practice, since theoretically it requires every node that has subgroups to remember the identifier of every message it has ever expanded. This information must be kept unti l notification of ' D e t e c t i n g d u p l i c a t i o n s at leaf nodes so that d u p l i c a t e messages are not seen by g r o u p m e m b e r s is a variat ion of the n o d e - b a s e d s c h e m e . " T h e integer A: c o u l d be the m a x i m u m n u m b e r of re transmiss ions al lowed by the p r o t o c o l before r e p o r t i n g an except ion . termination of tlie message transaction is received. Alternat ive ly , if message lifetime can be bounded, the name expansion record of a message can be cleared after its Ufe has expired. Unfortunately, message hfetime usually cannot be predetermined in practice. Furthermore, for efficiency reasons, message expansion records need to be kept in R A M for quick reference. In practice, it is unreahstic to have an arbitrary large R A M . R A M message buffers often have to be cleared on a regular basis, which requires an intelligent distr ibuted garbage collection algorithm. This can be difficult as the algorithm must adapt to the dynamic changes of internet group configuration as well as the variable network traffic load. One approximation of the node-based scheme is to mainta in a F I F O queue of messages at each group node. A message is resolved i f its identifier is not in the queue, and its identifier is saved at the end of the queue after the expansion. For a queue of size k, it can be guaranteed that the last k messages wiU not be duplicated. The value of k can be decided on the basis of experiment. A s k goes to infinity, one can approximate the original node-based scheme. The key problem is how to decide k dynamical ly to adapt to the changing network traffic load and interconnections. Unless k is very large, the possibihty of resolution loop and duphcations is not completely removed. Besides these pitfalls in dynamic methods, a performance consideration is that a long fist may have to be searched at every node for every message to detect resolution loops and duph-cations, causing run time overhead and delay. The advantage of dynamic methods (over the static method discussed later) is that the name graph update operations are simple and their latency can be made small [Deering 90], since no special act ion needs to be performed and no global knowledge is required. 3.3.3 D i s t r i b u t e d S t a t i c A p p r o a c h Static loop avoidance methods do not require every message to be buffered or examined at every node along the resolution path , they do not record in each message the set of nodes resolved so far either. Instead, each node staticaDy saves some topology information about the name graph and detects R L s and D L s on the basis of this information when the name graph is modified v ia join and leave operations. To control R L s and D L s . the name graph is transformed into a special structure that , together w i t h the loop avoidance and duphcation suppression algorithms, meets the requirements specified in Section 3.2. The rationales of the static approach are based on the following observations: • A s long as the final result of the result of name resolution process for the destination group is correct (i.e., satisfies B l ) , it does not matter to a group message sender how this result is obtained or what name resolution paths are taken. • Name graph modifications generally occur much less frequently than name resolutions. Accord ing to statistics from C D N n e t [CDNnet 90], the ratio of electronic mai l name resolutions to email distribution Ust updates was more than 60:1 in February 1990.^ Pr ivate communications with B I T N E T managers also confirmed this observation. Static methods completely remove resolution loop and dupUcation effects, but dynamic methods can only approximate this. Static methods require no heuristic to estimate how long a message identifier has to be buffered at a node, or the management of a message buffer at each node for name resolutions. The name graph topology information for the purpose of loop detection can be saved on disk because it is needed only when the name graph is updated, rather than dur ing message transport. W i t h the static approach, group name resolution is less expensive as messages do not have to be checked at every node to determine whether they have been resolved as in the dynamic cases. Therefore, better group message transport performance can be expected than is the case of dynamic schemes. In high speed networks, it is particularly desirable to reduce processing overhead of control information dur ing data transfer, since in such networks, the performance bottleneck has shifted from the network to the nodal processing required to execute communication protocols in workstations and servers. A node with Umited processing power must minimize the processing ^It has been o b s e r v e d t h a t m o r e updates o c c u r d u r i n g the e a r l y s tage of a l ist 's l i fetime before it becomes stable . It is also o b s e r v e d that m o s t u p d a t e s in the early stage of a n e m a i l d i s t r i b u t i o n list a d d new s u b s c r i b e r s . A sublist is f o r m e d o n l y w h e n a significant n u m b e r of subscr ibers f r o m the s a m e o r g a n i z a t i o n are co l lected . T h e s e were o b s e r v e d f r o m a large n u m b e r of C D N n e t e m a i l d i s t r i b u t i o n l ists . overhead per message to avoid congestion and to take advantage of the large bandwidth of high speed networks. Therefore, the property of lower processing overhead during message transport makes static methods more suitable to high speed networks. The negative side is that modifications to the name graph are expensive in static methods because of the overhead i n loop detection and the overhead in propagating the name graph topology updates to descendants. Since modifications generally occur much less frequently than name references, the overall average overhead can be expected lower than in the dynamic schemes. In other words, the static scheme trades higher overhead in name graph updates for cheaper and faster group name resolution performance. 3.4 Related Work [Birman and Joseph 87a, Benford and Onions 87, Chang and Maxemchuk 84b, Cooper 84b] and [Cheriton and Zwaenepoel 85] present some general discussions on existing support for mul -ticast or group communicat ion. In most systems, group communication is Umited to single level groups. One of the contributions i n this chapter is to organize internet groups according to the nested group model to achieve communication efficiency and subnet autonomy. Compared to the tree structure of host groups [Cheriton and Deering 85] and channel groups [Frank et al 85], our model is more flexible in aUowing groups to be structured as an arb i trary directed graph. The name graph model is a global abstraction independent of how each node is physically supported; host and channel groups can be viewed as specific implementation techniques for maintaining group membership information in a subnet. [Benford and Onions 87, Deutsch 84] describe some work in dynamic resolution loop detec-tion in the context of electronic m a i l distribution lists. 3.5 Chapter Summary Al lowing nested groups in an internet environment provides a structured way of organizing internet groups, thus support ing subnet autonomy and reducing internet link traffic. The problems of resolution loops and duplications result, however. We identified these problems using a formal group name graph model and analyzed the correctness criteria of the algorithms handUng these problems. We also classified different methods into centrahzed and distributed approaches, depending on how and where name resolution is performed, and further classified distributed methods into dynamic and static categories, depending on when loops are detected and control actions taken. B y analyzing the shortcomings of existing methods, we find that none of the existing centraUzed or distributed dynamic approaches work correctly or satisfactorily. We also expect that static distributed methods maybe more efficient than dynamic methods when group name resolution occurs much more frequently than changes in group membership. In this case, which is common in practice, making the frequent operations as efficient as possible reduces the overall system overhead and delay. These observations motivated us to work on the static method presented in the remainder of this thesis. Chapter 4 Spanning Shadow Tree Algorithm In this chapter, we recognize name resolution consistency as the correctness requirement for distributed static methods that handle R L s and D L s . We then design a spanning shadow tree algorithm that saves the name graph topology to detect R L s and D L s when the name graph is modified v ia the join and leave operations. Whi le preserving name resolution consistency, update operations transform the system-level representation of the name graph into a special structure to avoid resolution loops and suppress duphcations. This chapter is organized as follows. Name resolution consistency is defined in Section 4.1. The basic idea of removing the effect of R L s and D L s are discussed in Section 4.2 and Section 4.3 respectively. In Section 4.4 and Section 4.5, we describe how to statically save the name graph topology and how to use the saved information in detecting R L s and D L s . The join and leave operations are presented in Section 4.6 and Section 4.7 respectively and their correctness is argued in Section 4.8. A brief overview of related work is given in Section 4.9. We assume that no specific order (required from the user level) is imposed on the traversal of nodes in an R L . To simphfy the discussions, we also assume that the network is rehable and that name graph updates do not occur concurrently. Removing the last two assumptions wil l be the subject of Chapter 6. 4.1 Name Resolution Consistency Define the external group view ( E G V ) as the name graph perceived by omniscient users having global information, and the internal group view ( I G V ) as the internal representation of the name graph maintained by the naming system. The naming system implementing static R L / D L control algorithms must guarantee that the results of name resolution based on the E G V and the I G V are always consistent, so that users need not be aware of the existence of the I G V graph. This is called name resolution consistency. In the following discussion, we simply use the E G V (or I G V ) to refer to the E G V (or I G V ) graph. A naming system generally has two parts : name resolution and name graph maintenance. The former interfaces wi th users and performs name resolutions using the I G V . The latter interfaces with system administrators through join and leave operations to change the name graph and mainta in name resolution consistency between the I G V and the E G V . In centraUzed or distributed name resolution that uses dynamic R L / D L control schemes, the E G V and the I G V are identical. Join/leave operations simultaneously modify both views as discussed in Section 3.2.1. In static schemes, however, the topology of the I G V maybe different from that of the E G V since join/leave operations rtiay have to modify the I G V in a way different from what the user perceives in modifying the E G V in order to avoid R L s . Let (N,A) denote a di-graph wi th a node set N and an arc set A . A n IGV = (NIGV-AJOV) is name resolution consistent with, respect to an EGV - { N E G V , A . E G V ) if: C I : NiGV = NEGV\ and C 2 : for any pair of nodes x and t / , there is an x A name resolution path in the I G V if and only if there is an x A y path in the E G V . W h e n C l and C 2 are met, the results of name resolution for the same group in both views are identical. The only difference between the I G V and the E G V is that name resolution paths may traverse the same set of group nodes in a different order in each view. Because this order is unimportant to users as far as name resolution is concerned, the I G V is transparent. 4.2 Handling Resolution Duplications Intuitively, a good place to enforce resolution dupl icat ion control ( R D control) in a DL{s. k) is at the sink node, where the duphcations caused by DL{s,k) are first observed. Since the eff'ect of DL{s,k) takes place only when a message originates at 5 or at the ancestors of s, it is difficult to bui ld a static structure in the name graph to meet requirements D l and D2 . We take the foUowing semi-static approach: • Every message carries the gid of its originator group — the group to which the message was first sent (from the user). • In each D L , one of its segments is selected as the primary, the others as secondary. Accordingly , to shorten the notat ion, the immediate parents of the sink in these segments are caUed primary or secondary parents (in this DL) .^ To decide when to take the R D control action for a DL(s,k), a node maintains an R D control record containing a Ust of ancestors of the source (including the source). Dur ing name graph updates, this Ust is computed by the sink (discussed later) and distributed to aU secondary parents of the sink in DL(s,k). W h e n such an R D control record is maintained for an arc leading from a secondary parent in DL(s, k) to the sink, we say that arc is RD controlled for DL(s, k). • The name resolution procedure is changed so that if the originator group of a message is found in the R D control record of an arc, the message is not forwarded through that arc. Figure 4.1 shows some examples in which the dashed arcs are R D controUed. The notation < DL(s,k) : ai,---,ak > shows the R D control record for DL(s.k). in which " a i , - - - . a f c " is the Ust of ancestors of the source node. In Figure 4.1.(A). DL(a.f) is controUed at node c, DL(a,e) is controUed at node d, and DL{d,g) is controUed at nodes e and / . The R D control records for DL{a,g) at nodes e and / are not shown in the figure because they contain only a subset of the R D control record for DL{d,g). " T h e c o m m u n i c a t i o n cost of the last h o p in each s e g m e n t m a y be taken into account in select ing the p r i m a r y . « Figure 4.1: Examples of duplication loop control . A DL{s', k') is embedded'm another DL{s., k) i f both its source 5 ' and sink k' are in DL{s, k). Figure 4.1.(B) shows an example where the DL{c,g) is embedded in DL{a,g). DL{c,g) is R D controlled at node / , DL(a,g) is R D controlled at nodes / and e. One must be careful i n selecting the pr imary segment for D L s that have the embedding relationship and share the same sink node. For example in Figure 4 .1 . (B) , i f R D control for DL(a,g) is performed at nodes d and e and R D control for DL{c,g) is performed at node / , messages to or routed through node a w i l l never be deUvered to node g. The reason is that the R D control record for an embedded D L contains the source node of the embedding D L . To avoid this problem, a sink k must select its primary parent in DL{s, k) as its primary parent in DL{s',k) if s s' in the name graph. This R D control scheme preserves name resolution consistency because it does not change the name graph. Whi le traversing the name graph to perform name resolution, the R D control algorithm dynamically enforces the message propagation paths to be a spanning tree in the subgraph rooted at the message originator so that each node in that subgraph is visited exactly once. When a message is routed through the source of a D L in that subgraph, the R D control procedure removes the last hop leading to the sink from every secondary segment in that D L and only allows the message to be propagated to the sink once through the p r i m a r y segment. In this way, resolution duplications are suppressed. Furthermore, this scheme imposes no restriction on message retransmission. It is worth not ing that the secondary parents of the sink in each D L have to do a table look-up when resolving a message, and some amount of R A M is required at these nodes to store the R D control record for efficient name resolution. The length of the hst, however, is finite (less than the t o ta l number of nodes in the name graph) and normaUy is smaU compared to the message Ust i n the node based scheme since generaUy the name graph is not very compUcated. Furthermore, except for the secondary parents of a sink node, other nodes do not pay such memory and run- t ime overhead. 4.3 Handling Resolution Loops W h e n adding a new arc < x , y >, an R L is generated i f and only iiy ^ x. This R L consists of aU the nodes along the y —>• x path . W h e n a D L or any existing R L is embedded in the y —>• x path , multiple cycles wiU be contained in this new R L . These cycles are chained together. The basic idea of removing the R L effect is to replace each cycle by a chain of two-node cycles and enforce control on accessing these two-node cycles. We define a shadow edge as a bidirectional arc and a shadow path as a path consisting of shadow edges only. Observe that nodes in a shadow path are loop deep equal because each shadow edge in fact is a two-node cycle and a shadow path is a chain of such cycles. A cycle wi th more than two nodes can be replaced by a shadow path with nodes in the cycle remaining loop deep equal. O n the basis of this observation, an < a;,?/ > arc is not added in the I G V i f its addition generates any cycle. Instead, the exist ing y ^ x path in the I G V is converted into a shadow path . The name resolution procedure is modified to control the R L effect in a two-node cycle. Instead of resolving a message to aU out-going arcs as before, each node in a name graph adopts the new resolution procedure shown in Figure 4.2. If aU cycles are replaced by shadow paths, then under the conditions that ^ R D c o n t r o l is s t i l l p e r f o r m e d o n each o u t - g o i n g arc based o n the message o r i g i n a t o r a n d the R D c o n t r o l r e c o r d a s s o c i a t e d w i t h the a r c . < A s s u m e a m e s s a g e c o m i n g in f r o m an arc / > for each s h a d o w edge or o u t g o i n g arc /' in t h e I G V do if (/ is a s h a d o w e d g e and / ' ^ /) or ( the o r i g i n a t o r o f t h e m e s s a g e is n o t in a n y R D c o n t r o l r e c o r d m a i n t a i n e d for /') then f o r w a r d t h e m e s s a g e a c r o s s /' Figure 4.2: The new name resolution procedure. 1. no other path exists between any two nodes on the same shadow path , and 2. everyone in the name graph follows the new resolution procedure, we can proof the following theorem: T h e o r e m 4.3.1 Messages to any node in a shadow path wi l l be resolved at every other nodes in the pa th once and only once. Proof: First observe that the new name resolution procedure does not have any effect on messages coming in or going out through normal arcs, that is , converting a cycle into a shadow path does not affect name resolutions at nodes outside the cycle. Second, the above argument can be proven by induct ion on the distance (number of arcs) between a pair of nodes, say x and y, in a shadow path . W h e n the distance is one [x and y are connected by a single shadow edge), a message to x w i l l be deUvered to y and y wiU not send it back to x through the same arc according to the new procedure. Because there is no other path between y and x (condition 1), both X and y resolve this message once and only once. Assume the statement is true when the distance between x and y is n {> 1). Consider the case when the distance is n + 1. In the shadow path between x and y, there is a node z whose distance from x is n, and the distance between z and y is one. Accord ing to the induction hypothesis, a message to x wiU be resolved at z once and only once. After it reaches z, this message wiU be resolved at y once and only once as weU according to the induction base. • Given the above results, we only need to guarantee that aU nodes involved in an R L are connected by a shadow path and that no other path exists between them after the shadow path construction. Mult ip le y ^ X paths are involved in a loop if at least one D L is embedded in the loop. If every arc in an embedded D L were shadowed, a shadow loop would be generated, since a D L consists of more than one segment between its source and sink and every segment becomes a bidirectional path after shadowing. On the other hand, when no D L is embedded in a loop, there is only a single path between any pair of nodes involved in the loop; thus, only this single path is shadowed. To handle the case when embedded D L s are involved, we extend the shadow path idea by constructing a spanning shadow tree among al l nodes in al l existing y ~ x paths such that there is only a single shadow path between any pair of these nodes. The nature of a shadow edge guarantees that nodes in the tree remain loop deep equal and the one-way propagation property of the new resolution procedure prevents infinite resolution effects in the shadow paths. Therefore, requirements R l and R2 are satisfied. The shadow tree algorithm preserves name resolution consistency. Because a l l nodes connected by an R L in the E G V are connected by a shadow tree in the I G V , these nodes are mutual ly reachable in both views. A l so , the shadow tree algorithm does not add any arc, nor does it delete an arc if the arc is not in an R L . Therefore, the reachabihty between any nodes in the name graph remains unchanged and hence requirements C l and C2 are satisfied. Figure 4.3 shows an example of the shadow tree scheme. In Figure 4.3.(A), the addition of arc < k.a > completes an R L consisting of aU the nodes and arcs shown in the graph. There are four duphcation loops DL{b,g), DL{c,f), DL{d,g) and DL{h,j) embedded in the R L . Figure 4.3.(B) shows the result of spanning shadow tree construction. 4.4 Representing Topology of Ancestors R L s and D L s have to be detected before the above schemes can be used to deal with them. To detect an R L when adding an arc < x,y >, node x must have some way of knowing if any y — X path exists in the name graph. If that is the case, it needs to know aU the nodes in this path to construct the shadow tree. Also , after adding an arc < x , t/ >, a node in the subgraph of Figure 4.3: A n example of the shadow tree scheme. y has to determine whether it has become the sink of a new D L or whether a new segment has been added to a D L of which it is the sink. If so, it has to compute the R D control record and distribute this record to its secondary parents in the D L . In summary, R L and D L detections require a node to know the topology of its ancestors. A mechanism to record this topology is described in this section. 4.4.1 V i r t u a l N o d e a n d D e r i v e d N a m e G r a p h In a name graph, a physical node refers to a single group node; a virtual node ( V N ) refers to a strongly connected component in the graph. A V N consists of nodes that are loop deep equal and the arcs connecting these nodes. It corresponds to an R L in the E G V or a shadow tree in the I G V . A V N has a unique id (vnid) similar to a gid. Nodes in a V N form a loop deep equal set and are called the components of the V N . In this thesis. R L and V N are used interchangeably. In a name graph, i f each strongly connected component is substituted by a V N , the resulting graph is called derived name graph. The properties of V N and the derived name graph are summarized below: • A derived name graph is a directed acychc graph ( D A G ) . • Since nodes inside a V N are loop deep equal, the ancestor-descendant relation in a name graph is completely defined by the arcs across different loop deep equal sets (i.e., the arcs in the derived name graph) . • The internal connection structure of a V N is immater ia l as far as R L and D L detections are concerned, hence need not be known to anyone outside the V N . • The components of a V N can be arbitrar i ly connected as long as they remain mutually reachable (for name resolutions). The second and th i rd points suggest that it is sufficient for a node to keep the topology information of its ancestors in the derived name graph to detect new R L s and D L s . The last point suggests that the name graph maintenance mechanism may choose whatever I G V structure to connect components in a V N as long as its derived name graph is the same as that of the E G V . Spanning shadow tree is an example. In the foUowing discussions, an arc is in the normal state i f it connects two nodes that are not loop deep equal. A n o r m a l arc is unidirectional. There is a one-to-one correspondence between the normal arcs in the name graph and the arcs in the corresponding derived name graph. A n arc is i n the shadow state i f it connects two nodes in an R L in the E G V and is part of the spanning shadow tree of the V N in the I G V . A shadow edge is bidirectional. Name resolution on a shadow edge foUows the procedure in Figure 4.2. A n arc is in the fade state if it connects two nodes in an R L in the E G V but is deleted from the I G V to avoid shadow loop (when spanning shadow trees are constructed). Fade arcs are ignored in name resolutions. A shadow edge or a fade arc is restored when its state is reset to normal . 4.4.2 P a t h n a m e s A pathname scheme is adopted for a node to Iceep track of its ancestors' topology in the derived name graph. A pathname P is defined as a sequence of gids, P =< xi,X2, • • •, x^ >, representing an x\ A x „ path in the derived name graph. Given a name graph, pathnames are constructed and assigned to a node as follows: 1. Each node w i t h in-degree zero has a pathname consisting of its g i d / v n i d only. 2. A node x has a pathname < Pxp,x > i f and only if an immediate parent Xp of x has a pathname F^^; that is, x inherits the pathnames from its parents. 3. In pathnames, the nodes in an R L are represented by a single V N . Components of a V N share the same set of pathnames. Identifying an R L as a V N hides the V N ' s components and their connections from their de-scendants, hence, simphfies the pathnames of the descendants. The lengths of pathnames are finite because the derived name graph is a D A G with a finite number of nodes. A s far as the descendants are concerned, there is no need to distinguish between a V N and a physical node in pathnames for the purpose of R L / D L detection. Figure 4.4 provides examples of pathnames, in which gids/vnids are shown in bold face and pathnames are included in angle brackets. Figure 4.4.(B) shows how a shadow tree is represented by a V N t. The collection of pathnames maintained in a node is caUed the pathname set of that node. Obviously, a l l components in a V N share the same pathname set. B y constructing pathnames in this way, it is clear that : 1. The pathnames of a node x terminate wi th the gid of x. 2. A l l components in a V N share the same set of pathnames (terminated by the vnid) . 3. For each path from a root node to x in the derived name graph, x has a corresponding pathname. a <a> <a.b> b <DL(a,(l):a> <a,b,d> ^ <a,c,d> ^ <a,b,d,f> <a,c,d,f> <a,b,d,f,h> <a,c,d,f,h> <a.b,d,g,h> <a,c,d,g,h> <a,c,e,g,h> c <a,c> e <a,c,e> <DL(c,g):a,c> g <a,b,d,g> <a,c,d,g> <a.c,e,g> <DL(d,h):a,b,c,d> <a> <a,b> <a,b,c> <DL(b,t):a,b> <a,b,t> t <a,b,c,t> <DL(ti):a.b.c,t> <DL(b,i):a,b> <a,b,e,i> <a.b,c,t,i> <a.b,t,i> (A) (B) Figure 4.4: Examples of pathnames. 4. If X is an ancestor of y , x occurs in some pathname(s) of y and each pathname of x appears as a "prefix" in some pathnames of y . The pathname set at a node completely represents the topology of its ancestors in the derived name graph. The pathname set must be updated accordingly when the topology of ancestors is changed, since the correctness of R L / D L detections in future name graph updates wi l l depend on the consistency of the pathname set. The distributed procedure of updating pathname sets of the nodes in a name graph is called pathname update relay. For instance, after a normal arc < a;,y > is added, all paths leading to x are extended to y . Therefore, y and its descendants must update their pathname sets by adding the path-names inherited from x. To start the pathname update relay, an update request containing the pathname set of x is propagated through the subgraph rooted at y . A node receiving an update request simply appends its gid (or its vnid if it is a component of a V N ) at the end of each pathname contained in the request,'^ includes the resulting pathnames in its pathname " I f the node is in an V N a n d the u p d a t e request is received f r o m a n o t h e r c o m p o n e n t m the s a m e V N , the set, and relays the request to al l of its immediate children after replacing the pathname set in the request by the inherited pathnames. Threads of pathname update relay terminate at leaf nodes (i.e., groups with process members only) . Pathname update relay in other cases (such as adding a non-normal arc or deleting an arc) wiU be discussed in later sections. 4.5 Detecting RLs and DLs R L detection is simple if a pathname set is maintained at each node. When adding an arc < x.y >, node x learns that y is an ancestor if y occurs in any of its pathnames. Then aU nodes in x's pathnames from y to x ( including x and y) are the components of the new R L . For a node to detect that a new segment is leading to it from one of its ancestors, it has to do D L detection after every pathname update. A node k detects a DL(s,k) if • s appears in at least two different pathnames of node k, and • in these pathnames, s is the common node with the shortest distance from k and is not an immediate parent of k in aU of these pathnames. W h e n this occurs, k selects a pr imary segment as discussed in Section 4.2, computes and distributes an R D control record to its secondary parents in DL(s,k). When a new segment of the D L is added, the new segment becomes a secondary and the same R D control record is assigned to the parent of the sink in that segment as well. The R D control record of DL(s,k) can be computed by node k since every path that leads to s are extended to k and pathnames of s are embedded as prefixes in some pathnames of k. Nodes in these prefixes are ancestors of s and their ids are contained in these pathnames before Instead of having the sink compute and distribute the R D control record, the sink may tell its secondary parents about the source of the D L and let them compute the R D control record themselves. In this way, the size of the messages from the sink to these parents can be reduced. node need not a p p e n d its v n i d to the i n h e r i t e d p a t h n a m e s again . •"In a p a t h n a m e , m is before n i{ m identifies a n o d e t h a t is an ancestor of the n o d e identif ied by n. Because the R D control record is computed from pathnames and the vnid (rather than the gid of the originator) is recorded in the R D control record, a message originating at a component of a V N must have the V N ' s vnid in its originator field for the purpose of R D control . It should be noted that pathname update relay requests are not R D controlled because different copies of the request travel through different segments of the D L , bringing to the sink different pathname information. To increase efficiency, a sink may wait if the update request originates from the source or is routed through the source unt i l an update request is received from every segment. Then the pathname update request is relayed.^ 4.6 Join Operation A static algorithm consists of two name graph update operations: join and leave. In this section, we present a join operation that combines the above ideas. Before going into the detail , we briefiy describe V N topology information and outUne a construction algor i thm of spanning shadow tree. 4.6.1 V N T o p o l o g y I n f o r m a t i o n The topology information of a V N consists of three parts: 1. The EGV connection topology of a V N is the interconnections among al l the components of the V N in the E G V . This topology information can be represented by an adjacency matr ix [Bondy and M u r t y 76].^ 2. A parent list containing information about a l l immediate parents of the V N , ^ including ^ A n a l t e r n a t i v e vi^ay of r e d u c i n g message o v e r h e a d is to R D c o n t r o l p a t h n a m e u p d a t e requests . T h e sink, u p o n rece iv ing a u p d a t e request f r o m the p r i m a r y s e g m e n t , not on ly inherits the p a t h n a m e s i n the request but also generates the p a t h n a m e s w h i c h w o u l d be passed d o w n f r o m the s e c o n d a r y segments . It c a n do so because it c a n f i n d , f r o m the l o c a l p a t h n a m e set, the p a t h s f r o m t h e source n o d e t h r o u g h the s e c o n d a r y segments to itself, a n d f r o m the u p d a t e request , the p a t h s e x t e n d e d to t h e s o u r c e n o d e due to the arc a d d i t i o n . B o t h i n h e r i t e d a n d loca l ly g e n e r a t e d p a t h n a m e s are re layed . A d j a c e n c y m a t r i x is a s t a n d a r d d a t a s t r u c t u r e u s e d to represent g r a p h topology . '^A n o d e is a n i m m e d i a t e parent (or chi ld) of a V N i f it is an i m m e d i a t e parent (or chi ld ) of one of the c o m p o n e n t s of the V N . the gid of each physical immediate parent, the vnid of the V N in which that parent is a component (this vnid is undefined if the parent is not in any V N ) , and the component(s) to which that parent node is connected. 3. A child list containing the information about al l immediate children of the V N in the same format as the parent hst. The topology information of a V N is coUected when the V N is formed and is repUcated at aU the components. It is used when a new V N is formed or when an existing V N is broken by a leave operation. 4.6.2 V N S h a d o w T r e e C o n s t r u c t i o n Once the E G V connection topology of a V N is coUected, a number of existing algorithms maybe used to construct a spanning tree for the V N . We adopt the weU known Kruskal algo-r i t h m [Bondy and M u r t y 76] because of its simpUcity. Figure 4.5 shows the algorithm. P u t al l t h e E G V arcs o f t h e V N i n t o an a r c . l i s t a n d sort t h e arc_ l ist ; R e m o v e t h e f i r s t arc ( d e n o t e d by < a , 6 > ) f r o m a r c . l i s t ; t r e e _ n o d e s = {a , 6}; /* l ist o f t ree n o d e s */ t r e e . a r c s = {< a,b > } ; /* l ist o f t ree arcs */ while ( s i z e o f ( t r e e _ n o d e s ) < s i z e o f ( V N ) ) do begin < x,y > = g e t _ n e x t _ a r c ( a r c _ l i s t , t r e e _ n o d e s ) ; /* x £ t r e e . n o d e s */ if (j/ is n o t in t r e e . n o d e s ) then begin t r e e . n o d e s = t r e e . n o d e s Ll{y}; t r e e . a r c s = t r e e . a r c s U { < x,y >}; end end return ( t ree.arcs) ; Figure 4.5: Spanning tree construction. In sorting the a r c . l i s t . a comparison order between a pair of arcs is defined as foUows: < x,y > « x',y' > if X < x', or x = x and y < y'. Since gids are assumed globaUy unique and no paraUel arcs between physical nodes are aUowed in the E G V , this comparison order between arcs is well defined. The get_next_arc() takes two parameters: arcJist — the fist of arcs yet to be considered, and tree_nodes — the fist of nodes already connected by the tree. It removes and returns the smallest arc in the arcJist originating from a node in the tree. After a component has computed the fist of tree arcs, the state of its adjacent arcs can be determined. A n adjacent arc is shadowed if the arc is in the spanning tree. It is fade if the arc connects to another component in the same V N , but the arc is not in the Ust of tree arcs. The arc is set to normal if it connects to a node outside the V N . A key point here is that every component must have exactly the same topology information of the V N and run the same algorithm for the tree construction, so that the same fist of tree arcs is obtained. A s w i l l be seen in Section 4.6.3, this is guaranteed by collecting and distr ibuting the topology information of the new V N through a single coordinator. In practice, one may take the communication cost (delay, bandwidth, hnk ut ihzat ion, re-HabiUty etc.) of arcs into consideration to construct a min imum spanning tree. The cost information may be collected while the V N topology is gathered and saved as part of the V N topology in format ion . The modifications to the above algorithm would be: • each arc is associated with a weight representing the communication cost of that arc and the arcJist is sorted according to the weight; and • the arc returned from the get_next_arc() is the arc (originating from a node in the tree_nodes) wi th the smallest weight in the arcJist. 4.6.3 O p e r a t i o n D e s c r i p t i o n Consider a, join operation that adds an arc < x,y > and let P^ denote the pathname set of node X. F i r s t , the R L detection algorithm in Section 4.5 is invoked at x using f^. Depending on the result, we have the foUowing cases: N o new R L is formed If the new arc connects two nodes that are loop deep equal before the join (i.e., x and y are components of an existing V N ) , al l components in that V N are informed (by x) to update the V N topology information to include a fade arc < x,y >. No pathname update relay is necessary since no ancestor-descendant relationship is changed. If X and y remain not loop deep equal after the join, a normal arc < x,y > \s added and a pathname update is relayed i n the subgraph rooted at y to inherit the pathname set of x (refer to Section 4.4.2). If a; is i n a V N , that V N becomes an immediate parent of y. Components in that V N have to update their V N child hst to include y. Similarly , i f 2/ is in a V N , components in that V N learn about the new immediate parent x during pathname update relay and update their parent Ust to include x accordingly. A new R L is formed The subset of pathnames L = {p \ p £ Px A y £ p} identifies a l l the paths routed through y to x. The node set N = {n \ 3p £ L such that n E p A y ^ n} U {y} contains the nodes (both physical and V N s ) in the new R L . We shaU caU this set the component list of the new V N . Serving as the coordinator for the join operation, x contacts every element in the component Ust to coUect the topology information for the new V N . Every physical node contacted by x returns its local topology information — the Ust of its immediate physical parents and children in the E G V (including the gid of the parent /chi ld and the vnid of the V N in which that parent /chi ld is a component) . If an element in the component Ust is a V N , its membership is returned, so that x can expand the component Ust (by substituting the returned member Ust for the V N in the component Ust) and then obtain the local topology information from each physical node member.^ On the basis of the information gathered, an adjacency matr ix recording the E G V connection topology of the new V N and the Usts of parents and children of the new V N can be constructed. Node x also assigns the new V N a vnid . A U the paths to the components in the new V N can be recognized from pathname set P^, since these nodes are ancestors of x before arc < x.y > \s ^ An a l ternat ive t h a t reduces the n u m b e r of e x c h a n g e d messages is to d i r e c t l y receive the topology i n f o r m a t i o n of an e m b e d d e d V N f r o m its m e m b e r . added and each path to these nodes is contained as a prefix in x's pathnames. The pathnames of the new V N can be derived from P r by substituting these components with the vn id . A pathname update relay is launched in the subgraph rooted at x. The above derived pathname set is contained in the pathname update relay request, x sends the pathname update request to other components in the new V N . The topology information of the new V N is piggybacked, therefore is repUcated at every component. Upon receiving a join pathname update, a node n does the foUowing: • If n is a component of the new V N , it executes the shadow tree construction algorithm described i n Section 4.6.2 and sets the state of its adjacent arcs accordingly, adopts the new vn id and saves the topology information of the V N , and replaces its pathname set by the set of pathnames i n the update request.^ The update request is then relayed only to children connected by normal arcs. If any D L for which n performs R D control is embedded i n the new V N , the corresponding R D control record is discarded. • If n is not a component of the new V N , it performs pathname relay as described in Section 4.4.2. Because nodes in the new V N are no longer visible in the derived name graph, pathnames containing the ids of these nodes ( including the vnids of old V N s that are embedded i n the new one) are eUminated. After completing each pathname update, a node runs the D L detection procedure (refer to Section 4.5) and updates its R D control record. W i t h i n a V N , only the component with the smaUest gid needs to do D L detection. 4.6.4 H a n d l i n g P a r a l l e l A r c s A l t h o u g h paraUel arcs are not aUowed in a name graph, they may exist in the derived name graph since it is possible that paraUel arcs may exist between a physical node and a V N , or ' 'If n was a c o m p o n e n t of a n ex is t ing V N , the v n i d a n d the t o p o l o g y i n f o r m a t i o n of the o ld V N are d i s c a r d e d , since the o ld V N d i s a p p e a r s (it is e m b e d d e d in the new V N ) . ' ° T h e reason is t h a t a l t h o u g h al l c o m p o n e n t s in a V N c a n do the D L d e t e c t i o n , their results are i d e n t i c a l a n d their efforts i n D L d e t e c t i o n s are d u p l i c a t e d . between two V N s . Paral le l arcs must be handled since pathnames represent the topology of the derived name graph. The foUowing steps are included i n the D L detection procedure: 1. Given the id of an immediate parent obtained from a pathname, a physical node, say n, can find out the number of arcs coming from that parent to n, or to the V N in which n is a component. This is possible because the parent Ust repUcated at n provides aU the necessary information i f n is in a V N . If n is not in any V N , n may find this from its local topology information. A n y change to the mapping between the gid of an immediate physical parent and the vnid of the V N in which that parent is a component can be observed and recorded by n when the pathname update request arrives. 2. W h e n more than one arc from an immediate parent to n or to the V N in which ra is a component is found, each arc is treated as a segment of a D L , consisting of only a source (the immediate parent found from the pathname) and a sink (node n or the V N in which n is a component) . A U but one such paraUel arcs are R D controUed for the D L . 3. After aU pathnames at a node are examined and aU paraUel arcs are taken care of as stated above, the D L detection procedure described i n Section 4.5 is used to deal wi th D L s containing both paraUel arcs and segments longer than one arc. Assume there are m (> 1) paraUel arcs from an immediate parent p in a segment S to the sink in a DL{s, k). These paraUel arcs form a DL(p,k) which has been taken care of in step 2. If the D L detection procedure chooses S as the pr imary segment of DL{s,k), aU but one paraUel arcs from p to k are R D controUed; otherwise, aU such arcs are R D controUed should S be chosen as a secondary segment. A s stated i n Section 4.2, when S is chosen as the primary, the paraUel arc from p to k which is not R D controUed must be the pr imary of DL{p.k) since DL{p,k) is embedded in S. 4.7 Leave Operation From the view point of the users, leave is the operation that negates the effect of join. A group y, which is an immediate member of another group x, quits its membership in x by invoking leave(x,y). This operation deletes arc < x,y > from the E G V . In static R L avoidance algorithms, however, the leave operation may not be that simple if the deleted arc is contained in an R L . The R L containing the deleted arc is called the originating RL of the leave operation. Depending on the topology of the originating R L and the location of the deleted arc, some nodes in the originating R L may no longer be loop deep equal; others may remain being connected by cycles in the originating R L . Each such cycle contains a subset of nodes and arcs in the or ig inat ing R L . The cycle is not broken because it does not contain the deleted arc. A set of such remaining cycles is called a retained RL i f (i) these cycles were part of the originating R L , (ii) they are not broken, and (iii) they remain chained together after the arc deletion. A retained R L is a new R L , it defines a new loop deep equal relation. A leave operation may also generate new D L s or change existing D L s . D L s embedded in the originating V N are hidden but w i l l be re-exposed if its source and sink are no longer loop deep equal after the arc deletion. Re-exposed D L s need to be detected and properly controlled. A lso a segment of a D L may be removed after an arc deletion. If the removed segment was the primary, a new pr imary needs to be chosen. A D L disappears when only one segment is left. The leave operation has to reorganize the I G V connections among nodes in the originating R L to maintain name resolution consistency. The ancestor-descendant relation among nodes that are no longer loop deep equal must be reestabhshed in the I G V . Nodes in a retained R L have to be mutual ly reachable i n the I G V to maintain their loop deep equal relation. Furthermore, pathname sets at the descendants of the originating V N have to be properly updated, so that D L s can be detected and future name graph updates can be performed correctly. Before discussing the solution to the leave operation, the following points are observed: • If the deleted arc is not in any R L , it is a normal arc and its removal does not change any loop deep equal re lat ion. N o change is necessary in the I G V except removing the arc. • A n arc deletion affects at most one loop deep equal relation in the name graph. Only I G V connections among nodes in the originating R L need to be reorganized. • W h e n the deletion of an arc < x,y > breaks a cycle, aU nodes in that cycle wiU become the descendants of y after the arc removal, irrespective of their ancestor-descendant relation before the cycle was formed. The new ancestor-descendant relation is defined by the 2/ A x paths in the E G V since this is what a user perceives from the E G V . • After deleting a normal arc < x.y >, only those nodes reachable from y need to be informed of the topology change as far as pathname update relay is concerned. These nodes have to drop the pathnames of the pattern < • • - xy • • • > since paths to x are no longer directly extendible to y. If arc < x,y > is not normal and its deletion breaks the originating V N , all the descendants of the originating V N have to drop their pathnames containing the broken V N . The basic idea of the solution to the leave operation is simple. When a leave operation breaks a V N , the following steps are taken: 1. determine new loop deep equal relations among nodes in the originating V N according to the E G V topology of the originating V N (the result is a set of retained R L s ) ; 2. in the I G V , restore the arcs between retained R L s and construct a spanning shadow tree connecting the nodes in each retained R L ; and 3. relay pathname update according to the reorganized name graph topology, and detect D L s and update R D control record according to the new pathname set. Figure 4.6 shows an example of the leave a lgor i thm. In the example, deleting arc < x,y > within an R L breaks that R L . Two retained R L s are generated after the removal of < x,y >, one contains nodes u, y and z, the other contains nodes x, v and w. The arcs connecting the two retained R L s are restored. DL{z,x) embedded in the originating R L is exposed after the arc deletion and R D control record for this D L is assigned to arc < z,x >. 4.7.1 D e t e r m i n i n g R e t a i n e d V N .\fter deleting a shadow edge or a fade arc, the connectivity among aU nodes in the originat-y X y X z V z V E G V I G V Figure 4.6: A n example of V N breaking by a leave operation. ing V N can be obtained from the E G V connection topology of the originating V N (represented by an adjacency matr ix ) . Assume that the components in the originating V N are numbered f rom 1 to A''. B y definition [Bondy and M u r t y 76], in an adjacency matr ix A, element a,-,j = 1 if there is an arc from node i to node j; otherwise, a , j = 0. The reachabiUty m a t r i x R = J2iLi A reachabiUty matr ix element r,-,j > 0 indicates that in the E G V at least one path exists within the originating R L from node i to node j after the given arc is removed.^^ W i t h the reachabiUty matr ix R, components in the originating V N can be partit ioned into equivalence sets, each constitutes a retained V N . Node i and node j are i n the same set i f and only i f Tij > 0 and rj^i > 0; that is, nodes in each set are mutuaUy reachable in the E G V and remain loop deep equal. ' ' A s a m a t t e r of fact , f i n d i n g r e t a i n e d R L s is equiva lent to f inding s trongly c o n n e c t e d c o m p o n e n t s in a g r a p h . O t h e r a l g o r i t h m s for this p u r p o s e c a n be f o u n d i n [ A h o et al 74]. 4.7.2 O p e r a t i o n D e s c r i p t i o n Consider a leave operation tl iat deletes an arc < x.y >. Depending on whether any loop deep equal relation is changed, we have the following cases: N o V N is Broken W h e n a normal arc is deleted, no R L is broken and no loop deep equal relation in the name graph is changed. A pathname update relay is launched in the subgraph rooted at y to delete pathnames of the pattern < • • •,x,y, • • • >. O n the basis of the updated pathname set, a node updates its R D control record to exclude the nodes that are no longer the ancestors of the source of the concerned D L . A t the sink of a D L , it is also necessary to check if deleting < x,y > removes a segment of the D L . If less than two segments is left, the D L disappears and its R D control record at the secondary parents are removed. Otherwise, a new primary segment has to be selected if the pr imary segment is removed by the leave. If a; is in a V N , other nodes in the V N have to update their V N child hsts. In the case that the deleted arc is not normal and the arc removal does not change any loop deep equal relation (i.e., r^^y > 0 and ry^x > 0), the update is internal to a V N . Only the nodes in that V N need to be informed to exclude arc < x,y > from the E G V connection topology of the V N . The spanning shadow tree for that V N may have to be recomputed, but no pathname update relay is necessary. T h e Originating V N is B r o k e n W h e n deleting < x,y > breaks an R L (i.e., rx,y — 0 or ry^x = 0), y, serving as the coordinator for the leave operation, informs other nodes in the originating V N of the arc deletion. Each node in the originating V N does the foUowing reorganization work: • computes the membership of the retained V N in which the node is a member (refer to Section 4.7.1), and runs the spanning shadow tree construction algorithm (refer to Section 4.6.2) on this retained V N to shadow/restore its adjacent arcs:^'^ • cleans up the pathname set by removing the pathnames representing paths to nodes other than itself or to any component of the retained V N ; and • sets the last element of the remaining pathnames to the vnid of the retained V N in which the node is a component (or its gid if the node is not in any retained VN).^'^ After node y completes its reorganization work and receives confirmation that other nodes in the originating V N have also done so, it launches a leave pathname update relay. The update request contains the deleted arc, the vnid of the broken V N , and the complete pathname set of y. Pathname update requests are propagated along normal arcs and shadow edges in the subgraph rooted at y and are not R D controlled. Upon receiving a leave pathname update, a node does the following: • Inherits the pathnames in the request and relays the update request. If the node was a component or a child of the or iginating V N , changes made to immediate parents are observed from the received pathnames and are recorded in the parent hst. • Discards the R D control record of a D L i f the vnid of the originating V N is found in that record or i f the originating V N is part of the D L . This is because after the originating V N is broken, its components or retained V N s become visible. The R D control record has to be recomputed to reflect these changes. • Invokes the D L detection procedure (refer to Section 4.5) i f the node is not in a V N or if the node has the smaUest gid among aU components in a V N . 4.8 Correctness In Section 4.2 and Section 4.3, we presented a correctness argument for the shadow tree ' ^ I f the n o d e is not in any reta ined V N , aU its a d j a c e n t arcs are res tored . ' ^ A l t e r n a t i v e l y , the c o o r d i n a t o r m a y c o m p u t e the r e t a i n e d V N s , assign t h e m v n i d s a n d d i s t r i b u t e the result to other c o m p o n e n t s . scheme and the R D control scheme. In this section, we briefly argue the correctness of the join and leave operations wi th respect to the name resolution consistency requirements in Section 4.1. Since both operations do not add/delete any node in the name graph, C l is always satisfied. According to requirement C 2 , an update operation in the shadow tree algor i thm is correct if after adding/delet ing an arc, (1) nodes in each R L in the E G V are connected among themselves by a spanning shadow tree in the I G V , and (2) the ancestor-descendant relationship in the I G V conforms to the derived name graph of the E G V . We show that C2 is satisfied in two steps: 1. Since a join operation constructs a spanning shadow tree for the new R L and a leave operation constructs a separate spanning shadow tree for each retained R L , nodes that are loop deep equal in the E G V are also loop deep equal in the I G V . W h e n the name resolution procedure i n Figure 4.2 is used, a message resolved at a tree node wiU be eventually resolved at a l l other nodes in the same shadow tree. The order that nodes are visited during name resolution may be different from that expected from the E G V , but this should be of no concern to the users. 2. Except for restoring existing fade arcs or shadow edges across loop deep equal sets, a leave operation does not add or delete any arc between different loop deep equal sets. A join operation does not add or delete any arc outside the new V N . The set of arcs in the I G V between nodes that are not loop deep equal is identical to that in the E G V , therefore, the I G V and the E G V have the same derived name graph and the ancestor-descendant relation among nodes in the I G V are consistent with that in the E G V . Furthermore, both operations launch pathname update relay after I G V reorganization. The propagation of pathname update requests follows the direction of normal arcs between loop deep equal sets. Hence, pathnames correctly record the ancestor-descendant relation in the derived name graph. When a V N is created or changed, aU the descendants of that V N are involved in the pathname update to bring their view of the name graph up-to-date. D L s are detected as soon as information about the topology of ancestors becomes available (i.e., after a node completes each pathname update) . We therefore conclude that both update operations preserve name resolution consistency between the I G V and the E G V . 4.9 Related Work [Benford and Onions 87, Deutsch 84] describe some work in dynamic resolution loop detec-t ion in the context of electronic m a i l distr ibution hsts. A s pointed out in Section 3.3, these schemes do not perform correctly in certain circumstances. The static algorithm not only guarantees correct name resolution, but also provides potential ly better name resolution per-formance because it is not necessary to perform Ust searching at every node during message transport to prevent resolution loops and duphcations. [Dally and Seitz, MerUn and Schweitzer 80a, M e r l i n and Schweitzer 80b] discuss deadlock avoidance in message routing in multiprocessors and in message buffer management for store-and-forward networks using deadlock graph transformations. The requirements of graph trans-formation for deadlock avoidance is different from that for resolution loop and duphcation avoidance. A s far as we know, there has been no other research on the problem of static resolution loop and duphcation avoidance. 4.10 Chapter Summary In this chapter we designed a spanning shadow tree algorithm that removes the effect of R L and D L while preserving name resolution consistency between the E G V and the I G V . This distributed static algorithm uses a pathname structure for a node to record the topology of its ancestors. W i t h this knowledge, a node can detect i f any R L wi l l be formed when an arc is added. The v i r t u a l node concept is used to encapsulate a set of nodes that are loop deep equal. This concept is helpful in defining the amount of name graph information required for static algorithms to work. In a join operation, V N topology information is collected by the coordinator when the V N is formed. The E G V connection topology of a V N is rephcated at all its components. Based on this information, a component constructs a spanning tree and shadows/fades adjacent arcs t o / f r o m other nodes in the same V N depending on whether the arcs are in the tree. In a leave operation, each node in the or iginating V N computes the retained V N based on reachabihty analysis and runs the spanning tree a lgor i thm to determine the state of its adjacent arcs. D L s are detected after each pathname update and R D control records are assigned to a l l secondary parents of the sink i n each D L to suppress resolution duphcations. Name resolution procedure is modified to take shadow edges and fade arcs into consideration for R L control and to check R D control record for D L suppression. The correctness of the algor i thm is argued by showing that the derived name graphs of the I G V and the E G V are the same and components in a V N are mutual ly reachable. Thus the name resolution consistency requirements C l and C 2 are satisfied. Chapter 5 Analysis and Experiments In this chapter, we analyze the communication complexity of the shadow tree algorithm in Section 5.1, describe a prototype implementation in Section 5.2, and report some experiments conducted on the prototype in Section 5.3. 5.1 Communication Complexity The complexity of the static algorithm is measured i n number of messages. It depends on the name graph configuration, the location that the update is performed in the name graph, the distribution of group nodes in a physical system, and multicast support at group nodes. Since these factors are system-specific, we shall estimate the worst case communication complexity for the algorithm. 5.1.1 W o r s t C a s e A s s u m p t i o n s Consider adding/delet ing an arc in a name graph G = {N,A) containing |A''| nodes and \A\ arcs (a shadow edge is counted once because a message traverses each arc in a shadow tree only once). In the worst case, 1. groups are completely distributed, one group node per subnet, so that every message between group nodes must be counted; 2. multicast is simulated by one-to-one interprocess communication at every node; and 3. tlie name grapl i topology is organized as shown in Figure 5.1, where each node has a path from each of its ancestors. ^ 1 Figure 5.1: The worst case name graph topology. We justify the worst case topology as follows. A n update can be decomposed into two phases. In join, the first phase consists of the collection of topology information of the new V N from each of its components. In leave, the first phase consists of the notif ication of the deleted arc so that each component of the originating V N can reorganize the state of its adjacent arcs and clean up its pathname set. W h e n only a normal arc is added or deleted, the first phase is not needed. The second phase consists of the pathname update relay. In this section, the part of the name graph in which the pathname update relay is to be performed after an update operation is called the to-be-updated subgraph. In an update operation, pathname update relay request must travel through every normal arc and shadow edge in the to-be-updated subgraph, because the correctness of pathname update relay is propagation path dependent. The message cost in the second phase of an update is proport ional to the number of arcs in the to-be-updated subgraph, the number of new D L s resulted from the update, and the number of secondary segments in each new D L . Therefore, to justify the worst case topology, we only need to show that any D A G is a subgraph of the graph that has the same number of nodes and the topology shown in Figure 5.1. We use a greedy construction strategy to show that the topology in Figure 5.1 contains the m a x i m u m number of arcs. Let us number the nodes in the graph by 1, 2. • • •. n where n = \N\. Observe that (a) node n can have at most \N\ - 1 in-arcs since the name graph is a simple graph; and (b) given that there is an arc from every other node to n , it is impossible to have an arc from n to any of the other nodes, since the graph is a D A G . Recursively applying this construction strategy to the other |A'^|-1 nodes results in the topology shown in Figure 5.1 where \N\ — 5. Because adding an arc into this graph would either generate a loop or make the graph non-simple, this topology gives the m a x i m u m number of arcs in a D A G : \A\ = |A'|(|A'| — l ) / 2 . 5.1.2 M e s s a g e C o m p l e x i t y In the special cases that adding an arc joins two nodes already in the same R L or deleting a nonnormal arc does not break an R L , the second phase of relaying pathname update is not necessary. Suppose the involved V N has k components, only k — I messages are necessary to inform the other k - 1 components to update their E G V connection topology of the V N . In the worst case, k = \N\. W i t h the worst case assumptions described in Section 5.1.1, the two phases of an update in non-special cases are analyzed as follows. Given a to-be-updated subgraph consisting of m nodes (where 1 < m < \N\), the message overhead of the second phase consists of two parts: 1. OHpn-updateitn) — the number of messages for pathname update relay in the to-be-updated subgraph: Assuming no R L in that graph, the m a x i m u m number of messages in this part is m ( m - l ) / 2 , since one message per arc is required. This indicates that the m a x i m u m overhead is incurred when an update occurs close to the root of the name graph and contain as many nodes in its to-be-updated subgraph as possible. 2. 0Hrd-assign{Tn) — the number of messages for assigning or removing the R D control record at the secondary parents of the sink in each D L : The number of messages in this part depends on the topology of the to-be-updated subgraph. This part of the overhead reaches the m a x i m u m when (i) the to-be-updated subgraph contains the maximum number of D L s , (ii) each D L contains the m a x i m u m number of segments, and (iii) the to-be-updated graph does not contain any R L since D L s embedded i n an R L need not to be R D controhed. The worst case topology maximizes this part of the overhead to m ( m - l ) / 2 - ( m - 1), as every arc in the graph wi l l be R D controlled except the path from G i to G m which contains m — 1 arcs. Next, let us analyze the first phase oi join. Consider the graph i n Figure 5.2, where \N\ — 5. If a normal arc is added (i.e., no new R L is formed), only a pathname update is necessary. Let us ignore arc < G 2 , ( J I > for the moment. From the above discussion, if arc < G i , G 2 > is the arc added by the join, the m a x i m u m number of messages wi l l be required in the second phase; that is, 1 + OHpn-uvdate{\N\ - 1) + OHrd-assian{\N\) = \N\^ - Z\N\ + 3 siuce the R D record assignment is performed in the subgraph rooted at G i which contains |A'^ | nodes, and pathname update is performed i n the subgraph rooted at G2 which contains lA'^l — 1 nodes. Suppose that an R L of k components is formed. Now consider the complete graph in Figure 5.2 inc luding arc < G2,G\ >. The first phase requires 3(A; — 1) messages to collect and to distribute the new V N topology information. The number of messages in the second phase can be maximized to (|iV| - k)'^ + 2{k - \){\N\ - k) i f the new V N includes the first k nodes (including G i ) . ^ The first part of this expression is the number of messages required in the ' A s s u m e t h a t each n o d e i n the t o - b e - u p d a t e d s u b g r a p h does not relay p a t h n a m e u p d a t e request to its c h i l d r e n G 1 Figure 5.2: Worst case overhead analysis of the join. pathname update relay and the R D control record assignment for a graph of |iV| - A; + 1 nodes with the worst case topology. The second part is the number of addit ional messages for aU but one components in the new V N to send their pathname update requests to the immediate children connected by normal arcs and for R D control record assignments to these arcs. The tota l number of required messages when an R L of (> 2) nodes is formed is 3(A; — 1) + ( I TV I — k)'^ + 2(k — 1){\N\ — k). B y setting its derivative wi th respect to k to zero, the max imum number of required messages equals to |iV|^ - 2|iV| + 3 is obtained when A; = 2 or 3; that is, when arc < G2-,G\ > or < Gz,Gi > is added. In summary, given the worst case topology wi th \N\ nodes, the m a x i m u m number of mes-sages required in a join is |iV|^ - 3|iV|-I-3 a normal arc is added, Overheadjoin = { \N\ — 1 a nonnormal arc is added without creating new V N , [ |7V|2 - 2|7V| -1-3 a new V N consisting oi k {2 < k < \N\) nodes is formed. FinaUy, let us analyze the first phase of leave. Consider the graph in Figure 5.3, where |A''| = 5. If a normal arc is deleted, only a pathname update is necessary. Let us ignore arc Figure 5.3: Worst case overhead analysis of the leave. < G'|Ar|,Gi > for the moment. From the above discussions, i f arc < G i , G 2 > is the arc deleted u n t i l the request is received f r o m all i m m e d i a t e p a r e n t s i n t h a t s u b g r a p h . T h i s a s s u m p t i o n simplif ies the message c o u n t i n g a n d reduces the t o t a l n u m b e r of messages r e q u i r e d . by the leave, the number of messages required in the pathname update relay is majcimized to IA' ' !^ — 3|iV| + 3 since the to-be-updated subgraph is the subgraph of G2 which contains |iV| — 1 nodes and has the worst case topology. Suppose that an R L with k components is broken by the leave. Consider the complete graph i n Figure 5.3 including arc < G|jv|,Gi >. The first phase requires 2{k — 1) messages to announce the arc deletion to the components in the originating V N . Obviously, this part of the overhead is maximized if k = \N\. The number of messages required i n the second phase depends on the topology of the to-be-updated subgraph after the R L is broken. From Figure 5.3 it is easy to see that if < G | ^ | , G i > (here \N\ = 5) is the arc to be deleted, the pathnames of the whole name graph have to be updated. Therefore, the overhead i n this phase is OHpn-updatei\N\) + OHrd-assignilN\) = (|iV| - 1)^. The max imum number of messages required when an R L is broken by an arc deletion is 2(|iV| - 1) + (|A''| — 1)-^ . In summary, given the worst case topology wi th |A'^ | nodes, the max imum number of mes-sages required in a leave is Overheadieave = ' |A''p-3|A'^|-|-3 a normal arc is deleted, lA'^l — 1 a nonnormal arc is deleted without destroying V N , |iV|^ - 1 a V N consisting of k {2 < k < \N\) nodes is broken. 5.1.3 D i s c u s s i o n Note that although the worst case message complexity of both operations is in the order of 0{\N\^), the average complexity is expected to be much less, because: • Name graphs tend to be sparse in the real world since unrelated groups usually are not connected. Th is suggests that an update act iv i ty ( including its pathname update relay) only involves a small portion of the name graph instead of al l \N\ nodes. • W h e n a V N of size m exists in the name graph, only m - 1 messages are needed to prop-agate a pathname update to the components of the V N . This is one order of magnitude smaller than m ( m - l ) / 2 messages when these nodes are not in a V N . A l s o , there is no need for R D control assignment for D L s embedded in a V N . • If multicast is properly supported, a node may multicast pathname update relay to its children, reducing the number of messages. • M u l t i p l e group nodes may reside on the same machine. If these nodes are consecutively connected in the name graph, only one intermachine message is needed in updat ing their pathnames. • In practice, major i ty updates occur at the leaf-level in the name graph. A lso , the number of secondary parents to be assigned R D control records is typically small . 5.2 Prototype Implementation A prototype implementation of the internet group model and the static name graph update algorithms has been bui lt . Th is prototype, serving as an existence proof, is used to estimate the amount of implementation effort and to demonstrate and investigate the model behavior in various situations. 5.2.1 E n v i r o n m e n t The prototype is implemented using Threads — a sub-kernel running inside a U n i x pro-cess [Neufeld et a l 90]. Threads supports Ught-weighted cooperative processes sharing a sin-gle U n i x process address space and provides efficient interprocess communication between the Threads processes. In Threads, concurrency can be achieved by creating multiple execution threads in a U n i x process. Threads is chosen as the test-bed of the prototype for the foUowing reasons: • Easy to use. Threads provides a very simple user interface. Connectionless message pass-ing between different Threads processes is supported by synchronous Send(), Receivef) and ReplyO primitives. Message receivers are simply addressed by their pids. A U existing U n i x system caUs and Ubrary routines are available in Threads. The Threads sub-kernel has been weU tested and ported on several architectures. • Easy to debug. Since Tlireads processes are in a single Unix process, Un ix dbx can be used to debug the prototype. • Easy to move the prototype to a real system since a program written to run in the Threads environment has few differences from one wr i t ten to run directly under Unix . 5.2.2 P r o t o t y p e S t r u c t u r e The internet group name graph model in Chapter 3 and the name graph update operations described in Chapter 4 are implemented. Each group node in the name graph is implemented as a separate Threads process and a l l nodes in the name graph reside in the same U n i x process address space. The arcs between nodes are recorded in an adjacent arc Hst data structure maintained at each node. A node uses this hst when it takes part in name resolution or pathname update relay activities. To be reahstic, communication between different nodes are restricted to message passing; that is , instead of passing pointers, messages between Threads processes are copied. A node coordinates a name graph update activity (with other group nodes) when a user request (join or leave) is received. There is a book-keeping server, implemented as a thread, which keeps track of the user com-mands and maintains the E G V . It also keeps track of name graph update activities coordinated by group nodes and maintains the I G V . Its purpose is to provide a global view of the name graph. W h e n requested by users during debugging, it can print the adjacency matrices of the I G V and the E G V , and compute reachabihty matrices in both views for comparison. There is a user interface thread which is responsible for reading commands from the console and creating and deleting group nodes in the name graph. It dispatches join and leave user requests to group nodes in the name graph, or the user requests for the global state of the name graph to the book-keeping server. The implementation is in C. The tota l number of hues of code is about 7,800, including blank hnes, comments and debugging routines. M a n y data structures could have been simphfied and many routines wi th similar functions could have been combined and optimized. 5.2.3 P r o t o t y p e L i m i t a t i o n s We briefly outline the Umitations of the prototype in this section. Removing these Umita-tions is left as future work. T i m i n g performance refers to the execution time of name graph updates. Since the pro-totype is implemented wi th in a single U n i x process, its t iming performance does not reflect the real execution time and message delay in physical networks. To obtain meaningful t iming performance, the prototype has to be moved to a real network, its code optimized, and the measured performance data have to be collected from there. Concurrency is another issue not addressed. Since there is only one user interface thread that reads user requests from the console and dispatches the requests to the group nodes in the name graph, update requests from the user are executed in a sequential order. To execute updates concurrently, the prototype has to be implemented using a multiple Unix process structure (preferably on mult iple machines) and concurrency control and failure handhng protocols have to be implemented. Th is wiU allow users to send their requests directly to the originating groups (rather than having al l requests funneled through a single process as in the prototype) to update name graph concurrently. The concurrency control and the failure handhng protocols in Chapter 6 are not implemented in this prototype because they are extensions of protocols that are known to work. 5.3 Testing Experiments We report the testing experiments conducted on the prototype in this Section. T w o types of testing experiments were conducted on the prototype: random testing and special-case testing. 5.3.1 R a n d o m T e s t i n g The purpose of the random testing is to detect if there is any design flaw in the algorithm or any unexpected bugs in coding the prototype. D u r i n g random testing, graduate students were invited to attack the prototype. Group nodes were added to the name graph and arcs between nodes were added and deleted randomly. .A.fter each update, name resolution consistency was verified by invoking name resolutions on group nodes to check if the name graph was correct and if R L and D L effects were properly controUed. Users were supposed to know the E G V graph, but the I G V must be transparent to them. If a message sent to a group node could be received by that node and each of its descendants (according to the E G V ) once and only once, name resolution consistency between the I G V and the E G V was achieved and the execution of the update was considered correct. Except for a few smaU bugs which were easily fixed, no major error in the algorithms was found. The largest name graph tried by students had more than 20 nodes. Name resolutions were performed as expected according to the E G V and no inconsistency was found. In other words, the I G V name graph transformations in static updates were transparent. 5.3.2 S p e c i a l - C a s e T e s t i n g The purpose of the special-case testing is to see whether the internet group model and the static name graph update operations react to various situations as expected. We focused our attention on testing updates that generate, modify or destroy R L s and D L s , because it is in these cases that the static R L / D L control algorithm takes effect to change the I G V . .A.S in the case of random testing, name resolutions were invoked on nodes in the name graph after each update was completed in order to verify the correctness of the update. The execution of a static update was correct if name resolution consistency between the I G V and the E G V was preserved. Furthermore, the internal state of each node, including its pathname set. its adjacent arc Ust and the topology information of the V N in which the node was a component, were carefuUy examined using the Unix dbx. Appendix B describes details of the experiments conducted on the prototype. These exper-iments were designed according to the flow of control in the prototype program. They cover aU the situations Usted below. A l t h o u g h nothing formal can be said about the coverage of the special-case tests, the experiments caused every Une of the code to be executed at least once.-" E a c h Line of code executed d u r i n g the e x p e r i m e n t i n g process was m a r k e d by h a n d to ensure that al l lines The situations that need to be tested are summarized below: T e s t C a s e s for J o i n We hst the situations which need to be tested for the join operation in this section. These test cases include R L generation/modif icat ion, D L generation/modif icat ion, and combinations of both . Examples are shown i n Figure 5.4, where a dotted hue represents the arc added by the join and circles and shaded areas represent V N s . 1. The t r iv ia l case — an arc addit ion generates no R L / D L . Th is allows us to check if path-name update relay is performed properly in join. 2. R L detections. Th i s can be further divided into the following cases: (a) The detection of a t r i v i a l R L — an R L consisting of two physical nodes only (case J .2 .a in F igure 5.4). (b) The detection of a nontr iv ia l R L whose component hst contains physical nodes only (case J.2.b i n Figure 5.4). (c) The detection of a nontr iv ia l R L whose component hst contains physical nodes as weM as V N s (case J.2.c in Figure 5.4); that is , existing R L s may be embedded in the new R L . Th i s allows us to check if the component hst of the V N can be expanded properly during V N construction. (d) Changing the connection topology of a V N by adding an arc that connects two components i n the same V N (case J .2 .d in Figure 5.4). The effect of this update should not be seen by nodes outside the V N . The above were tested in join experiments 1 and 2 in A p p e n d i x B . 3. D L detections. Th i s can be further divided into the following cases: were e x e c u t e d . J.4 J.5 Figure 5.4: Test cases for the join operation. (a) The detection of a simple D L — a D L containing no V N or embedded D L (case J .3 .a in Figure 5.4). (b) The detection of a D L that contains embedded D L s (case J.3.b in Figure 5.4). (c) The detection of a D L that is embedded in another D L (case J.3.c in Figure 5.4). (d) The detection of a D L wi th a V N source, or a V N sink, or both (case J .3 .d in Figure 5.4). (e) Updat ing R D control record when ancestors of the D L source are changed. (f) The effect on R D control when adding an arc connecting two nodes already in the V N source or the V N sink. (g) The detection and control of parallel arcs between two V N s , or between a physical node and a V N (case J .3 .g in Figure 5.4). The above were tested in join experiments 2 - 8 in Append ix B . 4. The detection of an R L containing embedded D L s (case J.4 i n Figure 5.4). This was tested in join experiments 2, 3 and 5 - 8 in Appendix B . 5. The detection of an R L that contains nodes in different segments in the same D L (case J.5 in Figure 5.4). In this case, the D L involved is changed and the resulting new D L s have to be detected and controlled. Th is was tested in join experiment 5 in Appendix B . Test Cases for Leave We hst the situations which need to be tested for the leave operation in this section. These test cases include D L modi f icat ion/destruct ion, and R L modif icat ion/destruct ion and its effect to D L s . Examples are shown in Figure 5.5, where the deleted arcs are crossed and shaded areas represent V N s . 1. The t r iv ia l case — deleting a normal arc. This includes the foUowing cases: L.4.C L .4 .d L.5 Figure 5.5: Test cases for the leave operation. (a) The deleted arc is not in any R L / D L . This aUows us to check if pathname update relay is performed properly in leave. (b) Deleting the arc removes a secondary segment of a D L (case L . l . b in Figure 5.5). (c) Deleting the arc removes the pr imary segment of a D L . This allows us to check: (1) If a new pr imary can be properly selected when there are more than one sec-ondary segment left in the involved D L (case L . l . c . l in Figure 5.5). (2) W h e n only one segment is left (case L . l . c . 2 in Figure 5.5), i f the R D control record for the involved D L is removed at the immediate parent of the sink. These were tested in leave experiments 1 and 5 in Appendix B . 2. The effect of deleting a shadow edge or a fade arc within a V N without changing any loop deep equal relation (case L.2 in Figure 5.5). Nodes outside the V N should not be aware of the arc deletion, but components of that V N should properly update the V N topology information. This was tested in leave experiment 2 in Appendix B . 3. The deletion of parallel arcs between two V N s or between a physical node and a V N (case L.3 in Figure 5.5). Effectively, this is similar to the case when a segment of a D L is removed. This was tested in leave experiment 3 in Appendix B . 4. The effect of breaking a V N due to an arc deletion. This includes: (a) Construct ion of retained V N s (case L .4 .a in Figure 5.5). (b) Re-estabhshment of the ancestor-descendant relation among ah the nodes in the broken V N (case L.4.b in Figure 5.5). (c) Whi le a D L may be removed when its V N source is broken, new D L s may be formed, and should be detected (case L.4.c in Figure 5.5). (d) Whi le a D L may be removed when its V N sink is broken, new D L s may be formed. The new D L s should be detected and R D control records have to be assigned (case L.4.d in Figure 5.5). (e) D L s embedded in a brolien V N may be re-exposed and tlierefore siiould be detected. These were tested in leave experiments 4 - 7 in Append ix B . 5. W h e n a V N is the sinli of a D L and the source of another D L at the same time, breaking the V N may result in both D L s being destroyed and new D L s generated (case L.5 in Figure 5.5). Th is was tested in leave experiment 7 in Appendix B . 5.4 Chapter Summary The communication complexity of the shadow tree algor i thm is 0(|A^|^) in the worst case. In practice, a better average case complexity can be expected. The effort of prototype i m -plementation is reasonable. The testing experiments show that the algorithm is robust and pract ical . Chapter 6 Concurrency and Resiliency Because message delays in networks are inherently unpredictable, the execution result of an update may be affected by the concurrent executions of other updates or by some unexpected node or communicat ion hnk failures. A pract ical name graph update protocol has to deal wi th these situations to produce correct results. Th i s chapter is organized as follows. In Section 6.1, we discuss the interference between updates and analyze why and how interferences may occur. We also briefly outhne different concurrency control pohcies on name graph updates and their imphcations. In Section 6.2, an update ordering protocol that serializes concurrent updates is described and its correctness is argued. In Section 6.3, the failure mode is defined and the basic approach to deahng with failures is outUned. In Section 6.4, the update ordering protocol is extended to provide resihency. Failure handUng activities to mainta in name resolution consistency are described in Section 6.5 and rehable name resolution protocols are outhned in Section 6.6. 6.1 Concurrency Control Issues A s Table 6.1 shows, for each update operation op(x,y), we define its originator to be the node that coordinates the update operation; its working arc to be the arc added or deleted by the update in the E G V ; and its working set to be the set of nodes along the y x path(s) . The subgraph rooted at y is caUed the subgraph of update op(x,y). op(x,y) originator working arc working set join{x,y) X <x,y> (p i f no R L is detected. Otherwise, ah components in the new V N . leave{x, y) y < x,y> </) i f < X , 2/ > is normaL Otherwise, aU components in the originating V N . Table 6.1: Orig inator , working arc and working set. W i t h o u t loss of generahty, assume that there is a name graph manager process per node. The name graph is coUectively maintained by al l the name graph managers. A name graph manager maintains the following topology information about the name graph: 1. information for name resolutions — the adjacent arc hst specifying the type (in or out) and the state (normal, shadow or fade) of the arcs incident at this node, and R D control records for out-going arcs that are R D controUed for some D L ; 2. information for R L and D L detections — the pathname set representing the topology of ancestors i n the derived name graph, and 3. topology information of the V N in which this node is a component (the adjacency matr ix of the V N and the Usts of parents /chi ldren of the V N defined in Section 4.6.1). Each update operation takes two steps to complete: (i) R L detection and V N construction, and (n) pathname update relay (and D L detection). The first step changes the name graph and the second step propagates the effect of the update to descendants. Mul t ip l e name graph managers cooperatively take part in an update operation to change the state of the name graph (parts 1-3 Usted above) based on their local knowledge about the current state of name graph (parts 2 and 3 Usted above). Loca l knowledge about the name graph state may be obsolete, however, as changes to the name graph by an update take unpredictable amount of time to propagate. Inconsistent result could be produced if an update uses obsolete state information. A n update op2 affects another update opi i f the concurrent execution of op2 obsolètes the local knowledge of the name graph state used by opi. Updates are independent i f they do not affect each other when executed concurrently. Examples A join(x,y) operation uses P ^ , the pathname set at the originator x, to check for new R L and obtain its working set. If Px has not been properly updated to reflect the fact that other concurrent updates have changed the y x path(s) , join(x,y) may make an erroneous decision on R L detection. Figure 6.1 shows some examples: j o i n ^2 ^ join X . , (a) Undetected RL j o i n j o i n (b) Incomplete RL j o i n leave (c) Phantom RL leave join (d) Incorrect RL Figure 6.1: Examples of interference to join operation. • A n undetected RL is an R L generated by a set of concurrent join operations, but is detected by none of them (Figure 6.1.a). • A n incomplete R L is an R L that does not include a l l the nodes in the same loop deep equal relation. Join(x,y) detects an incomplete R L if when it is executed, P^ does not reflect the fact that some concurrent join operation(s) has introduced new branch(es) into the y X path(s) (Figure 6.1.b). • A phantom R L is a falsely detected R L that no longer exists. • A n incorrect R L is an R L containing nodes that are not loop deep equal. A join(x,y) detects an incorrect R L if when it is executed, Px does not reflect the fact that some concurrent leave operation(s) has removed some branch(es) from the y ^ x path (Fig -ure 6.1.d). W h e n no y A a; pa th is left, a phantom R L is detected (Figure 6.1.c). Case (a) shows that join operations may affect each other even though their working sets do not overlap (their working sets are (j) in Figure 6.1.a). The I G V has to be reorganized when a shadow edge or a fade arc is deleted. Instead of using pathnames to determine the retained R L s , a leave reorganizes the I G V according to the topology information of the originating V N , which is rephcated at every component when the V N is formed. The reorganization of the I G V may be incorrect i f the knowledge of the originating V N topology information at the originator has not been properly updated to reflect the changes made to the V N by other concurrent updates. Unhke the join, a leave is not affected by any other update i f it deletes a normal arc, since it does not need any V N topology information for the I G V reorganization. Discussion The following theorems state the conditions under which updates affect one another. Theorem 6.1.1 Given two updates opi and op2, the concurrent execution of op2 affects opi i f and only if op2 changes the working set of op i . Proof : The if part has been shown by the above examples. We prove the only if part by contradiction as follows. Assume that the working set of opi(xi,yi) is not changed by 0 ^ 2 ( 2 : 2 , 2 / 2 ) - Then work ing arc < 2:2,2/2 > of op2 must have nothing to do with the loop deep equal relation created or modified by opi; that is, i f op\ is adjoin, adding or deleting < X 2 , t / 2 > does not generate, change or break any yi A xi path ; or i f opi is a leave, either X2 or 2/2 is not in the originating V N of opi. Hence, the I G V transformation i n opi is independent of op2. • According to the definition, op2 does not affect opi i f the changes made by op2 is known to the originator of opi when it is executed, because in that case their executions are sequential. L e m m a 6.1.1 A necessary condition for an update opi to affect another concurrent update op2{x,y) is for op\ to occur in the subgraph of y. Proof: Accord ing to the definition, the working set of the update op2{x,y) is completely embedded in the subgraph of y. If opi does not occur i n the subgraph of y, it cannot change this subgraph, nor is it possible for opx to change the y ^ x path(s) to affect 0 ^ 2 ( 2 ; , y). • T h e o r e m 6.1.2 G i v e n two updates opi{xx,yi) and op2(x2,y2)-, their concurrent executions affect each other if and only if either their working sets overlap or they are join operations occurring in the subgraphs of each other. Proof: This theorems states how updates affect each other (as defined by Theorem 6.1.1). It is proven in two cases: case 1. Accord ing to the definition, the working set of an update contains aU the nodes which are going to have a loop deep equal relation after the update, or which are in the loop deep equal relat ion to be changed by the update. Updates having overlapping working sets wi l l affect each other if executed concurrently, because they may change the same loop deep equal relation without knowing the changes made by the other. On the other hand, if the working set of the two updates do not overlap and one of the updates is leave, interference cannot occur. Because the working set of the leave is outside the working set of the other update, deleting the working arc of the leave does not affect the loop deep equal relation being created or modified by the other concurrent update. The other update does not affect the leave either since it does not change the working set of the leave and the correctness of the leave does not depend on the pathname set at its originator. case 2. The only case left to be examined is when the working sets do not overlap and the two updates are j o i n operations. If they occur in the subgraph of each other as in Figure 6.1.(a) (where we have paths yi A X2 and j/2 —* ari), they affect each other by creating an undetected R L . O n the other hand, if one of the join operations is not in the subgraph of the other, according to L e m m a 6.1.1, it is impossible for it to change the working set of the other, hence there wi l l be no interference. • Concurrency C o n t r o l Policies Mainta in ing a name graph is similar to maintaining a distr ibuted database. A n execution of a set of updates on a given in i t ia l name graph is serializable i f the resulting name graph is produced as i f the updates were executed on the in i t ia l name graph in some sequential order. Seriahzable execution results are consistent because update executions are commutable — the final resulting name graph is independent of the execution order of the updates. The goals of concurrency control in distributed systems are to schedule the concurrent operations to produce a seriahzable result, and to maximize the paraUehsm among independent operations to increase efficiency. Different pohcies of concurrency control on the name graph updates and their imphcations are briefiy outhned below: • No concurrency control at all: Because of the potential interferences between concurrent updates, name resolutions may observe the intermediate states of updates and the re-sulting I G V may be inconsistent. This is not acceptable in general unless updates occur infrequently and the inconsistency between the I G V and the E G V can be detected and corrected during name resolutions. • Concurrency control among name graph updates only: A l t h o u g h name graph updates are serialized and the name graph is correct after each update, name resolutions may stiU observe some transit states of updates. This is acceptable only i f apphcations do not care about temporary inconsistency or can detect and recover from the temporary inconsistency by themselves. • Concurrency control on all accesses to the name graph: A l t h o u g h name resolutions are not commutable w i t h respect to name graph updates, it is acceptable to some apphcations if the result of a name resolution is produced on the basis of a consistent snapshot of the name graph. A snapshot is a state of the name graph at a part i cular point in time. It is consistent i f it shows the result of a seriahzable execution of completed updates. • Ordered group communication: Not only are name graph updates seriahzable, but name resolutions are also carried out i n a consistent order, that is, i f a message is received before another message at a member process, this order is preserved at aU the other members in the same group. Ordered group communication guarantees that members in the same group receive messages i n the same order, and thus are synchronized, A B C A S T and G B C A S T in the Isis system [Birman and Joseph 87a] are examples. The degree that concurrency control should be supported in a group communication system depends on the intended apphcations. Once this is determined, a variety of techniques may be used to implement the pohcy. 6.2 Update Ordering Protocol In this section, an update ordering protocol for concurrency control is described. This protocol is similar to the Isis G B C A S T [Birman and Joseph 87a]. It schedules an execution order among concurrent updates that may affect each other before their executions start. To support the protocol , every name graph manager maintains: • A priority counter. A priority is defined as a tuple < t,s >, where t is the value of the counter and s is the gid of the node. Pr io r i ty < ,^ 5 > is lower than priority < t'. s' > if i > t', OT t = t' and s > s'. The larger the priority value, the lower the priority is. The prior i ty counter value is always adjusted to be larger than the priority value of any message that the node has ever sent or received. The value of a priority is globaUy unique because the gid is globally unique. • A message queue. Messages in the queue are sorted according to their priorities. Message mi is ordered before (closer to the queue head) message m2 i f m i has a smaller priority value than m,2. Messages in the queue are marked deliverable or undeliverable. When an update request is received from a user, it is assigned a globally unique transaction id by the originator. Th is id is carried by aU the messages exchanged among the nodes partic-ipat ing in the update transact ion. Messages correspond to the same update transaction if they carry the same transact ion i d . 6.2.1 D e s c r i p t i o n The name graph update ordering protocol consists of two rounds: R o u n d One — O r d e r Determination The operation originator determines the order of an update op(x,y) by sending an order request to the subgraph of y, and every node in this subgraph votes a priority on the basis of its local priority counter. The originator takes the highest priority vote as the final priority for the update request. Because in a name graph, a node generaUy does not know the membership of others, this vot ing round has to be conducted by relaying messages hop-by-hop through the subgraph of y. Specifically: 1.1 W h e n an update request is received from a user, it is assigned a priority according to the value of the pr ior i ty counter, marked undeliverable, and appended at the end of the message queue at the originator. The originator then prepares an order request which contains the transact ion id and the type (join or leave) of the update, the ids of the two nodes connected by the working arc (i.e., x and y), and the in i t ia l priority proposed by the originator. The order request is propagated hop-by-hop in the subgraph of y. 1.2 Upon receiving an order request, a node votes a priority according to the value of its prior i ty counter. The order request is marked undeliverable and is appended at the end of the message queue at the receiving node. 1.3 If the receiving node is a leaf node, it returns its priority vote to the parent from which the order request was received. Otherwise, the receiving node relays the order request wi th its priority vote to its immediate children and awaits their votes. After aU immediate children have responded, the node changes its pr ior i ty vote to be the highest vote returned from its subgraph and the message queue is re-sorted. This new prior ity vote is then returned in an acknowledgment to the parent from which the order request was received.^ 1.4 The order determination round terminates when the priority vote collection terminates at the update originator. A t this t ime, the original update request is assigned the Max{prior it ies returned from the subgraph of y] and marked deliverable. The message queue at the originator is re-sorted. Update requests and order requests are control messages. They are exchanged between name graph managers only. In the foUowing, messages exchanged during the second round are called transaction instructions. Transaction instructions carry the transaction prior i ty determined during the first round. R o u n d T w o — Update Execut ion Once the prior i ty of an update request is determined at its originator, its execution order with respect to the other updates is determined. Its execution cannot start , however, unti l ' I f there exists a. y — x p a t h , the o r i g i n a t o r i m a y receive the order request for op(x,y) d u r i n g the order d e t e r m i n a t i o n r o u n d . In that case, x p e r f o r m s steps 1.2 a n d 1.3, but does not p u t the o r d e r request into its message queue since the u p d a t e request is a l r e a d y there (as d e t e r m i n e d by the t r a n s a c t i o n i d ) . updates ordered before it are completed; that is, u n t i l the request reaches the head of the message queue at the originator. Specifically: 2.1 W h e n a deliverable update request op(x,y) reaches the head of the message queue at the originator,^ the update is aUowed to start . R L detection and I G V reorganization can be performed based on the name graph topology information saved at the originator. The update is performed as if it were invoked in a sequential execution without concern of interference from other concurrent updates. 2.2 W h e n an order request reaches the head of the message queue, it is not processed imme-diately. A s a result, it blocks the queue and messages ordered after it in the queue are prevented from being processed. W h e n a transact ion instruction is received, the order request corresponding to the same update is removed. The received instruction is marked deliverable and is inserted into the message queue according to its priority. It wiU be processed when it reaches the head of the queue. 6.2.2 D i s c u s s i o n W h e n groups are not nested, the above protocol works exactly as the Isis G B C A S T , which has been shown to work correctly. W h e n groups are nested, it is important for all nodes in the subgraph of an update to receive the corresponding order request before executing the update (refer to Section 6.2.3). Since an update op(x,y) consists of two rounds and the execution round is delayed, problems arise when other updates changes the subgraph of y between the two rounds. We discuss these problems and propose solutions to them in this section. Their correctness is argued in Section 6.2.3. Since we only have two types of update {join and leave) and an update can only occur either inside a V N or outside any V N , we have the foUowing cases to cover in the foUowing discussions: 1. a join operation adds a normal arc; ^ N o t e t h a t o n l y the o p e r a t i o n or ig inator p u t s the u p d a t e request i n t o its message queue, o t h e r nodes in the o p e r a t i o n s u b g r a p h o n l y get an order request d u r i n g the o r d e r d e t e r m i n a t i o n r o u n d . 2. a join operation creates a new V N ; 3. a leave operation deletes a normal arc; and 4. a leave operation breaks a V N . A d d i n g a N o r m a l A r c A join(u,v) scheduled to be executed between the order determination round and the actual execution of op(x,y) brings the subgraph of v into the subgraph of y i f M is in the subgraph of y. W h e n this occurs, some nodes in the subgraph of v may not have learned the ordering of op(x,y) (since they were not reachable from y during the order determination round) , therefore, they could have started some concurrent updates that may affect op(x,y). To avoid this s i tuat ion, the fol lowing is included in the execution round of a join when a normal arc is added: 2.3 Before a join(u,v) that adds a normal arc < u,v > terminates, the message queue at u is merged into the queues at a l l the nodes in the subgraph of v. This can be achieved by piggybacking the message queue at u onto the pathname update relay requests in the execution oi join(u,v). W h e n join(u,v) terminates at a node, the node not only learns of its reachabiUty from u, but also receives every message that its ancestors ( including u and the ancestors of u) have received and ordered after the join(u,v) update. Before a message queue is sent out for other nodes to merge, each update request in the queue is replaced by an order request carrying the same transaction id and priority.^ W h e n an order request m is to be merged into the message queue at a node, i f the queue contains a message carrying the same transaction i d as m does, m is ignored. Otherwise, m is inserted into the queue according to its prior i ty to make the ordering of m known to that node. ^ T h i s ensures t h a t on ly the o p e r a t i o n o r i g i n a t o r has the u p d a t e request in its queue . D e s c e n d a n t s of the o r i g i n a t o r o n l y get a n order request . M e r g i n g Miss ing Messages during Leave Pathname Update Consider a node k in the subgraph of a leave. A s discussed in Section 4.7.2, if k is the sink of a D L and if the leave breaks the pr imary segment of the D L , a new primary segment wiU be selected. It is possible, however, that some messages received by the nodes in the new pr imary segment may not have been received by node k (because of a longer propagation delay in the old pr imary segment). These messages are denoted as missm^f messages. When the pr imary segment of the DL{s, k) is broken by a leave, the foUowing step is included in the leave pathname update executed at sink k to capture the missing messages: 2.4.1 .Assume that node A; has selected one of its secondary parents in DL(s.k), say as the new pr imary parent. It contacts z to delete the R D control record for DL(s.k) and ask for z^s message queue. The returned message queue from z is merged into the message queue at k and is piggybacked onto the leave pathname update request that is relayed to the descendants of k, so that the descendants can also merge the missing messages in their message queues while relaying the leave pathname update. If the D L affected by the leave has a V N sink, step 2.4.1 is only executed by the component that has the smaUest gid among aU the components of the V N sink. D r o p p i n g Irrelevant Messages Resulted from Leave Update A message m in the message queue at a node n is irrelevant if there is no path from the originator of m to n in the name graph. A s Figure 6.2 shows, if a leave(u,v) is executed in the subgraph of y after the order determination round but before the actual execution of op(x,y), some nodes in the subgraph of v may no longer be part of the subgraph of y after leave(u.v) completes. These nodes, having received an order request for op(x.y) during the first round of op(x.y). wiU not receive any transaction instruction for op(x.y). The order request for op(x.y) becomes irrelevant at these nodes and blocks them from processing further updates. Pathname update requests can become irrelevant too. A l though such a message does not block the message queue, its execution produces erroneous pathnames of non-existing paths. X X y ^ - i - u V III Jlillk (A) delete a normal arc (B) break a V N Figure 6.2: Generation of irrelevant messages after deleting < u,v >. The foUowing step is included in the execution round of each update to drop irrelevant messages: 2.4.2 W h e n a message for op(x,y) reaches the head of the message queue, the node checks if node w, where w = y ii op = join or w = x ii op = leave, is an ancestor according to its current pathname set."* If not, this message is deleted from the message queue; otherwise, the node awaits the transaction instruct ion for op(x,y) as specified in step 2.2. Whi l e a node is wait ing , it has to perform this check after each leave pathname update unt i l either a transact ion instruct ion corresponding to the same update arrives or the message becomes irrelevant. F o r m i n g a N e w V N To ensure the consistency resulted from loop deep equal relations, nodes in a V N must start wi th the same message queue when the V N is formed and execute the same sequence of ^If n o d e w is a c o m p o n e n t of a V N , it is not visible in the p a t h n a m e sets at its descendants . In t h a t case, the d e s c e n d a n t needs to ask w for the v n i d of the V N i n w h i c h is a c o m p o n e n t . messages during the hfe time of the V N . To ensure the consistent in i t ia l message queue, the following steps are included in the execution round of a join that creates a new V N : 2.5.1 Whi le cohecting the local topology information from the nodes in the working set, their message queues are coUected and merged as weU by the originator. The result is the in i t ia l message queue. It is distr ibuted together wi th the topology information of the new V N to the components. 2.5.2 After submitt ing the local message queue, a component of the new V N stops propagating messages (if the message does not carry the id of the current update transaction) untU it finishes the pathname update relay for the join. Because during this period, the I G V connection between the nodes in the new V N is in a transit state. 2.5.3 The in i t ia l message queue at the components of the new V N is piggybacked onto the pathname update relay request for the join. The descendants of the new V N merge this piggybacked queue V N into their queues. Once the I G V spanning shadow tree for the new V N is constructed, aU its components wiU receive the same set of messages and order them consistently according to the protocol . Breaking a V N W h e n a leave breaks a V N , messages propagated half way through the V N may see the transit state of the I G V reorganization and their propagation may be incorrect. Solutions to this problem must ensure that V N modifications (in name graph updates) and message propagation through a V N (in name resolutions) are seriahzable. To achieve this seriaUzabiUty, the foUowing steps are included in the execution round of a leave that breaks a V N : 2.6.1 When an I G V reorganization request is received, a component of the originating V N stops propagating messages untU it finishes the pathname update relay for the leave. 2.6.2 The message queues at the components of the originating V N are returned when they confirm to the originator that they have completed the I G V reorganization for the leave. These queues are merged and the result is piggybacked onto the pathname update request for the leave. Nodes in the subgraph of the leave merge this piggybacked queue into their queues while performing the pathname update relay. Only relevant messages are merged. Because (i) the piggybacked message queue contains aU the messages received by the com-ponents in the or ig inat ing V N before the leave execution starts, (ii) pathname update relay of the leave is launched from the originator, and (iii) after the leave, aU the nodes in the originating V N are the descendants of the originator, any message received by the nodes in the originating V N is received by a l l the nodes in that V N and their descendants. Furthermore, nodes in the same retained V N wiD start w i t h the same message queue. 6.2.3 C o r r e c t n e s s In this section, we argue that the above update ordering protocol seriahzes concurrent update executions if they may affect each other. A node participates in an update if the node has to relay the pathname update resulted from the update. A message arrives at a node late i f the prior i ty of the message is higher than that of the message being processed by the node. L e m m a 6.2.1 The ordering protocol guarantees that message queue merges are never late. Proof: Dur ing the execution of an update, the messages to be merged are from the queues of the nodes in the working set. Because messages i n a queue are processed according to their priorities, messages to be merged always have priorities lower than that of the update being executed. Since in each update , part ic ipating nodes merge message queues before the update completes, the merged messages are never late. • L e m m a 6.2.2 The ordering protocol guarantees that a node receives a relevant order request for an update if it participates in the update. Proof: Consider an update op(x,y). A node n participates in op(x.y) if and only if n is in the subgraph of y. • If 71 is in the subgraph of y when the order determination round of op(x,yj is executed, n must have received the order request for op(x.y) since n has to vote for the priority. • Otherwise, n must be brought into the subgraph of by a join(u.v) ordered before op(x.y) and V is an ancestor of n. If join(u,v) adds a normal arc, u must be in the subgraph of y and knows the order request for op(x,y). Step 2.3 ensures that aU messages received by (/ are also received by the nodes in the subgraph of v. li join(u,v) generates a V N , one of the nodes in the working set of join(u,v) must be in the subgraph of y. Steps 2.5.1 — 2.5.3 ensure that messages received by the components of the new V N are also received by the descendants of the V N . Because of lemma 6.2.1, n wih receive the order request for op(x,y) in both cases before it completes join(u,v). • Step 2.4.1 ensures that messages received by the source of a D L are also received by the corresponding sink (and its descendants) even though the D L is changed by a leave during the order request propagation. Steps 2.6.1 and 2.6.2 ensure that messages received by the components of a V N are also received by the descendants of the V N even though the V N is changed by a leave while the order request is propagated half way though. • If a node is moved out from the subgraph of the update and the corresponding order request becomes irrelevant at the node, step 2.4.2 ensures that that order request wiU be dropped and message processing at the node is not blocked. • T h e o r e m 6.2.1 If two concurrent updates may affect each other, their executions can be seriahzed by using the update ordering protocol. P r o o f : According to Theorem 6.1.2, it is sufficient to show that this update ordering protocol seriahzes updates that have overlapping working sets, or join operations that occur in the subgraphs of each other. Given two concurrent updates opi{xi, y^) and op2{X2, y2)i where op-[,op2 G {join, leave}. and assume that opi has a higher priority than op2. If the two updates occur in the subgraphs « of each other, there exist the paths yi X2 and 2/2 ^i-^ T h a t means there is an order request for opi ordered before the op2 update request in the message queue at the originator of op2. Update op2 cannot start unt i l the order request for opi is removed; that is , unt i l the originator of op2 finishes the I G V reorganization and pathname update for opi. Therefore, when op2 starts execution, the changes made by opi become known to the originator of op2 (i.e., reflected in its pathnames). If the working sets of the two updates overlap, there exists a node z in the intersection such that yi ^ z X2 and y2 ^ z xi; that is, they occur i n the subgraphs of each other, and the above result holds. • If the originator of an update is not in the subgraph of another update, according to L e m m a 6.2.2, its message processing can never be blocked by the order request of the lat-ter, hence both updates can be executed in paraUel. The nodes part ic ipat ing in both updates execute the updates in a consistent order. 6.2.4 M e s s a g e C o m p l e x i t y The message overhead of the name graph update operations is increased when the ordering protocol is used. This overhead consists of two parts: the order determination overhead in round one and the execution overhead in round two. Given the same worst case assumptions as in Section 5.1, the update execution overhead remains unchanged since al l the information required in round two can be piggybacked. The addit ional overhead for concurrency control is the messages exchanged in the order determination round. It consists of propagating the order request and coUecting pr ior i ty votes in the subgraph of the update. A n order request propagation foDows the name resolution procedure, the arcs it traversed form a spanning tree in the subgraph of the message originator. In the worst case, every node in the name graph may have to receive the order request. Therefore, the total number of messages required in round one is upper bounded by 2(| A'^l - 1), where | i V | is the total number of nodes in the name graph. This part of overhead does not change the message complexity of update operations, I^f o p 2 = leave, yi X2 — y2-since the worst case message overhead in round two remains in the order of 0(|iV|^). 6.3 Resiliency Support Issues In this thesis, the terms site/link are used to refer to the machine/communicat ion hnk in a physical network and node/arc refer to the logical node/arc in a name graph. Arcs in the name graph are mapped onto paths in the network by the underlying rout ing algorithm. S i te /hnk failures in the network are reflected as node/arc failures in the name graph. We use the term remaining EGV to refer to the connections in the E G V among the active nodes after a failure. Node/arc failures need not be hidden from apphcations. Some apphcations, such as email distribution hsts and news propagations, do not take any action when a node/arc fails. Messages are simply lost or queued unt i l the failed node/arc recovers. Other apphcations that require continuous communication support , such as teleconferencing, require a reorganization of the I G V connections among active nodes to maintain name resolution consistency wi th respect to the remaining E G V . This I G V reorganization activity after a failure is caUed failure handling. A name graph management mechanism is resilient i f it has this auto-reorganization capabihty. Node /arc failures may affect the result of an ongoing update or a name resolution. A n update operation is resihent i f it produces a consistent I G V irrespective of the failures occurred during its execution. A name resolution protocol is resihent if it dehvers messages based on a consistent name graph snapshot and preserves the required semantics (such as message atomicity and ordering) irrespective of failures. Resihent name graph update protocol , failure handhng in the resihent name graph management, and resihent name resolution protocols are the subjects of the rest of this chapter. Failure M o d e Assumptions The foUowing assumptions are made: • A node x monitors another node y i f there is an arc between them in the E G V . or if X is expecting a message/reply from y. We assume the underlying system provides some monitor ing mechanism similar to the process alias [Malcolm and Vasudevan 84], watchdog [Ravindran and Chanson <S6b], timer [Bernstein et al 88] or the receive-specific primitive in the V system [Cheriton 88b]. • A node y detects that an adjacent arc < x.y > fails and declares that node x is unreachable from y when x crashes or when al l the physicai communication paths between x and y are broken (i.e., x and y are in different partit ions) . A failure is reported to apphcations. so that users may take certain actions to amend the E G V when the s i te /hnk failure recovers and to redo the failed name graph updates or name resolutions. • Messages from the same source are received in the same order as ihey were sent. Lower layer protocols are responsible for handhng lost and out of sequence iiiessages. • The update ordering protocol in Section 6.2 is used for concurrency control . • Node failures are fail-stop [Schhchting and Schneider 83]. Once a node crashes, it loses aU information about the name graph, including its adjacent arcs (parent and children lists), its pathname set, the topology information of the V N of which the node is a component, and its message queue. • Messages to unreachable nodes are dropped. Also nodes that failed and then recover while failure handhng is in progress are not aUowed to participate in the ongoing failure handhng activity. Lasers are told about the failure and are responsible to reconnect the recovered nodes with other active nodes. In distributed systems, a node generaUy cannot distinguish a network part i t ion from a site failure. To simpUfy failure handUng, the nodes that have become unreachable due to network partitions are modeled as if they have failed. This allows us to deal with the two different types of failures in a similar manner. A s a result, when network part it ion occurs, nodes in a partition may forget about nodes in the other part it ions. Basic A p p r o a c h to Resiliency Although failures and user init iated updates both remove arcs from the name graph, they are different in a number of ways: • A failure may simultaneously delete more than one node/arc in the E G V . For example, a node failure removes a l l arcs incident at the failed node since nodes are fail-stop. Mul t ip le nodes may be cut off from the name graph by a network part i t i on . • Failures are asynchronous. The changes to the E G V resulted from a failure take effect as soon as the failure occurs. Failure handhng cannot be delayed or name graph updates and name resolutions wiU be affected. O n the contrary, user init iated updates are delayed (in the update ordering protocol) in order to be synchronized. Furthermore, note that : 1. Failure handhng activities and user initiated updates are commutable. 2. A sequence of consecutive failure handhng activities are combine-able. Given an in i t ia l name graph, deleting failed nodes/arcs one after another i n a sequence of consecutive failure handhng activities and deleting the same set of nodes/arcs aU together in a single failure handUng act ivity should produce the same resulting name graph. The basic approach to resiUency is simple. To produce a seriaUzable result, a two-phase commitment protocol is integrated into the execution round of each user initiated update that creates/modifies a V N . A n update fails and is aborted if any node in its working set or any arc connecting nodes i n the working set fails before the update is completed. A failed update is reported to the user so that it can be re-invoked later.^ The two-phase commitment protocol is also used to guard each failure handhng activity that changes a V N . Failure handhng is aborted and restarted immediately if further failure occurs before its completion. ^ A l t e r n a t i v e l y , an a b o r t e d u p d a t e c a n be a u t o m a t i c a l l y r e - s c h e d u l e d by the s y s t e m to be executed after the failure h a n d l i n g . 6.4 Name Graph Update Protocol Resiliency Consider the update ordering protocol i n Section 6.2. Failures may occur in both the order determination round and the execution round of an update. 6.4.1 M a k i n g O r d e r i n g D e t e r m i n a t i o n Res i l ient To make the order determination round resihent, we have to ensure that irrespective of failures, all nodes in the subgraph of an update operation receive the order request for the update before the update execution starts. Th is assertion is required in L e m m a 6.2.2 to ensure the correctness of the update ordering protocol . A simple way of deahng wi th the failures that occur during the order determination round is to report to the originator of the update when a node expecting a pr ior i ty vote from its children detects that a child has failed. The originator may simply abort the update (by issuing an abort message to the subgraph) and tell the user to re-invoke the update later. Once the above assertion is estabhshed for an update, it is not affected by failures, as failures can only cut nodes away from the subgraph of the update. The order requests for the update at the nodes that are cut away are treated as irrelevant and dropped later (refer to Section 6.2.2). 6.4.2 M a k i n g E x e c u t i o n R e s i l i e n t The execution round of an update is viewed as an atomic transaction if a V N is to be created or modified. A two-phase commitment protocol is used to guard the transaction against failures.^ The originator of the update is the transaction coordinator, the other nodes in the to-be-created/modified V N are the part ic ipants . The following steps are performed in a join transaction: 1. The coordinator asks the participants for their local E G V topology information and their message queues. ' T h e t w o - p h a s e c o m m i t m e n t p r o t o c o l is not necessary w h e n an u p d a t e only a d d s / d e l e t e s a n o r m a l arc . S u c h an u p d a t e is unaf fected by fai lures s ince it does n o t r e q u i r e any I G V r e o r g a n i z a t i o n . 2. Part ic ipants send their local E G V topology information and their message queues to the coordinator. They also stop resolving messages and start to monitor the coordinator unt i l the update transact ion commits or aborts. 3. After collecting and computing the topology information of the new V N and merging the coUected message queues, the coordinator distributes the result to the participants. 4. Upon receiving the result from the coordinator, a participant saves the topology infor-mation of the new V N and the merged message queue, temporari ly sets the state of its adjacent arcs based on the spanning shadow tree construction, computes the pathname set of the new V N , and acknowledges a ready-to-commit to the coordinator. 5. Depending on the acknowledgments from the partic ipants, the coordinator issues a com-mit/abort instruct ion to the participants. 6. Part ic ipants commit or abort the join update transaction according to the instruction received from the coordinator. If abort , whatever result produced by the join is discarded. If commit , the node launches a pathname update relay to its immediate children connected by normal out-going arcs after committ ing the results obtained in step 4. The order request for the committed update is removed from the message queue also. Clearly, steps 4 - 6 constitute a two-phase commitment. SimUar steps are taken in a leave transaction: 1. The coordinator informs the participants of the deleted arc. 2. The partic ipants reorganize the I G V and send their message queues to the coordinator. 3. The coordinator merges the coUected message queues and starts pathname update relay wi th in the to-be-modified V N . The merged message queue is piggybacked. 4. A part ic ipant returns a ready-to-commit after the merged message queue is obtained and its pathname set is updated. Steps 5 and 6 are the same as in join. 6.4.3 D e a l i n g w i t h Fai lures d u r i n g U p d a t e E x e c u t i o n To show the correctness of the update protocol resihency, detecting and deahng wi th failures in the execution round are summarized in this section. After an update is aborted, a failure handhng act iv i ty (discussed in Section 6.5) is in i t iated (by the system) immediately. Participant failure between steps 2 and 4 This failure can be detected at step 5 when the coordinator expects acknowledgments from the part ic ipants . The coordinator then issues an abort instruction to the participants to cancel the update. A s a result, failure handhng is ordered before the update. Coordinator failure between steps 3 and 5 This failure can be detected at step 6 by a partic ipant expecting the final instruction from the coordinator. The participant then consults w i th other participants for the state of the up-date. If any reachable participant has committed /aborted the update, it also commits /aborts ; otherwise, the update is aborted.^ Effectively, i f the update is committed, the failure appears to the active components in the new/modif ied V N as if it had occurred after the update. If the update is aborted, the update request appears as if it were never issued because it does not make sense to add/delete an arc connecting a failed node. In both cases, the resulting name graph is consistent since updates and failure handhng activities are commutable. Participant failure after step 4 This failure cannot be detected by the two-phase commitment protocol. The effect of such a failure, however, is equivalent to the case when the participant fails immediately after completing step 6. Being totally unaware of such a failure, active participants follow the ' i n a join, the p a r t i c i p a n t s c a n be f o u n d f r o m the c o m p o n e n t list of the new V N o b t a i n e d at step 4. In a leave, they c a n be f o u n d f r o m the c o m p o n e n t list of the t o - b e - m o d i f i e d V N . instruction from the coordinator to commit /abor t the transaction. This failure wi l l be detected by the node monitor ing mechanism and wil l be dealt wi th after the update completes. Discussion A coordinator or participant failure before the completion of step 3 can be detected at step 4 (by the participants) or step 5 (by the coordinator) . Such a failure results in the update transaction being aborted unilaterally. .A. network part i t ion may leave an update being committed in one part i t ion but aborted in another, since the two-phase commitment protocol does not block uncertain participants when the coordinator fails. This result is not harmful , however, since the name graph is consistent wi th in each par t i t i on . The system initiates a failure handhng act iv i ty after an update is aborted. U n t i l the path-name update relay resulted from the failure handhng is completed, nodes in the subgraph of the aborted update cannot participate in any other update or name resolution as they are blocked by the order request for the aborted update. Th is order request wi l l be considered as irrelevant and removed when the pathname update relay resulted from the failure handhng is performed at these nodes. 6.4.4 M e s s a g e C o m p l e x i t y In this section, we analyze the additional message overhead resulted from the two-phase commitment . Consider an update that creates/modifies a V N of size k, where k is upper bounded by the size of the name graph |A'|. If no failure occurs, the addit ional message overhead resulted from the two-phase commitment are (fc - 1) ready-to-commit messages plus {k — 1) commit/abort messages.^ If a failure occurs during update e.xecution. participants have to exchange more messages to determine the fate of the affected update. In the worst case, the coordinator fails and each participant has to consult with every other participant before deciding to abort the update, resulting in {k - 2) x (k - l ) / 2 messages exchanged among the ^Messages e x c h a n g e d for reachabi l i ty m o n i t o r i n g are not c o u n t e d . remaining {k — 1) part ic ipants . More message overhead is required if further failures occur during the consultation. In summary, m a k i n g the update ordering protocol resihent requires an additional message overhead which is upper bounded by 0{\N\'^) per failure. The complexity of update operations remains 0(|A^P). 6.5 Name Graph Resiliency Failure handhng refers to the activity of re-organizing the I G V after each failure to preserve name resolution consistency. Depending on whether the failure changes any R L , different failure handhng actions are taken. 6.5.1 H a n d l i n g N o r m a l A r c Fai lures Consider the failure of a normal arc < x,y >. The name graph manager at y initiates an activity to delete the failed arc as soon as x is detected unreachable. This system initiated activity to handle a normal arc failure is caUed NA-failure-handling. Since only a normal arc < X,?/ > is removed, no reorganization in the I G V is required. Instead, a NA-failure-handling(x,y) request is relayed in the subgraph of y to inform the descendants to remove the pathnames that contain the unreachable node x. A NA-failure-handling cannot be affected by any other updates or concurrent NA-failure-handling activities since it does not reorganize the I G V . O n the other hand, NA-failure-handling activities and user in i t iated updates are always seriahzable, because a user init iated update is aborted if the two-phase commitment protocol in the update execution round detects that a node in the working set becomes unreachable. Due to these reasons, a NA-failure-handling(x,y) need not be safe guarded by a two-phase commitment protocol or be scheduled by the update ordering protocol . It can be init iated as soon as the failure is detected since only one pass of message propagation in the subgraph of y is performed. A NA-failure-handling(x. y) request carries the same k ind of information as that in a leavefx, y) pathname update request, and a node in the subgraph of y does the same thing as if a leavefx, y) pathname update request were received. The node also removes the order requests for the updates aborted due to the failure (refer to Section 6.4.3). H a n d l i n g Node Failures A node failure causes al l the arcs incident at the node to fail simultaneously. Handhng a node failure is equivalent to handhng the concurrent failures of the arcs incident at the failed node. Each normal arc incident from the failed node can be deleted by a separate NA-failure-handling ( initiated at the immediate child). The foUowing VN-failure-handling can be invoked to delete shadow edges or fade arcs since loop deep equal relations in the name graph have changed. 6.5.2 H a n d l i n g V N P a r t i a l F a i l u r e s A V N part ial failure occurs if some (but not aU) components of the V N become unreachable. The activity of handUng a V N part ia l failure is caUed VN-failure-handling. SimUar to a user init iated leave that breaks a V N , aU the active members of a broken V N in the same part i t ion take part in VN-failure-handling to reorganize their connections in the I G V to preserve name resolution consistency with respect to the remaining E G V . The differences are: • The coordinator and the participants of each VN-failure-handling have to be determined, because a V N part ia l failure may be detected by more than one node. • Possibly more than one arc wiU be deleted in a single I G V reorganization activity. • A VN-failure-handling act ivity does not need an order determination round. It is init iated (by the system) as soon as a V N part ia l failure is detected. • A pathname update relay has to be launched from every node whose immediate parent (in the E G V ) becomes unreachable due to the failure. Determining the Coordinator and the Participants Members in a V N can be ranked by their gid (from high to low). This ranking is caDed the coordinator-rank (c -rank). W h e n a V N is part it ioned, the active member with the highest c-ranking in a part i t ion of the broken V N is chosen as the coordinator for the VN-failure-handling activity. The distr ibuted coordinator election algorithm in [Bernstein et al 88] (page 254) can be used for this purpose. Once a coordinator is elected, it has to determine the participants of the transaction because more than one node may become unreachable in a V N part ia l failure. The coordinator sends a probe to each component of the broken V N . The components that acknowledge the probe with their message queues are the participants of the VN-failure-handling transaction. Reorganizing the I G V and U p d a t i n g Pathnames After the partic ipants are determined, the coordinator deletes aU inactive members and the arcs incident at the inactive members from the adjacency m a t r i x of the broken V N . The resulting adjacency m a t r i x reflects the remaining E G V connections among the participants. The coordinator then computes the reachabihty among the part ic ipants , assigns a vnid to each retained V N , and merges the message queues coUected. It then sends to the participants a VN-failure-handling request which contains a Ust of inactive members, the member Ust and the vnid of each retained V N , as weU as the merged message queue. Upon receiving a VN-failure-handling request, a participant p runs the same procedure as in a leave operation, plus:^° • A U nodes in the inactive member Ust and arcs incident at these nodes are deleted from the adjacency m a t r i x of the broken V N . • If p is in a member Ust of a retained V N , it adopts the retained V N id assigned by the coordinator. ' ° I f p is w a i t i n g for a commit/abort mstruct'ion. it continues to wait u n t i l the u p d a t e is c o m m i t t e d / a b o r t e d . X o p a t h n a m e u p d a t e relay needs to be p e r f o r m e d even t h o u g h the u p d a t e m a y be c o m m i t t e d , since the failure obsolètes the p a t h n a m e s g e n e r a t e d by that u p d a t e . • Pathname representing paths to nodes in the inactive member hst or to the participants from which p is no longer reachable in the remaining E G V are deleted. Each remaining pathname that represents a path to another participant q is appended with the q — p path(s) if p and q are not loop deep equal and if any q ^ p path exists in the remaining E G V . Discussion A VN-failure-handling cannot be affected by any other concurrent updates or failure han-dhng activities because they are safe guarded by a two-phase commitment protocol. A user init iated update is aborted if it may be affected by any failure, its effect to the name graph is thereby canceled. A NA-failure-handling deals wi th a failure outside any V N , therefore cannot affect a VN-failure-handling. Concurrent VN-failure-handling activities never interfere with each other since they reorganize the I G V connections of non-overlapping node sets. After a failure, a message in the message queue at a participant (or a descendant of a participant) may become irrelevant, because the failure may cut the node off from the subgraph of the message originator. Af ter pathname update relay, a node may use the algorithms in Section 6.2.2 to ehminate irrelevant messages in its queue. 6.6 Name Resolution Resiliency A name resolution is consistent if it is performed on a consistent snapshot of the name graph irrespective of concurrent updates and failures. A snapshot is consistent if it is the re-sult of a seriahzable execution of name graph updates and failure handhng activities. A name resolution protocol is resihent if the semantics of name resolution operation can be preserved ir-respective of concurrent changes to the name graph. In group communications, name resolution semantics has two aspects: delivery order and delivery coverage [Garc ia -Mohna and Kogan 88. Garc ia -Mohna and Spauster 91]. DeUvery order is specified by some t iming requirements in deUvering messages; deUvery coverage is defined by the ratio of the number of group members that received the message to the size of the receiving group. The update protocol in Section 6.4 and the failure handhng protocol in Section 6.5 guarantee that the name graph wi l l be consistent eventually. A l though transit states of the name graph may be used when no resihency support is provided by the name resolution protocol , inconsistent results are only temporarily. A name resolution protocol with no resihency support is simple and cheap as it does not synchronize receivers for message dehvery. Some apphcations may choose to hve w i t h it i f they can tolerate temporary inconsistencies. 6.6.1 N a m e R e s o l u t i o n C o n s i s t e n c y Besides resolving messages on a consistent name graph snapshot, further ordering require-ments may be specified to meet the requirements from various apphcations. For example: • Members in the same group have to receive messages in a consistent order. This requires name resolutions to be seriahzable among themselves [Birman and Joseph 87a]. • The message dehvery order must be consistent wi th respect to some causal relation defined by apphcations [Birman and Joseph 87b]. Supporting name resolution consistency (and other ordering requirements) requires not only a seriahzable execution of updates and failure handhng activities (to produce consistent name graph snapshots), but also a seriahzable execution of name resolutions wi th respect to changes to the name graph. A synchronization mechanism has to be built into the name resolution protocol to ensure that a node does not deliver a message to the application unless it is assured that all the active receiving members will deliver the message in a consistent order. Assuming the update ordering protocol in Section 6.2 is used, two protocols that support name resolution consistency are outhned below. G r o u p Datagram Service In the internet environment, group datagram service is an approximation to the multicast service such as that provided by the V system. When resolving a message, a node checks if its message queue is empty. If this is the case, the message is resolved; otherwise, the message is dropped. Because the queue only holds update requests and order requests, an empty queue indicates that this node is not involved in any update and its adjacent arcs are part of a consistent name graph snapshot. This approximation is reasonable when the name graph does not change frequently. The probabihty that a message is not dehvered to an active member of the receiving group depends not only on the probabihty of machine and communication hnk failures or network congestion, but also on the frequency of name graph updates. This approximation tends to drop more messages than necessary, because a message may be dropped even though the node is only involved in some pathname update relay. Ordered N a m e Resolution Protocol The same two-round ordering strategy in Section 6.2 can be used to order name resolutions. As a result, name resolutions are ordered with respect to name graph updates (so that messages are dehvered based on consistent name graph snapshots) as well as among themselves (so that messages from different sources to overlapping groups are dehvered in a consistent order). Discussion Group datagram service and ordered name resolution protocol ensure that name resolutions are ordered w i t h respect to name graph updates. A simple way to order name resolutions with respect to failures is to abort name resolutions affected by failures. User messages are discarded at the nodes (and their descendants) involved in failure handhng activities and the failure is reported. The apphcation is expected to deal with message loss (e.g., retransmit the message) and duphcations generated from retransmissions. 6.6.2 N a m e R e s o l u t i o n A t o m i c i t y .Atomicity support addresses the dehvery coverage aspect of name resolutions. W i t h atom-icity support, not only are name resolution results consistent, but message dehveries are also atomic with respect to failures: that is. a group message is received either by all the active members or by none of them in failure situations. B y "exaggerating" a part ia l failure (occurred dur ing a message transmission) as a to ta l failure, atomicity turns a group into a single entity and leaves apphcations wi th a simple failure mode to deal wi th . This is part icularly desirable i n distr ibuted database systems [Lampson 83a]. A n atomic name resolution protocol has to guarantee that a receiving node does not deliver a message to the application until it is assured that all the other active members will deliver the message. To reach a consistent decision on message dehvery when the originator of a message fails, the complete membership of the destination group is often required to allow active members in the receiving group to consult with each other. Reaching consistent decision across network partitions is difficult. When communication failure (such as network partit ion) is possible, every atomic commitment protocol may result in blocking [Bernstein et al 88]. Because our goal is to provide continuous communication support wi th in each part i t ion , we have to either restrict the failure mode (e.g., assuming network part i t i on never occurs) or compromise the definition of atomicity (e.g., only guarantee the consistency and atomicity within each part i t i on) . Users requiring consistent message dehvery across partit ions have to either use blocking protocols or bmld their own consistency assurance mechanism at the apphcation level. In that case, the goal and approaches of failure handhng and resihency support for name graph updates and name resolutions are different from those discussed in this chapter. These topics are out of the scope of this thesis. Even though it is assumed that communicat ion failures never occur, one has to assume that al l information at failed nodes are lost. This may not be reahstic because messages may have been dehvered to a human being. But blocking may occur without this assumption, since until the failed node comes back, the active nodes cannot teU if the failed node has dehvered the message. The ordered name resolution protocol in Section 6.6.1 does not provide atomicity because after a node learns the final priority of a message, it is stiU not sure whether other members in the receiving group have received the final priority a n d wiU dehver the message. When a failure occurs, a member cannot consult wi th the other members because the membership of the receiving group is unknown. .A. third phase can be added to the protocol in Section 6.6.1 to achieve atomic dehvery (with the above failure mode assumptions) as foUows: Phase O n e : T h e user message is propagated hop-by-hop to the nodes in the subgraph of the message originator and priority votes from these nodes are propagated hop-by-hop back to the or ig inator . Phase T w o : A f t e r coUecting priority votes from the descendants, the originator propagates the final pr ior i ty hop-by-hop in its subgraph. Upon receiving the final priority, a node reorders the user message in its message queue accordingly, marks the message to-be-committed and relays the final priority to its descendants. After coUecting ready-to-commit acknowledgments from aU immediate children, the node returns a ready-to-commit acknowledgment to the parent from which the final priority was received. Phase T h r e e : A f ter the originator has finished coUecting ready-to-commit acknowledgments from aU immediate chUdren, it knows that aU nodes in its subgraph have received the final pr ior i ty and are ready to commit to deUvery of the user message. The originator then propagates a commit instruction hop-by-hop in its subgraph. .A. node in the subgraph marks the user message deliverable when the commit instruction is received. .A deliverable user message is dehvered to the apphcation when it reaches the head of the message queue and remains relevant. Note that the t h i r d phase is necessary because the commit instruction announces that aU nodes in the destination group have confirmed reception of the finial priority of the message and are ready to deUver the message. The failure of a group member before returning a ready-to-commit acknowledgment can be detected by its immediate parent and can be reported to the originator. The originator then issues an aôor^ instruct ion in the third phase and reports the failure to the sender. The message is dropped by the descendants when the abort instruction is received. The failure of a node (including the originator) before completing the forwarding of the commit/abort instruct ion in the third phase can be detected by its immediate children expecting the instruct ion . Handhng such a failure requires knowledge of the complete membership of the message destination group. The member hst of the destination group could be gathered when the prior i ty votes were coUected in the first phase. T h a t is. together with its priority vote, a node also returns its member Ust. The returned member Ust includes the node itself and the union of the member Usts returned from its immediate children. When the final priority is determined at the originator, the complete membership of the destination group is also obtained. This member Ust may be piggybacked onto the final pr ior i ty message in the second phase, so that when a node sets the user message state to to-be-committed, the complete membership of the destination group of the message is also ready at the node. W i t h this Ust, the node is able to consult w i th other active members in the receiving group for the deUvery status of the message when a parent fails. If any active member in the receiving group has received a commit instruct ion from the originator, the other receiving members also commit and dehver that message. If the user message status at aU active members is to-be-committed, the message is dropped. The latter case occurs when the originator fails before the third phase starts. 6.7 Related Work Synchronizing group membership updates and group messages is not a new subject. The protocol in [Chang and Maxemchuk 84a] orders group messages through a single site in a broad-cast type network. The ordered group communication protocol in [Navaratnam et al 88] uses simUar concept. These protocols are not suitable for the internet because the internet is not broadcast oriented. A propagation protocol is proposed in [Garc ia -Mohna and Spauster 91] for ordering group messages in the internet environment. Groups are structured into a tree caUed a propagation graph. There is a pr imary site for each group and a unique path from the primary site to each member. Messages to a group are submitted to the pr imary site and are routed to other members through intermediate sites located in the group intersections. Messages are ordered along the way by merging messages to different groups at the sites in the group intersections. A l l messages eventuaUy end up at their destinations already ordered. It is the requirements of the propagation graph that hmits the apphcabihty of this propagation protocol to the nested group model. The propagation path of a message to a nested group is determined by the topology of the name graph, which may have an arbitrary topology. A lso , messages to nodes in a V N are not necessarily routed through the same path. Therefore, messages cannot be ordered consistently by simply adopting the propagation protocol. Furthermore, without complete knowledge of group membership, it is very difficult to construct a propagation graph. In Isis [Birman and Joseph 87a], a set of protocols for synchronized group communications were proposed. This protocol suite was designed for non-hierarchical groups. We have adopted the basic idea of the Isis protocol in the update ordering protocol and the atomic name reso-lution protocol . But several aspects of the nested group model make our extension non-trivial : • In ISIS , groups are structured as one-level trees and the membership of the group is known to every member. In the internet environment, the structure of a nested group can be arbitrary and the complete membership of a group are usuaUy not known even to group members. A s a result, messages to nested groups must take a set of multi-hop resolution paths. • Because of the multi -hop communication topology, dynamic changes to the name graph compUcates the synchronization in seriahzing updates and name resolutions. Two-phase commitment protocols and three-phase nonblocking protocols have been weU studied in the Uterature. Ideas, arguments and techniques from [Bernstein et al 88] are appUed to our problem setting in this chapter. 6.8 Chapter Summary This chapter deals wi th the concurrency control and resihency aspects in name graph up-dates and name resolutions. Concurrency control and update/message ordering are achieved by a two-round pr ior i ty vot ing; failure handling and update resihency are achieved by using a two-phase commitment protocol . Updates affect each other if their working sets overlap, or if they are join operations that occur in each other's subgraph directly or indirectly. In the update ordering protocol described in this chapter, updates w i t h overlapping subgraphs are ordered before their execution start. Updates are aUowed to be executed in parallel as long as they do not affect each other. It is the arbi trary topology of name graph and the incomplete group membership information that make synchronization among group members more comphcated than existing group communication protocols. The goal of resLUency support is to provide correct group communication continuously among active nodes based on the remaining E G V . A two-phase commitment protocol is embedded in the execution round of updates (or failure handUng) that change the loop deep equal relations in the name graph . A n update is aborted when it is affected by failures during the execution. This is acceptable because updates and failure handUng activities are commutable. Different degrees of resiUency support in name resolutions are briefly outUned in this chap-ter. It is argued that w i t h the update protocol and faUure handUng protocol discussed earUer, name resolution inconsistency is only temporary even if the name resolution protocol does not provide any resihency support . More synchronization overhead is involved when a higher degree of consistency and resiUency is required. Due to the lack of complete membership information, atomic group communicat ion in the internet environment is difficult and expensive (requiring a three-phase protocol ) . Various assumptions concerning failure modes and membership infor-mat ion have to be made i n designing atomic group communication protocols. W h e n a network part i t ion is possible and consistent deUvery across part it ion is required, group communication may be blocked. Our goal was not to provide a reUable name graph update protocol (or a reUable name resolution protocol) that always successfuUy carries out user requests irrespective of failures. Instead, operation failures are reported to appUcations so that users can take the appropriate actions. Chapter 7 Conclusions and Future Research We conclude the thesis by summariz ing the major results and hsting some future research. 7.1 Summary of Results This thesis has two parts. The first part provides a comprehensive classification of process groups; the second part focuses on group name management in the internet environment. Classifications of Process Groups To design distributed systems that support process group and group communicat ion, it is important to understand how these mechanisms are used and what are the communication requirements expected from the apphcations. For the purpose of classification, a group is defined as a set of cooperative processes that mainta in objects to provide a service. Based on the homogeneity of the internal structure of group members (i.e., the state of the objects that they maintain and the operations that they perform on the objects), groups are classified into four categories: data and operation homogeneous, operation homogeneous only, data homogeneous only and heterogeneous. Based on their external behavior, groups are classified into deterministic and nondetermin-istic categories. The main difference between the two is the degree of grouping transparency in communication, reply handling, naming, and failure handhng. Deterministic groups hide their internal structure (e.g., membership) from apphcations as much as possible and do most of the coordination work (e.g., membership change, atomic /ordered group message dehvery) at the system level, so that users are provided with a simple interface to groups. Nondeterministic groups provide users wi th the flexibihty of not having to pay the overhead of intrinsic coor-dination among group members when it is not need, so that interacting with groups is more efficient (but may also be more comphcated). The major concerns in deciding which one is more suitable for a certain apphcation are transparency, efficiency and flexibihty. Internet G r o u p N a m i n g Models In the internet , subnets/subdomains are connected by internet hnks that have relatively low bandwidth and long delay. The nested group model is proposed to reduce the traffic on internet hnks and to mainta in autonomy in subnets/subdomains. The nested group model allows group members in a subnet /subdomain to form a group. It also aUows a group to be included in another group as a member. Organizing internet groups in this structured manner hides membership changes within a subnet /subdomain from observation outside the subnet / subdomain . A lso , only one message across an internet hnk needs to be sent to aU members of a subgroup. Two formal models for the nested naming structure are developed: name graph model and name grammar model . W i t h the name graph model , problems such as name resolution loops (RLs ) and duphcation loops (DLs ) are recognized, and loop deep equal relationship is identified as an intrinsic characteristic of R L s . Correctness criteria for the algorithms solving these problems are formulated in Sections 3.2.3 and 3.2.4. Various approaches to handhng R L s and D L s are surveyed in Section 3.3. It turns out that none of the existing approaches meets the correctness cr i ter ia . Static A p p r o a c h to Handl ing R L s and D L s A static a lgor i thm maintains a system view ( I G V ) of the name graph and imposes control structure in the I G V to avoid the R L and D L erFects i in a user transparent uianneri . The topology of the I G V may be different from that of the users' view ( E G V 1 . Update operations must preserve name resolution consistency between the 1G\' and the E G \ ' . The v i r tua l node ( V N ) concept characterizes the loop deep equal relation among nodes in an R L . It defines the amount of global knowledge of the name graph that a node needs to maintain for static name graph update operations to function correctly. Nodes within a \'.\ are connected by a spanning shadow tree in the I G V . .A. new name resolution procedure is designed to guarantee that a message arr iv ing at a node in a shadow tree wiU be resolved to every other tree node at most once so that the R L effect is removed. D L s are controlled by assisning R D control record to the secondary parents of the D L sink. .\ message is not resolved to a child if the arc leading to the child is R D controlled and if the message is routed through the D L source. To detect R L s and D L s . a pathname representation is proposed for each node to record the topology of its ancestors in the derived name graph. The shadow tree algor i thm not only provides a correct solution handhng RLs and D L s , but also reduces the run-t ime overhead in name resolutions. The worst case message overhead of update operations in the shadow tree algorithm is bounded by (9(|.V|^) where |.V| is the total number of nodes in the name graph. .A prototype of the shadow tree algorithm has been implemented as an existence proof of the algorithm. To complete the model, concurrency control and failure handhng protocols for static name graph updates are designed. Name graph updates that share nodes in their subgraphs are ordered using a two-phase update ordering protocol before they start to e.xecute. Synchronized name graph update operations are safe guarded against asynchronous failures by using the weU-known two-phase commitment protocol in their e.xecutions. so that an update is either completed correctly or aborted when failure occurs during execution. Various name resolution protocols supporting ordered group communication, consistent group datagram, and atomic sroup communication are briefly outhned. .As a result, the communication, namins: and failure handhng transparency requirements for deterministic groups are met (refer to Section 2 . 3 . 1 ) . ' T h e t r a n s p a r e n c y requirements m reply hand l ing and real - t ime were not considered in this exercise. Conclusions The foUowing conclusions can be drawn from this thesis: • Process groups can be classified based on their internal structure. They can also be clas-sified based on their external behavior depending on the degree of grouping transparency expected from their intended apphcations. • Nested group model can be used to reduce group communication traffic on internet hnks and support subnet / subdomain autonomy. • R L and D L effects can be controUed using distributed static approach. The shadow tree algorithm is an example. • The shadow tree a lgor i thm completely removes R L and D L effects. It provides name resolution consistency and is transparent to users. • The communication complexity of the update operations in the shadow tree algorithm has an upper bound and the average case complexity can be expected to be much better. The algorithm can be implemented with a reasonable effort and is robust as demonstrated by the prototype experiments. • The design of concurrency control and failure handUng protocols shows that although atomicity and ordering are achievable in internet group communications, hiding member-ship of subgroups makes them difficult. The protocol can be complex and its overhead can be high. Trade-off has to be made in transparency (autonomy) , reUabiUty and efficiency based on the intended appUcations. 7.2 Limitations .\ few assumptions made in this thesis hmit the appUcabiUty of the static algorithms: • The static approach assumes that updates occur much less frequently than name resolu-tions and trades higher update overhead for lower name resolution overhead to obtain a optter overaU group communication performance. W'lien aroup [ueinoersnips are ;:ioiily ' iynamic in some applications, this assumption no longer valid anu tlie performance of the static approach may deteriorate. The reason is that a static update i)locks name res-olutions while an update is performed (otherwise, the result may be inconsistent . .\lso a number of messages are e.xchanged durins an update, increasing the network trarfic. • The shadow tree algor i thm assumes that users do not care about the order that messages are resolved at the nodes within a destination oroup. Certa in apphcations structure eroups according to some administrat ion hierarchy and require messages alwavs i)e passea from the fop-level down. The shadow tree algoritnni is not suitable in this case. Even though the assumption of low bandwidth internet link wi l l no longer be true when iiigh-speed optical hnks become widely available, the nested group model for internet aroups remains valuable because (i) it reduces the traffic in internet data highway and supports sub-net /subdomain autonomy, and (ii) it also reduces the group message distribution time since only a single message is sent to members in a subgroup. Furthermore, compared to the aynamic approach, the static approach reduces the overhead of name resolution, hence, is preferred in :iigh-speed networks where node processing speed is the bottleneck. 7.3 Future Research There are a number of areas that future research may lead to useful results: • .A. systematic study focused on the imphcations and apphcations of the name srammar model proposed in .Appendix A may lead to some new insight to the naming prooiem. • .An optimistic approach to concurrency control and failure handhng is to start execution '![ an update when it is invoked. Before coniniittina' the update, a certification procedure conducted to verify that the update i» noi affected l)y other concurrent upca'ps or :aiiures. The update is aborted if the certirication tails. This approacii prelerrpu than 'i'.e conservative approacn in Chapter b) if concurrent updates and failures do nu: cjccur frequently, since the overhead involved in each update may be made smaU when it is not affected. The design and analysis of optimistic name graph update operations is an interesting future research direction. • The name resolution protocols for atomic group communication, ordered group commu-nication and group datagram in this thesis either require a large amount of overhead in message dehveries or provide poor dehvery coverage when updates or failures occur. It is desirable to design new name resolution protocols that support the required communica-tion requirements (atomicity and ordering) with as smaU an overhead as possible. • We have assumed that update failures and message transmission failures are reported to users and handled at the apphcation layer. It wiU be interesting to e.xamine how communication failures are actuaUy handled in various apphcations, and try to build some mechanism at the system level to ease this task. Furthermore, to support apphcations that always require consistent message dehvery (such as distributed database transactions), group message transmission needs to be blocked when network part i t ion occurs. To support this type of apphcations, recovery algorithms need to be designed so that the E G V and the I G V can be restored after the failed components have recovered and the blocked messages can be dehvered consistently. • To improve the overaU system performance, the static approach trades the expense of less frequent update operations for a cheaper but much more frequent name resolution operation. For apphcations where group membership is more dynamic , it is desirable to have some quantitative methods to determine to what extent the above trade-off is worthwhile. For highly dynamic groups, a new group management approach with smaller update latency and less message overhead wiU be needed. • In the case that users do care about the order of name resolutions at the group members, the name resolution procedure needs to be redesigned to take this into consideration while preserving name resolution consistency. Bibliography [Aguilar 84] L . A g u i l a r , "Datagram routing for internet mult icast" . Proc. ACMSIGCOMM'84, Ju ly 1984, pp 58-63. [Ahamad and Bernstein 85a] M . A h a m a d and A . Bernstein, " A n application of name based addressing to low level distributed a lgor i thms" , IEEE Trans. Software Engineering, Vo l . S E - U , N o . 1, J a n . 1985, pp 59-67. [Ahamad and Bernstein 85b] M . A h a m a d and A . J . Bernstein, "Mult icast communication in U N I X 4.2 B S D " , Proc. IEEE Fifth International Conference on Distributed Computing Systems. M a y 1985, pp 80-87. [Ahamad et ai 88] M . A h a m a d , M . H . A m m a r , J . Bernabéu-Aubân and M . Y . K h a h d i , "Us ing mult icast communication to locate resources in a L A N - b a s e d distributed system"', Proc. IEEE Thirteenth Conference on Local Computer Networks. Oct. 1988. pp 193-202. [Aho et a l 74] A . A h o , J . Hopcroft and J . UUman, "The design and analysis of computer algo-rithms" ' . Addison-Wesley, Readings, M A 1974. [.A.mano 87] H . A m a n o , " R S M (receiver selectable mult i cast ) " , Proc. International Conference on Computers and Applications. Bei j ing, C h i n a , June 1987. pp 149-156. [Atkins and C a r t e r 86] M . S. Atk ins and A . W . Carter , "Rehable multicast interprocess com-m u n i c a t i o n " , Proc. CIPS Congress'86. 1986, pp 85-91. [Atkins et a l 89] M . S. A t k i n s , G . Haftevani and W . S. L u k , " A n efficient kernel-level depend-able mult icast protocol for distributed systems", Proc. Eighth Symposium on Reliable Distributed Systems, Seattle. W A , Oct . 10-12, 1989, pp 94-101. [Auerbach et al 91] J . .A.uerbach, M . G o p a l . M . K a p l a n and S Kut len , "Mult icast group mem-bership management in high speed wide area networks". Proc. of IEEE Eleventh Inter-national Conference on Distributed Systems, M a y 1991, pp 231-238. [Belkeir and .Vhamad 89] N . E . Belkeir and M . .A.hamad, "Low cost algorithms for message dehvery in dynamic multicast groups" , Proc. IEEE Ninth International Conference on Distributed Computing Systems. June 1989, pp 110-117. ^Benford 91] S. Benford, " B u i l d i n g group communication on O S l " . Computer networks and ISDN systems. 23 (1991) 87-90. Benford and Onions 87] S. Benford and J . Onions. " P i l o t distribution lists - agents and di -rectories". Proc. IFIP WG6.5 International Working Conference on Message Handling Systems. A p r i l . 1987. pp 3.4.1-3.4.24. [Bernabéu-Aubân et al 89] J . M . Bernabéu-Aubân. M . H . .\mmar and .M. .\hamad. " O p t i m a l selection of multicast groups for resource location in a distributed system". Proc. IEEE INFOCOM'89, pp 312-321. .\prd 1989. I Bernstein et al 88] P. X. Bernstein. V . Hadzilacos and N . Goodman . "Concurrency control and recovery in database systems". Addison-Wesley Publishing Company, 1988. ^Berry 90] L . Berry , " G r a p h theoretic models for multicast communications". Computer net-works and ISDN .systems. 20 ( 1990), pp 95-99. [Birman 85] K . P. B i r m a n . "Rephcation and fault-tolerance in ISIS system". Proc. ACM Tenth Symposium on Operating System Principles. Dec. 1985. pp 79-86. [Birman et al 91] K . A . B i r m a n . R. Cooper and B Gleeson. "Programming with process groups: group and multicast semantics". Tech. Rep. TR-91-1185, computer science, Cornell Univ . Feb. 1991. [Birman et al 85] K . P. B i r m a n , T . X. Joseph, T . Baenche and A . E . .Abbadi, "Implementing fault-tolerant distributed objects", IEEE Trans. Software Engineering. Vo l . SE-11 . No. 6. June 1985, pp 502-508. i B i r m a n and Joseph 87a] K . P. B i rman and T . X. Joseph, "ReUable communication In the presence of faUures". ACM Trans, on Computer Systems. \'ol. 5. .\o. 1. Feb. 1987. pp 47-76. [Birman and Joseph 87b] K . P. B irman and T . A . Joseph, "Exp lo i t ing virtual synchrony in distr ibuted systems". Proc. ACM Eleventh Symposium on Operating System Principles, Nov. 1987, pp 123-138. [Birman and Joseph 88] K . P. B i rman and T . A . Joseph, "Exp lo i t ing rephcation". Lecture Notes of Arctic S8 - an advanced course on operating systems. Tromso. .\orway. July 1988. [Birman et al 91] K . B i r m a n , A . Schiper and P. Stephenson. "Lightweight causal and atomic group mult i cast " ACM TOCS. 9. 3. A u g . 1991, pp 272-314. [Bondy and M u r t y 76] J . X. Bondy and U . S. R M u r t y " G r a p h theorv with appUcations". -MacmiUan. London. 1976. [Bogg 83] D . R. Bogg, "Internet broadcasting". Ph.D Thesis. Dept. of Electrical Engineering, Stanford Univ . , 1983. A lso Xerox P A R C Report CSL-83 -3 . [Bubenik and Turnet 89] R. G . Bubenik and J . S. Turner, "Performance of a broadcast packet swi t c i i " . IEEE Trans, on Commun.. 37 (1) (1989) pp 60-69. [Bux et al 82] V V . B u x . F . Closs. K . Kuemmerle . H . KeUer and H . R. Mueller, "The token r i n g " . Chapter 2 in " L o c a l A r e a xN'etworks: an advanced course". Lecture notes in computer science 184, Springer-Verlag Berhn Heidelberg, 1985. [Byrd et al 87a] G . T . B y r d , R . Xakano and B . A . Delagi , "A point-to-point multicast commu-nication protocol" ' . Technical Report KSL-87-02, Stanford Univ . . 1987. •Byrd et al S7b] G . T . B y r d , R. Xakano and B. A . Delagi, " A dynamic cut-through communi-cation protocol with mult i cast " . Technical Report KSL-S7-44- Knowledge System Lab. . Stanford U n i v . . A u g . 1987. [Byrd et al 89] G . T . B y r d , N . P. Saraiya and B . .\. Delagi, "Mul t i cas t communication in mul-tiprocessor systems"'. Technical Report. STAN-CS-89-1246 (also KSL-88-81J. Stanford Univ . , 1989. [Calo and Easton 81] S. B . Ca lo and M . C . Easton , " A broadcast protocol for file transfers to multiple sites", I E E E Trans, on C o m m u n . , Vo l . C O M - 2 9 , X o . 11, Nov. 1981. [Caples and Young 87] E . Caples and C . D . Young , "Mult idest inat ion protocols for tactical radio networks" . Proc. MILCOM, Oct . 19-22, 1987, pp 615-619. [CDNnet 90] C D N N e t headquarter, private communications. 1990. [Chang and Maxemchuk 84a] J . M . Chang , "Simphfying distributed database design by using a broadcast network" . Proc. ACM SIGMOD'84, June 1984. pp 223-233. [Chang and Maxemchuk 84b] J . M . Chang and N . F . Maxemchuk. "Rehable broadcast proto-cols", ACM Trans, on Computer Systems. Vo l . 2. X o . 3. A u g . 1984. pp 251-273. [Chanson and Rav indran 86] S. T . Chanson and K . Rav indran . " A distributed kernel model for rehable group communicat ion" , Proc. Sixth lEEE-CS Symposium on Real Time Systems. .\ew Orleans. Dec. 1986. pp 138-146. [Cheriton 86a] D . R. Cher i ton . "Problem-oriented shared memory: a decentrahzed approach to distributed system design"'. Proc. IEEE Sixth International Conference on Distributed Computing Systems. 1986. pp 190-197. 'Cheriton 86b] D. R. Cher i ton . "Request-response and multicast interprocess communication in the V kernel" . Proc. IBM Networking Workshop. C h . 3. Springer-Verlag. 1986. Cheriton 86ci D . R. Cher i ton . " V M T P : a transport protocol for the next generation of com-munication systems'". Proc. ACM SIGCOMM'86. 1986. pp 406-415. Cheriton 88al D . R. Cher i ton . " V M T P : versatile message transaction protocol — protocol specif ication". NIC/RFC 1045. Feb.. 1988. [Cheriton 88b] D . R . Cher i ton . "The V distributed system". ACM Commmiications Vol . 31. .\o. 3. M a r . 1988. pp 314-333. [Cheriton and Deering 85] D . R . Cheriton and S. E . Deering, "Host groups: a multicast exten-sion for datagram internetworks". Proc. Ninth Data Communication Symposium. .Sept. 1985, pp 172-178. 'Cheriton and Zwaenepoel 85] D . R. Cheriton and W . Zwaenepoel. "Distr ibuted process groups in the V kerne l " . ACM Trans, on Computer Systems. Vol . 3. Xo . 2. .May 1985. pp 77-107. [Comer 91] D. E . Comer , "Internetworking with T C P / I P . Vol 1: principles, protocols and ar-chitecture" , second edit ion. Prentice H a l l , 1991. [Comer and Peterson 86] D. E . Comer and L . L . Peterson. "Conversation-based mad" . ACM Trans, on Computer Systems. Vo l . 4, No . 4. Nov. 1986, pp 299-319. [Cooper 84a] E . Cooper , " C i r c u s : a rephcated procedure caU facihty"', Proc. IEEE Fourth Symposium on Reliable Distributed Systems and Database Systems. 1984. pp 11-24. [Cooper 84b] E . Cooper , "Rephcated procedure c a l l " , Proc. ACM Third Symposium on Prin-ciples of Distributed Computing, Aug. 1984. pp 220-232. [Cooper 85] E . Cooper . "Rephcated distributed programs". Proc. ACM Tenth Symposium on Operating System Principles. Dec. 1985. pp 63-78. [Cooper 86] E . Cooper , ".Mechanisms for constructing rehable distributed programs". Ph.D Thesis. Computer Science D i v . . Univ . of Cahfornia. Berkeley, 1986. [Cooper 90] E . Cooper , " P r o g r a m m i n g Language Support for Mult icast Communication in Distr ibuted Systems" . Proc. IEEE 10th International conference on distributed computing systems. P a i r s . 1990. pp 450-457. [Cristian 91ai F . C r i s t i a n . "Understanding fault-tolerant distributed systems". CAC.M. Vol. 34. 2. Feb. 1991. pp 57-78. [Cristian 91bl F . C r i s t i a n , "Research agreement on processor-group membership in synchronous distributed systems". Distributed computing. Vol . 4. No. 4. 1991. pp 175-187. [Cristian et ai 85] F . C r i s t i a n , H . .A.ghih, R. Strong and D. Dolev. ".Atomic broadcast: from simple message diffusion to Byzantine agreement". Technical Report R.J454Û(48668). I B M Research L a b . . Dec. 1986. [Crowcroft and Pal iwoda 88] J . Crowcroft and K . Pal iwoda. " A multicast transport protocol" . Proc. ACM SIGCOMM'88. 1988, pp 247-256. [Dadal 77] Y . K . Dada l , " 'Broadcast protocols in packet switched computer networks" , Ph.D thesis, Stanford Univ . , D S L T R - 1 2 8 , A p r i l 1977. [Dadal and Metcalfe 78] Y . K . Dada l and R . M . Metcalfe, "Reverse path forwarding of broad-cast packets", ACM Communications, V o l . 21, No. 12, Dec. 1978, pp 1040-1048. [Daily and Seitz] W . J . DaUy and C . L . Seitz, "Deadlock-free message routing in multiprocessor interconnection networks" , IEEE Trans, on Computers, Vo l . C-36, No . 5, May . 1987, pp 547-553. [Danielsen et a l 87] T . Danielsen, T . Folkow, and P. W . Richardsen. "Relat ion and inheritance in group communicat ion" , Proc. IF IP WG6.5 International Working Conference on Mes-sage Handling Systems, pp 4.5.1-4.5.15, A p r i l , 1987. [Danzig 89a] P. B . Danzig , " F l o w control for fast mult icast" . Proc. ACM SIGMETRICS'89. M a y 1989, pp 108-117. [Danzig 89b] P. B . Danz ig , " O p t i m a l l y selection the parameters of adaptive backoff algorithms for computer networks and multiprocessors" , Ph.D thesis, Computer Science D iv . , Univ . of Cahfornia , Berkeley, Dec. 1989. [Dasser 92] M . Dassser, " T O M P , a to ta l ordering multicast protocol" . ACM SIGOPS operating system review, Vo l . 26, No . 1 pp 32-40, J a n . 1992. [Dechter and Kleinrock 86] R . Dechter and L . Kleinrock. "Broadcast communications and dis-tr ibuted algorithms" , IEEE Trans, on Computers. Vo l . C-35. No. 3, M a r . 1986. pp 210-219. [Decitre et a l 82] P. Decitre et a l , " A n efficient error detection mechanism for a multicast trans-port service on the Danube network" , Proc. IFIP Workshop on Local Computer Networks. 1982, pp 335-347. [Deering 88a] S. E . Deering, "Host extensions for IP mult icast ing" . NIC/RFC 1054, May , 1988. [Deering 90] S. E . Deering, "Mul t i cas t rout ing in datagram internetworks and extended L A N s " . ACM TOGS, Vo l . 8, No . 2. M a y 1990, pp 85-110. [Deutsch 84] D . P. Deutsch, "Implementing distr ibution hsts in computer-based message sys-tems" , Proc. IFIP 6.5 Working Conference, North-HoUand. May, 1984. pp 1-11. [ErramiUi and Singh 87] A . ErramiUi and R. P. Singh. " A rehable and efficient multicast pro-tocol for broadcast networks" . Proc. ACM SIGCOMM'87 workshop, 1987. pp 343-352. [Even 79] S. Even, " G r a p h A l g o r i t h m s " , Computer Science Press. 1979. [Evangelist et al 89] M . Evangelist. X . Francez and S. K a t z . " M u l t i p a r t y interaction for inter-process communicat ion and synchronizat ion" . IEEE Trans. Software Engineering. \'ol. 15. X o . 11. Nov . 1989. pp 1417-1426. [Frank 841 A . J . Frank. " D y n a m i c group communication on G r i d netcomputers" . Ph.D thesis. Computer Science. S U N Y at Stony Brook. 1984. iFrank et al 85] .A.. J . Frank. L . D . Wi t t i e and A . J . Bernstein. "Mul t i cas t communication on network computers'", IEEE Software, V o l . 2, No . 3. .May 1985, pp 49-61. [Garcia-.Mohna and Kogan 88] H . G a r c i a - M o h n a and B . Kogan , ".An implementation of reliable broadcast using an unrehable multicast fac ihty" . Proc. IEEE Seventh Symposium on Reliable Distributed systems. 1988. pp 101-111. [Garcia-.Mohna et al 88] H . G a r c i a - M o h n a . B. Kogan and N . Lynch . "Rehable Broadcast in networks wi th nonprogrammable servers"', Proc. IEEE Eighth International Conference on Distributed Computing Systems, San .lose, June 1988, pp 428-438. i G a r c i a - M o h n a and Spauster 91] H . G a r c i a - M o h n a and .A. Spauster. "Ordered and reliable multicast communication"". ACM TOCS, Vol, 9 No. 3 Aug. 1991, pp 242-271. [Gehani 84] N . H . Gehani , "Broadcast ing sequential processes ( B S P ) " , IEEE Trans. Software Engineering, V o l . SE-10 . No . 4. July 1984, pp 343-351. [Goldman 91] K . J . G o l d m a n , "High ly concurrent logicaUy synchronous mult i cast" . Distributed Computing, V o l . 4. No . 4. 1991, pp 189-207. :Gopal and JafTe 84] 1. S. G o p a l and J . JafFe, "Po int - to -mult ipo int communication over broad-cast h n k s " . IEEE Trans, on Commun.. C O M - 3 2 (9), Sept. 1984. pp 1034-1044. Gopal and R o m 88] 1. S. G o p a l and R. R o m . "Mul t i cas t ing to multiple groups over broadcast channels". Proc. Computer Networking Symposium. .Apr. 11-13. 1988, pp 79-81. [Gopal and Wong 84] G . G o p a l and J . Wong, "'Two protocols for rehable broadcast: a perfor-mance s t u d y " , Proc. IEEE GLOBECOM'84, 1984. pp 11.1.1-11.1.6. [Hopcroft and UUman 79] J . E . Hopcroft and J . D . UUman, "Introduction to automata theorv. languages, and computat ion" . Addison-Wesley Publishing Co. 1979. [Hughes S6] L . Hughes. " M u l t i c a s t communications in distributed systems'". Ph.D Tliesis. Comput ing L a b . . Univ . of Newcastle upon Tyne . Oct . 1986. Hughes •^^ a] L . Hughes. " C h a t : an .\-party talk facihty for the U N I X 4.2 operating svstem". Computer Communications. 11(1), Feb. 1988. Hughes ^Sb] L . Hughes. ".\ multicast interface for U X I X 4.3", Software Practice ana Eiperi-I nee. Vo l . 18. .Xo. 1. Jon Wiley and Sons. L t d . . J a n . 1988. pp 15-27. [Hughes 89a] L . Hughes, " A multicast response handhng taxonomy" . Computer Communica-tions. Vo l 12 No 1, Feb. 1989, pp 39-46. [Hughes 89b] L . Hughes, "Gateway designs for internetwork multicast communication" . Com-puter Communications, Vo l 12 No 3, June 1989, pp 123-130. [Jacobs 89] K . Jacobs, " O S I - an appropriate basis for group communication?"'. Proc. ICC. 1989, pp 346-350. [Jaffe 85] J . M . Jaffe, "Dis t r ibuted multi-destination routing: the constraints on local informa-t i o n " , SIAM J. Comput., 14 (4) (1985) pp 875-888. [Jone et al 91] M . G . W . Jone, S. A . Sorensen and S. R. W i l b u r . "Protoco l design for large group mult icast ing : the message distribution protocol" . Computer Communications. Vo l . 14, No . 5, June 1991, pp 287-297. [Joseph and B i r m a n 88] T . A . Joseph and K . P. B i r m a n , "Rehable broadcast protocols". Lec-ture Notes of Arctic 88 — an advanced course on operating systems. Tromso, Norway, Ju ly 1988. [Kaashoek and Tanenbaum 91a] M . F . Kaashoek and A . S. Tanenbaum, "Fault tolerance using group communicat ion" , ACM Sigops, operating systems review. Vo l . 25. No. 2, pp 71-74, A p r i l , 1991. [Kaashoek and Tanenbaum 91b] M . F . Kaashoek and A . S. Tanenbaum, "Group communica-tion in A m o e b a distributed operation system", Proc. of IEEE Eleventh International Conference on Distributed Systems, M a y 1991, pp 222-230. [Kaashoek et a l 89] M . F . Kaashoek. A . S. Tanenbaum, S. F . Bummel and H . E . B a l , " .\n efficient rehable broadcast protocol" , ACM SIGOPS Operating System Review. Vol 23, No 4, Oc t . 1989, pp 3-19, [Lamport 78] L . L a m p o r t , " T i m e , clocks, and the ordering of events in a distributed system", ACM Communications, Vo l . 21, No . 7, July , 1978, pp 558-565. [Lampson 83a] B . W . Lampson, Atomic transactions" . Lecture Notes in Computer Science, Vol . 105, ch. 11, Springer-Verlag, 1983, pp 246-265. [Lanceros and Saras 88] A . G . Lanceros and J . A . Saras, "Definit ion of group communication facihties in the M H S " , Proc. IFIP WG6.5 Message Handling Systems and Distributed Applications, 1988, pp 295-308. [LeBlanc and Cook 84] T . J . LeBlanc and R . P. Cook, "Broadcast communication in Star.VIod". Proc. IEEE Fourth International Conference on Distributed Computing Systems. May , 1984. pp 319-325. [LeBlanc and Cook 85] T . J . LeBlanc and R. P. Cook. ' 'High level broadcast communication for local area networks" , IEEE Software, Vo l . 2, No. 3, M a y 1985, pp 40-48. [Lee 88] T . T . Lee, "Non-b lock ing copy networks for multicast packet switching" , IEEE J. Select. Areas in Commun., 6 (9) (1988) pp 1455-1467. [Leitner 84] G . Leitner , "Styhzed interprocess commuidcation — a kernel pr imit ive for rehable distributed comput ing " , Proc. IEEE Fourth Symposium on Reliable Distributed Systems and Database Systems, 1984, pp 25-33. [Leitner 85] G . Leitner , " D i s t r i b u t e d synchronization of multiple servers", Proc. IEEE INFO-COM'84, San Francisco, A p r i l 1985, pp 77-82. [LeLann 91] "Rehable atomic broadcast in distributed systems with omission faults" , ACM SIGOPS, operating systems review, Vo l . 25. No. 2, pp 80-86, A p r i l , 1991. [Liang et al 90a] L . L i a n g , S. T . Chanson and G . W . Neufeld, "Group communications: classi-fications and requirements" , IEEE Computer, Vo l . 23, No . 2, Feb. 1990, pp 56-66. [Liang et al 90b] L . L i a n g , G . W . Neufeld and S. T . Chanson, "Avo id ing name resolution loops and duphcations in group communications" , Proc. ACM SIGCOMM'90, Sept. 1990, pp 220-230. [Liang et al 92] L . L i a n g , G . W . Neufeld and S. T . Chanson, "Nested Groups for Internet Group C o m m u n i c a t i o n " , In preparat ion , 1992. [Liskov and Herhhy 83] B . L iskov and M . Herhhy, "Issues in process and communication struc-ture for distributed programs" , Proc. IEEE Third Symposium on Reliable Distributed Systems and Database Systems, Oct . 1983, pp 123-132. [Lin et al 86] Y . H . L i n , T . B . Gendreau, C. T . K i n g and L . M . N i , " A session layer design of a reUable I P C system i n the U N I X 4.2 environment", Proc. IEEE Computer Networking Symposium, Nov . 1986, pp 120-129. [Luan and GUgor 90] S. L u a n and V . D . GUgor, " A fault-tolerant protocol for atomic broad-cast", IEEE Trans. Parallel Distributed Systems. 1, 3, July . 1990, pp 271-285. [Malcolm and Vasudevan 84] M . A . Mal co lm and R. Vasudevan, " C o p i n g wi th network part i -tions and processor failures in a distributed system", Proc. IEEE Fourth Symposium on Reliable Distributed Systems and Database Systems, 1984, pp 36-44, [MarzuUo and Schmuck 88] K . MarzuUo and F . Schmuck, "Supplying high avaOabihty with a standard network file sys tem" , Proc. IEEE Eighth International Conference on Distributed Computing Systems, San Jose, June 1988, pp 447-453. :\Iaxemchuk and Chang 84] X . F. Maxemchuk and J . M . Chang. "Analys is of the messages transmitted in a broadcast protoco l " , Proc. International Computer (Conférence. .May. 1984. pp 1263-1267. [McKin iey 88] P. K . M c K i n l e y . "Mul t i cas t routing in bus-based computer networks". Proc. Computer Networking Symposium. .A.pr. 1988, pp 277-287. [McKin ley 89] P. K . M c K i n l e y , "Group communication in bus-based computer networks". Ph.D dissertation. Computer Science. U n i v . of Illinois - Urbana. 1989. [McKin ley and L i u 88] P. K . M c K i n l e y and J . W . S. L i u . "Mult icast routing in spanning bus hypercubes", Proc. International Conference on Parallel Processing, A u g . 1988. pp 204-211. [.McKinley and L i u .s9] P. K . M c K i n l e y and J . W . S. L i u . "Group communication in multichan-nel networks with staircase interconnection topologies". Proc. ACM SIGC0.\IM'S9. Sept. 1989. pp 170-181. [McKin ley and L i u 90] P. K . .McKinley and J . W . S. L i u . "Mult i cast tree construction in bus-based networks". ACM Communications. Vol.33. X o . l , Jan . 1990. pp 29-42. [Merl in and Schweitzer 80a] P. Mer l in and P. J . Schweitzer, "Deadlock avoidance in store-and-forward networks — I: store-and-forward deadlock". IEEE Trans, on Computers. Vol. C O M - 2 8 . No. 2. M a r . 1980, pp 345-354. [.VIerlin and Schweitzer 80b] P. Mer l in and P. J . Schweitzer, "Deadlock avoidance in store-and-forward networks — II: other deadlock types" . IEEE Trans, on Computers. Vo l . C0 .M-2S . .\o. 2, M a r . 1980, pp 355-360. [MeUiar et al 90] P. M . ^vleUiar-Smith. L . E . .Moser and \' Agrawala . "Broadcast protocols for distributed systems". IEEE Trans. Parallel Distributed Systems. 1. 1. J a n . 1990 pp 17-25. [Metcalf and Boggs 76] R. M . .Metcalf and D . R. Boggs. "Ethernet: distributed packet switch-ing for local computer networks". AC.lt Communications. \'ol. 19, .\'o. 7. July . 1976. pp 395-404. [Minet and .Vnceaume 91] ".\tomic broadcast in one phase". ACM SIGOPS. operating systems review. Vo l . 25. Xo . 2. pp 87-90. A p r i l . 1991. [Mockapetris 83] P. V . Mockapetris , "Analys i s of reliable multicast algorithms for local net-works" . Proc. Eighth Data Communication Symposium. Oct. 1983. pp 150-157. [Xakamura and Takizawa 911 "Reliable broadcast protocol for selectively partial ly orderins P D U s ( S P O protocol)" Proc. of IEEE Eleventh International Conference on Distributed Systems. M a y 1991. pp 239-246. [Nathan et a l 89] G . Nathan , P. Holdaway and G . A n i d o . " A mult ipath multicast switch archi-tecture" , Fast packet switch workshop, O T C , Sydney. 1989. [Navaratnam et a l 88] S. Navaratnam, S. Chanson and G . Neufeld. "Rehable group communica-tion in distributed systems", Proc. IEEE Eighth International Conference on Distributed Computing Systems, San Jose, June 1988, pp 439-446. [Neufeld et a l 90] G . W . Neufeld, M . Go ldberg and B . Brachman, "Dis tr ibuted apphcation support package, user m a n u a l " . Technical Report 90-37. Computer Science Dept, U B C , J a n . 1991. [Ngoh 89] L . H . Ngoh, "Mul t i cas t facihties for mult imedia group communication environ-ments" , P H . D Thesis. Univ of Manchester , J u l y 1989. [Ngoh 91] L . H . Ngoh, "Mul t i cas t support for group communications". Computer Networks and ISDN Systems. 22 (1991) ppl65-178. [Ngoh and Hopkins 89] L . H . Ngoh and T . P Hopkins , "Transport protocol requirements for distr ibuted mult imedia information systems" . Computer J., 32 (3) (1989) 252-261. [Palme 87] J . Pa lme , "Distr ibuted computer conferencing". Computer networks and ISDN sys-tems, 14 (1987) pp 137-145. [Pankoke-Babatz and Prinz 88] U . Pankoke -Babatz and W . Pr inz , "Support for coordinating group act iv i t ies" , Proc. IFIP WG6.5 Message Handling Systems and Distributed Appli-cations, 1988, pp 317-330. [Palme 85] J . Pa lme . "Distr ibut ion agents (maihng hsts) in message handhng systems", Proc. Computer Message Systems'85, Elsevier Science Pubhshers B . V . (North-HoUand) , 1985, pp 117-131. [Pankaj 86] J . P a n k a j , "Us ing Broadcast ing for multiprocess recovery", Proc. IEEE Sixth In-ternational Conference on Distributed Computing Systems. 1986, pp 582-589. [Pankoke-Babatz 87] U . Pankoke-Babatz , "Requirements for group communication support in electronic communicat ion" , Proc. IFIP WG6.5 International Working Conference on Message Handling Systems, A p r i l , 1987, pp 4.2.1-4.2.10. [Pardo and L i u 79] R. Pardo and M . T . L i u , "Mult i -dest inat ion protocols for distributed sys-tems" , Proc. Computer Networking Symposium, N B S , Gaithersburg, M D , 1979. [Peacock et al 80] J . K . Peacock. J . W . W o n g and E . G . .Manning, "Synchronization of dis-tr ibuted simulation using broadcast a lgor i thms" . Computer Networks, V o l . 4 No. 1, 1980, pp 3-10. [Perlman 85] R . Per lman , " A n algorithm for distributed computation of a spanning tree in an extended L A N " , Proc. Ninth Data Communications Symposium, A C M / I E E E , Sept. 1985, pp 44-53. [Peterson 88] L . Peterson, " 'Dragonmail : an exercise in distributed computing" . Software Prac-tice and Experience, 18(8), A u g . 1988, pp 791-803. [Peterson et a i 89] L . Peterson, N . C. Buchholz and R. D. Schhchting, "Preserving and using context information in interprocess communicat ion" , ACM Trans, on Computer Systems, V o l . 7 N o . 3, A u g . 1989, pp 217-246. [Prinz and Speth 87] W . Pr inz and R. Speth, "Group communication and related aspects in office au tomat i on " . Froc . IFIP WG6.5 International Working Conference on Message Handling Systems, A p r i l , 1987, pp 4.3.1-4.3.17. [Quarterman 90] J . S. Quarterman, "The M a t r i x " , Digital Press, 1990. [Rahnema 89] M . Rahnema, " T h e fast packet ring switch: a high-performance efficient archi-tecture w i t h multicast capabihty" , Proc. ICC, Boston (1989) pp 884-891. [Rajagopalan and M c K i n l e y 89] B . Rajagopalan and P. M c K i n l e y , " A token-based protocol for rehable, ordered mul t i cas t " , Proc. Eighth Symposium on Reliable Distributed Systems, Seattle, W A , Oct . 10-12, 1989, pp 84-93. [Ramakrishnan and J a i n 87] S. Ramakrishnan and B . J a i n , " A negative acknowledgement wi th periodic poUing protocol for multicasts over L A N ' s " , Proc. IEEE INFOCOM'87, 1987, pp 502-511. [Ravindran and Chanson 86a] K . Ravindran and S. T . Chanson. "Structure rehable interac-tions in distr ibuted server architectures". Technical Report TR 86-13, Computer Science, U B C . June 1986. [Ravindran and Chanson 86b] K . Ravindran and S. T . Chanson. "Process ahas-based structur-ing techniques for distributed computing systems" Proc. IEEE 6th international confer-ence on distributed computing systems, 1986, pp 355-363. [Rozier and M a r t i n s 87] M . Rozier and J . L . M a r t i n s , "The chorus distributed operating sys-tem: some design issues", Y . Paker et a l , "distributed operating systems, theory and pract ice" , NATO AS I series, Vo l . F28 , Springer-Verlag, 1987. pp 261-287. [Ricciadi and B i r k a n 91] A . Ricc iadi and K . P. B i r m a n , "Using process groups to implement failure detection in asynchronous environments". Tech. Rep. TR91-1188, computer sci-ence, CorneU U n i v . Feb. 1991. [Saras and H u i t e m a 87] J . A. Saras and C . Huitema, "The use of distribution hsts in i M H S " , Proc. IFIP WG6.5 International Working Conference on Message Handling Systems. A p r i l . 1987, pp 4.1.1-4.1.8. iSatyanarayanan and Siegel 90] M . Satyanarayanan and E . H . Siegel. •"ParaUel communication in a large distr ibuted environment". IEEE Trans, on Computers. Vol. 39. .Vo. 3. .March 1990. pp 328-348. [Schiper et al 89] .A.. Schiper, J . Eggh and A Sandoz, "A new algorithm to implement causal o rder ' . P r o c . the 3rd international workshop on distributed algorithms. Lecture notes on computer science 392. Springer-Verlag, New York . 1989, pp 219-232. [Schmuck 88] F Schmuck, " T h e use of efficient broadcast primitives in asynchronous distributed systems". P h . D Thesis , CorneU Univ . , 1988. [Schneider et a l 84] F . B . Schneider, D . Gries and R. D . SchUchting, "Fault-tolerant broad-casts", Science of Computer Programming, Vo l . 4. No . 1, 1984, pp 1-1-5. [Schneider 90] F . B . Schneider, '"Implementing fault-tolreant services using the state machine approach: a t u t o r i a l " , .ACM computing survey, V o l . 22. No. 4. Dec. 1990. pp 299-319. [SchUchting and Schneider 83] R . D . SchUchting and F . B . Schneider, "Fail-stop processors: an approach to designing fault-tolerant computing systems", ACM Trans, on Computer Systems. V o l . 1, No . 3, 1983, pp 222-238. [Schroeder et a l 84] M . D . Schroeder, .\. D . BirreU and R . M . Needham, "Experience with Grapevine : the growth of a distributed system" , ACM Trans, on Computer Systems. Vol . 2. N o . 1, Feb. 1984, pp 3-23. [SegaU and A w e r b u c h 83] A . SegaU and B. Awerbuch, " A reUable broadcast protocol" . IEEE Trans, on Commun., V o l . C O M - 3 1 . No. 7. 1983, pp 896-901. [Shimizu et a l 88] K . Sh imizu . M . Maekawaand J . Hamano , •'Hierarchical Object groups in dis-tributed operat ing systems" Proc. IEEE Eighth International Conference on Distributed Computing Systems, San Jose. 1988, pp 18-24. [Shizgal] 1. Shizga l , " A n A m o e b a rephcated service organizat ion" . Report CS-R8723. Center for Mathemat i c s and Computer Science, P .O . Box 4079, 1009. . \ B , .Amsterdam. The -Netherlands. [Smith 80] R . G . S m i t h , " T h e contract net protocol: high level communication and control in a distr ibuted problem solver". IEEE Trans, on Computers. Vol . 29. C-12. Dec. 1980. pp 1104-1113. [Smith et al 891 H . S m i t h , J . Onions and S. Benford, "D i s t r ibuted group communication - the A M I G O information model" . Ellis Horxcood Ltd. 1989. iSpauster 91] A . Spauster, "Ordered and rehable umlticast communication" . (^'S-TR-312-91. P h . D Thesis , Princeton Univ. June i991. CHAPTER?. CONCLUSIONS AND FUTURE RESEARCH 144 [Stephenson and B i r m a n 91] "Fast causal M u l t i c a s t " , ACM SIGOPS. operating systems review. Vol . 25, No . 2, pp 75-79, A p r i l , 1991. P h . D Thesis, CorneU Univ . , Feb. 1991. [Turner 87a] J . S. Turner , " T h e ChaUenge of multipoint communicat ion" , WUCS-87-6. C o m -puter Science, Washington U n i v . (Saint Louis) , 1987. [Verissimo et al 89] P. Veriss imo, L . Rodrigues and M . B a p t i s t a , " A M p : a highly parallel atomic multicast pro toco l " , Proc. ACM SIGCOMM'89, Sept. 1989, pp 83-93. [WaU 80] D . W . W a l l , "Mechanisms for broadcasts and selective broadcasts", Ph.D Thesis, Computer Science, Stanford U n i v . , June 1980. [Wall 82] D . W . WaU, "Selective broadcast in packet switched networks" , Proc. Sixth Berkeley Workshop on Distributed Data Management and Computer Networks. Feb. 1982. pp 239-258. [Waters et a l 84] A . G . Waters , C . J . A d a m , I. M . Leshe and R. M . Needham, "The use of broadcast techniques on the Universe network", Proc. ACM SIGCOMM'84, July 1984, pp 52-57. [Waxman 88] B . M . W a x m a n , " R o u t i n g of multipoint connections", IEEE J. Select. Areas Commun., 6 (9) (1988), pp 1617-1622. [WUbur 88] S. W i l b u r , " G r o u p communication as a tool in an organization support environ-ment" , Proc. IFIP WG6.5 Message Handling Systems and Distributed Applications, 1988, pp 309-316. [Wong and G o p a l 83] J . W . Wong and G . Gopa l , "Analys is of reUable broadcast in local-area networks", Proc. Eighth Data Communications Symposium, Oct . 1983, pp 158-163. [Wong and G o p a l 85] J . W . Wong and G . Gopal , "ReUable broadcast on local area network". Proc. IEEE ICC'85, 1985, pp 43.4.1-43.4.5. [Wosnitza 85] L . Wosn i t za , " G r o u p communication in the M H S context" , Proc. Computer Mes-sage Systems'85, Elsevier Science Pubhshers B . V . (North-HoUand) , 1985, pp 147-154. [Zhao and L i 92] H . Zhao and H . T . L i , " A mechanism of process group for appUcation reUa-biUty in distributed systems" , ACM SIGOPS operating system review, Vo l . 26. No. 1 pp 66-77, J a n . 1992. [Zwaenepoel 84] W . Zwaenepoel, "Message passing on a local network" . Ph.D Thesis. Com-puter Science, C S L - T R - 8 5 - 2 8 3 , Stanford Univ . , 1984. Appendix A Group Name Grammar Model Group naming structure can also be characterized by a grammar. For each top-level group .5". there exists a grammar Ls = {V,T. P,S), where K is a finite set of gids (non-terminals), and r is a finite set of pids (terminals).^ P is a finite set of production rules, representing the name expansion from a gid to aU of its immediate children. The name resolution for a group, say 5. returns a set of pids that corresponds to a string derived from the start ing symbol S by applying the production rules recursively. It should be noted that a name expansion grammar is context free ( C F G ) ; that is, the expansion of any gid does not depend on its expansion context. Also note that in the context of group name resolution, the order of symbols in the right hand side of every production rule is immater ia l . Therefore, the C F G Ls of the group S can be written as a right-hnear grammar ; that is, Ls can be a represented by a regular grammar. Because regular sets are closed under union operation [Hopcroft and UUman 79], the grammar corresponding to the global naming system (which is the union of aU top-level groups) is regular. Given a name grammar L = {V.T. P. S), a name grammar automata D = (U, T' ) can be generated. .Assuming L has been written in Chomsky Normal Form ( C N F ) , an arc (A — B) £ T' if and only if (.4 — xB) £ P. where x £ T. Since i is a regular grammar , Z} is a finite-state machine ( F S M ) . In a grammar representation, a name expansion corresponds to a sentence derivation of • A l t h o u g h the n u m b e r of processes a n d the n u m b e r of groups in a d i s t r i b u t e d s y s t e m vary in t ime, they are finite at any p a r t i c u l a r t ime i n s t a n c e . APPENDIX A. GROUP NAME GRAMMAR MODEL 1 4 6 the grammar , and group membership update corresponds to adding or deleting production rules. It is the latter that makes the model distinct from general regular language grammar manipulations. It is easy to see that the group name graph model is equivalent to the group name grammar model; that is. for each name graph, there exists a name grammar, such that a set of processes can be resolved from a group name in the graph if and only if the same set of processes can be derived from the grammar. The argument for this statement is tr iv ia l i f one notes the fact that a production rule ^ '1 =^ CTJJG ' ,2 • • -Gji^Pi^ ' ' ' Ptr exists in P if and only if the group G, contains 6 ' i , , G,^, • • •. Gt,^, pi^, • • -, p,v as its immediate children in the corresponding name graph, where G^, , • • •. G,^ G V and , • • •. p,-^  € T. This name grammar model exposes another dimension of research — is there any case in practice that the grammar is not C F G ? if yes, what are the imphcations and issues? New insight may well be obtained by exploring this dimension. However, we leave this as future research since it is out of the scope of this thesis. Appendix B Prototype Experiments This appendix describes the experiments conducted on the prototype implementation. There are eight experiments for join and seven for leave. A U the graphs given in this appendix are E G V graphs. The state of each arc in the I G V is highhghted by a different hue type — a thick hue represents a shadow edge, a dotted hue represents a fade arc, and a thin hue represents a normal arc. User requests are numbered in the order that they are executed. After each update, name resolutions are run at nodes in the name graph to verify name resolution con-sistency between the I G V and the E G V . These name resolution requests are not shown in the diagrams. The adjacency matrices of the E G V and the I G V were printed by the prototype and the diagrams were drawn (by hand) directly from the output of the prototype. In the foUowing, the labels in square brackets indicate the special cases tested in each experiment (refer to Figure 5.4 and 5.5). Join Experiments 1. Join experiment 1 in Figure B . l is used to test V N construction, including: (i) the construction of a t r iv ia l V N ([J.2.a] at step 7), (ii) the construction of a nontr ivial V N whose component hst contains physical nodes only ([J.2.b] at step 3), (iii) V N topology modifications when adding an arc that connects two nodes in the same V N ([J.2.d] at step 4), and (iv) the construction of a nontrivial V N whose component hst contains both physical nodes and V N s ([J.2.c] at step 9). a 1. join (a, b) 2. join (b, c) 3. join (c, a) 4. join (a, c) 5. join (e, c) 6. Join (d, f) 7. join (f, d) 8. join (b, d) 9. join(f,e) Figure B . l : J o i n experiment 1. 2. Join experiment 2 in Figure B.2 is used to test D L detection and control, including: (i) the detection of a simple D L that contains no embedded D L or V N ([J.3.a] at step 6, DL{b,d) is generated and detected), (u) the detection of D L s that share some segments ([J.3.b and J.3.c] at step 7, DL(a, f) and DL(a,d) that share some nodes with DL{b,d) are generated and detected), (in) the generation of a V N source ([J.3.d] at step 9) or a V N sink ([J.3.d] at step 13), (iv) the construction of a V N that contains embedded D L s ([J.4] at step 14. DL(a,d), DL{b,d) and DL(a,f) are embedded in the new R L ) . (v) the detection of paraUel arcs between V N s ([J.3.g] at step 16), and (vi) the effect of update to the ancestors of the source on the R D control records ([J.3.e] at steps 8 - 1 0 and 12). 1. join (a, b) 10. join (u, x) 2. join(b, c) l l . jo in(u , v) 3. join (c, d) 12. join (v, u) 4. join (e, f) 13. join (d, f) 5. join (f, d) 14. join (d, a) 6. join(b, f) 15.join(c,e) 7. join (a, e) 16.join(v, e) 8. join (x, a) 9. join (a, x) d Figure B .2 : Join experiment 2. 3. Join experiment 3 in Figure B.3 is used to test: (i) the detection of paral le l arcs between two V N s ([J.3.g] at step 8), (u) the effect of topology change within the V N source/sink on a D L consisting of paraUel arcs ([J.2.d and J.3.f] at steps 9 and 10), and (in) the detection of an R L that contains an embedded D L having paraUel arcs between two V N s ( [J .2 .C and J.4] at step 11). 4. Join experiment 4 in F igure B.4 is used to test: (i) the effect of changing a physical D L sink to a V N sink ([J.3.d] at step 4), (u) the detection of paraUel arcs between a physical source node and a V N sink ([J.3.g] at step 5), 1. join (a, b) 2. join(b, c) 3. join (c, a) 4. join (x, y) 5. join (y, z) 6. join (z, x) 7. join (a, x) 8. join (b, z) 9. join (a, c) 10. join (y, x) 11. join (z, c) y Figure B .3 : Join experiment 3. (iii) the detection of a D L that has a V N source and a physical sink node ([J.3.g] at step 12), and (iv) the detection of paraUel arcs between two V N s ([J.3.g] at step 13). In case (i) , the D L is replaced by a new D L with a V N sink and the R D control record is re-assigned. 5. Join experiment 5 in Figure B.5 is used to test: (i) the construction of a V N that contains nodes in different segments of a D L and its effect on D L s ([J.5] at step 8), (u) the effect of topology change internal to a V N that is the sink of a D L and the source of another D L ([J.3.fj at step 9), and (iU) the detection an R L that contains embedded DLs ([J.4] at step 10). 1. join (a, c) 2. join (b, d) 3. join (c, d) 4. join (d, c) 5. join (a, d) 6. join (e, f) 7. join (f, e) 8. join (c, e) 9. join (f, g) 10. join (g, h) 11. join (h, i) 12. join (e, h) 13. join(d,f) c K Figure B.4 : Join experiment 4. 6. Join experiment 6 in Figure B.6 is used to test the detection of paraUel arcs leading to the same component in a V N sink ([J.3.gJ at steps 8 and 9), and Join experiment 7 in Figure B . 7 is used to test the detection of paraUel arcs leading to different components in a V N sink ([J.3.g] at steps 8 and 9). B o t h experiments also detect R L s containing embedded D L s ([J.4] at step 10). 7. Join experiment 8 in Figure B.8 is used to test: (i) the detection of a D L containing a V N source and a V N sink ([J.3.d] at step 8), (ii) the detection of paraUel arcs in a segment of a D L and the R D control assignment for the new embedded D L ([J.3.c] at step 9), and (in) the detection of an R L that contains embedded D L s ([J.4] at step 10). a 1. jom (a, b) 2. jom (b, c) 3. join (c, a) 4. join (d, e) 5. join (e, f) 6. join (f, d) 7. join (a, d) 8. join (b,d) 9. join (c, d) 10. join (f, a) e f Figure B.6: Join experiment 6. Leave E x p e r i m e n t s 1. Leave experiment 1 in Figure B.9 is used to test: 1. join (a, 2. join (b, cj 3. join (c. a.) 4. join (d, e; 5. join (e, 0 6. join (f, d) 7. join (a, d) 8. join (b, ej 9. join (c, 0 10. join (f, a) Figure B.7: Join experiment 7. 1. join (a. b) 2. join (b. C) 3. join (c. aj 4. join (d. ej 5. join (e. t) 6. join (f. e) 7. join (a, d) 8. join (b. e) 9. join (c. 0 10. join (f. a; ^ c (i) a normal arc deletion without breaking any R L or removing any segment of a D L ( [L. l .a] at step 9), (u) the removal of a secondary segment when a normal arc is deleted ( [L.l .b] at step 11), and (iu) the removal of the p r i m a r y segment due to a normal arc deletion, and the selection of a new pr imary segment for this D L ( [L . l . c . l ] at step 14). 1. join (a, b) 2. join (b, c) 3. join (c, d) 4. join (a, e) 5. join (e, f) 6. join (f, d) 7. join (d, g) 8. join (g, h) 9. leave(d, g) 10. join (d, g) 11. leave (e, 0 12. join (e, f) 13. join(b, d) 14. leave (b, c) Figure B .9 : Leaue experiment 1. 2. Leave experiment 2 in F igure B.IO is used to test the effect of deleting V N internal arcs without changing the loop deep equal relation among al l the components ([L.2]). The deletions of fade arcs are tested at steps 7, 8 and 9. The deletions of shadow edges (with the same in i t ia l name graph) are tested at steps 7, 8, and 9. 3. Leave experiment 3 in Figure B . l l is used to test paraUel arc deletions: (i) the deletion of a paraUel arc between a physical source node and a V N sink ([L.3] at step 11). a 1. join (a, b) 7. leave (a, c) 2. join (b, c) 8. leave (c, b) 3. join (c, a) 9. leave (b, a) 4. join (a, c) 5. join (c, b) 7. leave (a. b) 6. join (b, a) g jgave (b, c) 9. leave (c, a) Figure B.IO: Leate experiment 2. (ii) the deletion of a parallel arc between two V N s ([L.3] at step 12), and (iu) the deletion of a paraUel arc between a V N source and a physical sinli node ([L.3] at step 13). 4. Leave experiment 4 in Figure B.12 is used to test V N breaking: (i) brak ing a V N completely so that no retained V N needs to be constructed ([L.4.b] at steps 14 and 15), and (h) constructing retained V N s after a V N is broken ([L.4.a] at step 13). Re-estabhshing ancestor-descendant relationship among aU nodes (including retained V N s ) in the broken V N is also tested ([L.4.b]). 5. Leave experiment 5 in Figure B.13 is used to test the effect of V N breaking on D L s : (i) breaking a V N does not break any segment of a D L and aU the nodes in that V N remains in the segment of the D L (at step 13), and (u) breaking a V N also breaks a D L ( [L . l . c and L.4.c] at step 15). 6. Leave experiment 6 in Figure B.14 is used to test the effect on a D L when its V N source or V N sink is broken: 1. join (b, c) 2. join (c, b) 3. join (d, e) 4. join (e, d) 5. join (a, b) 6. join (a, c) 7. join (b, d) 8. join (c, e) 9. join (d, f) 10. join (e. f) a a ll.leave(a, c) 12. leave(b, d) a a 13. leave(d, f) 1. join (a, b) 2. join (b, c) 3. join (c, a) 4. join (d, e) 5. join (e, f) 6. join (f, d) 7. join (b, e) 8. join (f, c) 9. join (g, b) 10. join(f,h) 11. join (a, x) 12. join (y, e) 13. leave (b, e) 14. leave (f, d) f Figure B.12: Leave experiment 4. (i) the V N source is broken ([L.4.c] at step 11) and the D L is replaced by a new D L with a physical source node, and (ii) the V N sink is broken ([L.4.d] at step 12) and the D L is replaced by a new D L wi th a physical sink node. a f f Figure B.14: Leave experiment 6. 7. Leave experiment 7 in Figure B.15 is used to test D L detections in leave: (i) the detection of re-exposed D L s that were embedded in the broken V N ([L.4.e] at step 11), (h) the effect of modifications to V N source/sink ([L.4 .C and L.4.d] at step 12), and (iii) breaking V N source/sink and replacing the affected D L by a new one that has a physical source/sink node ([L.5] at step 13). 12. leave(f, e) d d Appendix C Glossary In this appendix, we provide a glossary to help readers to clearly understand some of the terms used in this thesis. Adjacency matrix and adjacent arc list: Adjacency matrix is a standard data structure representation of graph topology. In this thesis, it is used to store the E G V connectivity information of a V N . This information is required when an R L is broken. In this thesis, an adjacent arc list is a data structure used by a node to store the state information about its adjacent arcs, or the adjacent arcs of the V N in which the node is a component. This information is required in name resolution and pathname update. Client and server: A client invokes an operation to request certain service, a server or a server group receives and processes the request to provide the service. Concurrency control : Synchronize the execution of concurrent updates that may aflfect each other to produce a serializable result. A n execution of a set of updates on an init ia l name graph is serializable i f the resulting name graph is produced as if the updates were executed one after another on the in i t ia l name graph. Deterministic and nondeterministic group: Deterministic group requires aU members to be synchronized at the system-level in receiving and processing messages, nondeterministic group allows this requirement to be relajced to certain degree, relying on the apphcation-level software to handle the inconsistency in an apphcation-specific manner. Duplication loop: A segment is a set of paths from a source node to a sink node. Paths in a segment share the same last hop leading to the sink. Two segments between a pair of nodes are distinct i f their last hop to the sink are different. A duplication loop ( D L ) consists of mult ip le distinct segments between a pair of nodes. A segment in a D L is caUed the p r i m a r y segment i f no R D control for that D L is performed at the last hop of that segment. Other segments are called secondary segments. E G V and I G V : T h e E G V is the external group view of the name graph perceived by omni -scient observers wi th global information, the I G V is the internal group view of the name graph mainta ined by the underlying naming system. The two views are name resolution consistent i f the results of name resolution performed on both views always match . Failure handl ing : T h e act ivity of reorganizing the I G V to preserve name resolution consis-tency w i t h respect to the remaining E G V after a failure. NA-fai lure-handling refers to failure handUng when a normal arc fails; VN- fa i lure -handl ing refers to faUure handUng when a V N is broken (i.e., when shadow edges or fade arcs fad). G r o u p : A set of receivers that are put together to receive group messages and to provide service cooperatively. A nested group is a group that contains other groups as its members. G r o u p C o m m u n i c a t i o n : Sending a message from one sender to a group of receivers. A t network-level , this one-to-many message exchange is referred as multicast. Unicast refers to one-to-one message exchange and broadcast refers to one-to-aU. Exchanging messages between an external sender and a receiver group is called inter-group commu-nication, exchanging messages within a group is caUed intra-group communication. O r d e r e d group communication ensures that message dehvery satisfies certain order-ing requirements defined for aU the receiving members. A t o m i c group communication ensures that a message to a group is received either by al l the active members or by none i n failure situations. Internet: A network that inter-connects a set of sub-networks (e.g., L A N s ) . The links con-necting sub-networks are caUed internet links. Irrelevant message: A message becomes irrelevant at a node if the message is sent to a group i n which the node is no longer a member at the time of message dehvery. Loop deep equal: T w o groups are deep equal if their name resolution results are the same. Two groups are loop deep equal if they are in the same R L . Loop deep equal is an equiv-alence relation. Missing message: A message is missing at node if that message has been received by one of its immediate parents, but has not by that node. N o r m a l arc, shadow edge and fade arc: A normal arc connects two nodes that are not loop deep equal. N o r m a l arcs always show up in both the E G V and the I G V . A shadow edge or a fade arc connects two nodes that are loop deep equal in the E G V . Fade arcs are excluded from the I G V and shadow edges are bidirectional. To avoid loop effect, a special name resolution procedure is executed in accessing shadow edges. .A. shadow tree is a tree consisting of only shadow edges in the I G V . Name graph: A directed graph model that represents the naming structure of nested groups. A derived name g r a p h is a D A G that represents the ancestor-descendant relationship among nodes in a name graph. Nodes that are in the same strongly connected component in the name graph are represented as a single node in the derived name graph. .Xo ancestor-descendant relation is defined among these nodes. Name resolution: A process that maps a group identifier to the group member process iden-rifiers by traversing through a name graph. The paths in the name graph traversed to resolve a group identifier are caUed the name resolution paths of that group. Originator or coordinator: The node that receives an update request from the user is the originator OT the coordinator oi the update transaction. The node which receives a group message (i.e., a name resolution request) from the user is called the originator group of that message. Parallel arcs: M u l t i p l e arcs from one node to another node in a graph. Parent and child : W h e n a normal arc a b exists in the name graph, node a is an immediate parent of b and node b is an immediate child of a. W h e n a path from node a to node b exists in the name graph and a and b are not loop deep equal, node a is an ancestor of b and node 6 is a descendant of a. Unhke a tree, a node can have more than one immediate parent in the name graph. In a D L , a node is called the pr imary parent of the sink in that D L i f no R D control is performed at that node for the D L . Other immediate parents of the sink in that D L are called secondary parents. Pathname: A sequence of nodes representing an incoming path to a node in the derived name graph. Th is representation is internal to the naming system at that node and is used for R L / D L detections. A U the pathnames at a node form the pathname set of the node. Pathname inheritance is a method to derive pathnames from the pathname set of an immediate parent. Pathname update relay is the process of updat ing pathnames at the descendants when a name graph update is performed. R D control: The action of resolution duplication control to suppress name resolution duph-cations. It should not suppress dupUcated messages resulted from retransmissions. Resolution loop: A cycle is a single path loop in the name graph. A resolution loop ( R L ) contains aU the cycles that are chained together. A n R L is a strongly connected component in the terminology of graph theory. The existence of an R L indicates that nodes in the R L contain themselves as members. Root node and leaf node: In a name graph, a node with in-degree zero is caUed a root node and a node with an out-degree zero is caUed a leaf. Simple graph: A simple graph that has no self-loop or multiple edges. Strongly connected component: In a graph, a strongly connected component is a subgraph in which aU nodes are mutuahy reachable. Subgraph: .\ subgraph rooted at a node consists of aU the nodes and arcs that are reach-able from that node. The subgraph of an update is the subgraph rooted at the update originator. The pathname sets at the nodes in this subgraph have to be updated after the update. V i r t u a l node: A virtual node ( V X ) contains all the nodes in an R L . .\odes in an \".\ are called components of the V N . V N does not physicaUy exist. It is a logical concept used by the naming system to maintain the connectivity information among nodes that are loop deep equal. Working arc and Working set: The working arc of an update is the arc to be added or deleted, the working set of the update contains aU the nodes among which the loop deep equal relation is to be changed by the update. 

Cite

Citation Scheme:

        

Citations by CSL (citeproc-js)

Usage Statistics

Share

Embed

Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                        
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            src="{[{embed.src}]}"
                            data-item="{[{embed.item}]}"
                            data-collection="{[{embed.collection}]}"
                            data-metadata="{[{embed.showMetadata}]}"
                            data-width="{[{embed.width}]}"
                            async >
                            </script>
                            </div>
                        
                    
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.831.1-0051179/manifest

Comment

Related Items