UBC Theses and Dissertations


Reliable group communication in distributed systems Navaratnam, Srivallipuranandan 1987

RELIABLE GROUP COMMUNICATION IN DISTRIBUTED SYSTEMS

By

SRIVALLIPURANANDAN NAVARATNAM
B.Eng.(Hons.), The University of Madras, 1983
M.A.Sc., The University of British Columbia, 1986

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE STUDIES (DEPARTMENT OF COMPUTER SCIENCE)

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
October 1987
© Srivallipuranandan Navaratnam, 1987

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Abstract

This work describes the design and implementation details of a reliable group communication mechanism. The mechanism guarantees that messages will be received by all the operational members of the group or by none of them (atomicity). In addition, the sequence of messages will be the same at each of the recipients (order). The message ordering property can be used to simplify distributed database systems and distributed processing algorithms. The proposed mechanism continues to operate despite process, host and communication link failures (survivability). Survivability is essential in fault-tolerant applications.
Table of Contents

Abstract
List of Tables
List of Figures
Acknowledgements

Chapter One Introduction
1.1 Goal of the Thesis
1.2 Motivation
1.3 Underlying System Model and Assumptions
1.4 General Design Philosophy of the Proposed Group Communication Mechanism
1.5 Related Work
1.6 Outline of the Thesis

Chapter Two Properties of a Reliable Group Communication Mechanism
2.1 Full Delivery
2.2 Correctness
2.2.1 Order
2.2.2 Atomicity
2.2.3 Survivability
2.3 Outline of the Group Send Primitives
2.4 Chapter Summary

Chapter Three Design of the Proposed Group Communication Mechanism
3.1 Primary and Secondary Group Managers
3.2 Design of the Group Send Primitives
3.2.1 Ordered Group Send (OGSEND) Primitive
3.2.2 Unordered Group Send (UGSEND) Primitive
3.3 Failure Detection and Recovery Procedures
3.3.1 Group Member Failure
3.3.2 Secondary Manager Host Failure
3.3.3 Primary Manager Host Failure
3.4 New Primary Manager Selection Scheme: An Overview
3.4.1 Succession List Selection Scheme: A Finite State Model
3.4.1.1 Description of the States
3.5 Network Partition
3.5.1 Discarding Messages From Different Subgroups
3.5.2 Merging Subgroups
3.5.2.1 Detection of Subgroups
3.5.2.2 Resolving the Leadership
3.6 Chapter Summary

Chapter Four Implementation Details and Performance of the Proposed Group Communication Mechanism
4.1 Group Management
4.1.1 Creating a Group
4.1.2 Registering a Group
4.1.3 Joining a Group
4.1.4 Leaving a Group
4.2 Organization of the Manager Member List
4.3 Group Communication
4.3.1 Ugsend Implementation
4.3.2 Ogsend Implementation
4.3.3 Detection of Duplicates
4.4 Worker Processes
4.4.1 Courier
4.4.2 Prober
4.4.3 Vulture
4.5 Failure Detection and Recovery
4.5.1 Secondary Manager Failure
4.5.2 Primary Manager Failure
4.6 Network Partition
4.7 Performance of the Group Send Primitives
4.8 Chapter Summary

Chapter Five Conclusions

Bibliography
Appendix A

List of Tables

1. Elapsed time (milliseconds) for ugsend and ogsend. Sending process in the same host as the primary manager.
2. Elapsed time (milliseconds) for ugsend and ogsend. Sending process in the same host as a secondary manager.

List of Figures

1.1 Distributed database update using group IPC
1.2 Single sender - multiple receivers
1.3 Multiple senders - single receiver
1.4 Multiple senders - multiple receivers
2.1 Happened before relation for ordered delivery
3.1 Group manager's message transmission
3.2 Vulture process
3.3 Prober process
3.4 State transition diagram of primary manager selection scheme
3.5(a) Group view before the network partition
3.5(b) Group view after the network partition
3.5(c) Group view after the network remerge
3.6 State transition diagram of primary managers resolving leadership upon network remergence
4.1 Manager member list
4.2 Receive buffers
4.3 Courier process
A.1 V domain of local network-connected machines
A.2 Send-receive-reply message transaction
A.3 Send operation in V
A.4 ReceiveSpecific operation in V
A.5 Reply operation in V

Acknowledgements

I would like to acknowledge my appreciation to both my supervisors, Dr. Samuel Chanson and Dr. Gerald Neufeld, who have given valuable advice and guidance during the course of this research. I would also like to thank Ravi, who aided with ideas and criticism. Encouragement from Mehrnaz and Cindy is gratefully acknowledged.

Chapter One
Introduction

1.1 Goal of the Thesis

This thesis is concerned with the design and implementation of a reliable one-to-many interprocess communication (IPC) mechanism for supporting distributed computations in an environment where certain types of failures can occur. One-to-many IPC (also known as multicast or group communication) refers to an activity by which a single message may be transferred from one process to many other processes, which may be in the same or different hosts in the distributed system. The mechanism guarantees that the message will be received by all the operational receivers or by none of them. It also ensures that the messages sent from the senders will be delivered in the same order to all the receivers.

The following section describes the motivation behind this work by giving examples where a reliable group communication mechanism such as the one proposed is necessary. Section 1.3 briefly describes the underlying system model and the assumptions made in the design of the group communication mechanism. The design philosophy of the proposed group communication mechanism and a general description of the scheme are given in Section 1.4. Section 1.5 reviews previous work and highlights its differences from our proposed mechanism. Section 1.6 concludes this chapter by giving an outline of the thesis.

1.2 Motivation

One of the promises of distributed computing is a more available computing system. To achieve this goal it is necessary to replicate computations
Such an architecture allows certain critical resources to be maintained on more than one host and be conveniently  shared by client processes with enhanced modularity, performance  and reliability [4]. The clients access the members of the group as a single logical  entity  communicate  using  the  group's  logical  name.  Hence  there  is  a  need  to  the same information to all members of a group. Thus, many  applications can benefit from a multiple-destination message transport mechanism such as broadcast and multicast. Broadcast is the delivery of a message to all the  destination  addresses.  Multicast is  the  delivery of a message to some  specified subset of the possible destinations. Requirement  for  reliable  group  communication  mechanism  arises  in  applications that are distributed to achieve parallel processing, resource sharing, data availability and reliability. For example, consider an application that updates replicated copies of a distributed database  maintained on different  hosts as  illustrated in Figure 1.1. In order to perform an update to the database, a process first requests the database managers (DBMs) on each host to obtain a lock on the item to be updated. Each DBM will reply with an indication whether or not the lock is available. Here the set of DBMs can be viewed as a process group and the request message can be sent to the DBM group. Clearly this request must be performed reliably in order to assure that all the DBMs will receive the request message. Once it is confirmed that all the locks are acquired, a notification containing the update can tlien be sent to the DBM group. This notification must also be reliably delivered to each DBM.  3  Figure 1.1  In group,  is  required  to  Requiring  all  all  the  the  that  the  members  members  of  sequence is stricter  (and  to  messages.  obtain  systems.  
all If  in different additional  the  thus  the m e m b e r s order, they  used  messages but a  must group  may  to  to  also  be  to  However  this  at  synchronize  simplify  recovery procedures i n a distributed  U n d e r l y i n g System Model and  the  the  group  delivered  receive  the  overhead) property  residing a t  not a r r i v e to  sent  includes higher  of a group  communications  sequencing c a n be  1.3  IPC  some applications w h e r e s e v e r a l processes are interacting  it  delivered  Distributed database update using group  w i t h the  must in  not  the  messages  only  same  than j u s t requiring  them  useful  in  distributed  hosts receive messages  the same state at a l l , or m a y states.  design of  database s y s t e m  Assumptions  Furthermore,  concurrency [3].  order. same  different  their  be  the  is  in  same  control  require message  and  crash  4  The  prototype model of the proposed  built on top of the V Kernelt, number of SUN  group commnunication mechanism is  a distributed operating system running on a  workstations in our Distributed  Systems Research  Laboratory.  These workstations are diskless and connected to a 10 Mbps Ethernet which is a broadcast  network. However, the principles of the proposed  mechanism  is  not  dependent on the underlying kernel or the network. In  the context of our work two types of processes run in each host;  processes responsible for implementing  the group communication  mechanism and  application processes which make use of the group communication mechanism. We assume that the application processes may fail but the processes responsible for implementing  the group  communication  machine itself fails. We  mechanism  never  fail  unless  the host  also assume that when processes or hosts fail  they  simply cease execution without making any malicious action (i.e., fail stop) [16]. 
If the host at which a failed process was executing remains operational, we assume that this failure is detected by the underlying operating system and that all the interested parties are notified [14]. On the other hand if the host itself fails, all the processes executing in it fail  and processes  at other hosts can  detect this only by timeouts. Furthermore, we assume that the underlying system provides a reliable one-to-one message transport protocol. In other words, error detection  and  correction  mechanisms  retransmission) exist which guarantee  (such  as  checksum,  timeout  and  a unicast message to be delivered to it's  destination free of errors. In the environment where our proposed group communication mechanism is built, no information survives host failures. Since hosts are diskless there is no possible recovery from stable storage. Therefore the case of a process in a host t The semantics of IPC facilities provided by the V Kernel is described in Appendix A.  5 receiving a message before host failure and one where the host fails before the message is delivered to it are indistinguishable. Thus, our group communication mechanism  can only guarantee that all operational members of a group will  receive all the messages in the same order. 1.4  General  Design  Philosophy  of  the  Proposed  Group  Communication  Mechanism  The design of the group communication mechanism should be general and not dependent on specific characteristics of the group or functions available from the  underlying  depending  network.  on  hardware may  whether or may  For their  example  the group  membership  list  may  may  be  static  change.  The  or  dynamic  underlying  not support broadcast and multicast facilities. Consider the  case where the underlying hardware  supports  only a single-destination message  transport mechanism (unicast). 
In this case, delivery of a message to the group can be achieved only by maintaining the list of members in the group and sending the message to the individual members using one-to-one IPC. However, if the underlying hardware supports broadcast and multicast, then the members of a group can subscribe to a particular multicast address. A message intended for the group can be sent to this address, and only those hosts where one or more members of this group reside will read the message.

Broadcast networks such as the Ethernet give the impression that they provide reliable delivery in the hardware, but in reality they do not. Messages transmitted in these networks are available to all the receivers, but some or all of the receivers may lose messages. Some examples [15] of how this may happen are given below:

1. The buffer memory might be full when a message arrives at the interface unit.
2. The interface unit might not be monitoring the network at the time the message is delivered.
3. In a contention network, an undetected collision that affects only certain network interface units could cause them to miss a message.

Unlike the reliable transport of unicast packets, where the sender can retransmit the message until the receiver acknowledges it, it is hard to support reliable transport of multicast packets unless the number and identity of the group members are known. If the membership list is maintained, then the message can first be multicast to the members in a datagram fashion, and members whose acknowledgements are not received within a fixed time interval can be sent the message again on a one-to-one basis.
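The multicast-then-retransmit scheme just described can be sketched as follows. This is a minimal simulation under assumed names (Member, lossy_multicast and reliable_multicast are invented for illustration); the lossy datagram multicast is modelled with a random drop.

```python
# Sketch of reliable delivery over an unreliable multicast: send once as a
# datagram, then fall back to one-to-one sends for members whose
# acknowledgement was not received. All names are hypothetical.

import random

class Member:
    def __init__(self, name):
        self.name = name
        self.received = []

    def deliver(self, msg):
        self.received.append(msg)
        return self.name            # the "acknowledgement"

def lossy_multicast(members, msg, loss_rate=0.5):
    """Datagram-style multicast: some members may miss the message."""
    acks = set()
    for m in members:
        if random.random() > loss_rate:
            acks.add(m.deliver(msg))
    return acks

def reliable_multicast(members, msg):
    acks = lossy_multicast(members, msg)
    # Any member whose ack was not received gets the message one-to-one,
    # relying on the reliable unicast transport assumed in Section 1.3.
    for m in members:
        if m.name not in acks:
            m.deliver(msg)

random.seed(7)
members = [Member(n) for n in ("a", "b", "c", "d")]
reliable_multicast(members, "update")
```

Note that this sketch only works because the sender knows the full membership list, which is exactly why the coordination mechanisms discussed next are needed.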
Thus, for reliable delivery of messages to all members of a group, some coordination mechanism is needed to maintain the group membership list. For a static group where the membership never changes, the necessary coordination mechanism can be built into the underlying system. However, for a dynamic group where the members may join or exit at any time, a group manager is needed to maintain the membership list. This scheme will be rendered ineffective if the host where the group manager is executing fails. One solution to this problem is to replicate the group manager at all member sites and select a new group manager from among these replicas in case of failures. Thus, a group will have one primary group manager (or simply primary manager) and zero or more secondary managers. Mechanisms for selecting the primary manager include the token scheme [3], the succession list [10] and election [12].

In addition to reliable delivery of messages to all members of the group, the group communication mechanism must also ensure that messages are delivered in the same order to all the members. In a system with a single sender and many receivers, sequencing messages to all of the receivers is trivial. If the sender initiates the next multicast transmission only after confirming that the previous multicast message has been received by all the members, then the messages will be delivered in the same order. This is illustrated in Figure 1.2. On the other hand, in a system with many senders and a single receiver, the messages will be delivered to the receiver in the order in which they arrive at the receiver's host. Ordering in this case is a non-problem, as illustrated in Figure 1.3.

Figure 1.2: Single sender - multiple receivers
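The single-sender case above amounts to stop-and-wait sequencing: the next multicast starts only after every member has acknowledged the previous one. A minimal sketch, with invented names for illustration:

```python
# Stop-and-wait ordered multicast for a single sender: because every member
# acknowledges message i before message i+1 is sent, all members log the
# messages in the same order. Names here are hypothetical.

class Receiver:
    def __init__(self):
        self.log = []

    def deliver(self, msg):
        self.log.append(msg)   # deliver and acknowledge
        return True

def ordered_multicast(receivers, messages):
    for msg in messages:
        # Collect every acknowledgement before initiating the next multicast.
        acks = [r.deliver(msg) for r in receivers]
        assert all(acks)

receivers = [Receiver() for _ in range(3)]
ordered_multicast(receivers, ["m1", "m2", "m3"])
```

The cost of this simplicity is throughput: the sender is blocked for a full round of acknowledgements per message, which motivates funneling many senders through one sequencing process instead.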
Figure 1.3: Multiple senders - single receiver

In general, a group communication mechanism must operate between many senders and many receivers. In such a system, a message sent from a sender may arrive at a destination before the arrival of a message from another sender; however, this order may be reversed at another destination. A solution [4] to ordering messages in such a system is to make it appear as a combination of two simple systems: one with many senders and a single receiver, and one with a single sender and many receivers. The senders send their messages to a single receiver, which then transmits the messages to the rest of the receivers in an orderly fashion. Thus, the single receiver acts as a funnel process, as shown in Figure 1.4. This idea can be incorporated into our design without any additional cost because the group managers which we use to guarantee reliable delivery can serve as the funnel processes as well. Thus, in our scheme, the senders send the messages intended for a particular group to its primary manager, which then reliably and orderly transmits them to the members of the group.

Figure 1.4: Multiple senders - multiple receivers

1.5 Related Work

Although group communication has received considerable attention [1,3,4,7,8,9], only a few distributed systems have actually implemented such facilities. We have chosen to look at four such projects which we consider relevant to our work.

The V system [7] defines reliable group communication to mean that at least one member of the group receives the message and replies to it. Each host has information only about local members of the groups. This information includes the identifiers of the local members and their group addresses.
So when messages are sent to a group address, hosts where members of this group are executing will receive them and deliver them to the members. The underlying kernel will retransmit the packet until at least one of the members of the group acknowledges the message. Therefore the V Kernel supports only a very basic group communication mechanism to transport a message to multiple processes; additional properties such as reliability and order have to be built on top of it.

Cristian et al. [8] proposed a protocol for the reliable and ordered delivery of a message to all hosts in a distributed system (i.e., broadcast), whereas our focus is on the delivery of a message to a set of processes, several (or all) of which could reside on a single host. Their protocol is based on a simple information diffusion technique. A sender sends a message on all its (outgoing) links, and when a new message is received on some (incoming) link by a host, it forwards that message on all other (outgoing) links. After the reception of the message at a host, its delivery is delayed for a period of time determined by the intersite message delivery latency. The messages are time-stamped to enable ordered delivery and to detect duplicates. The performance of this protocol is dependent on the accuracy with which the clocks are synchronized and the
The token host sequences the messages and transmits them to the rest of the hosts in a datagram fashion. If a host misses a sequence number then it sends the token host a negative acknowledgement for the missing message. The token host is rotated among the operational hosts to provide reliability and resilency. Birman's ISIS system [1] supports reliable group communication mechanism similar to our's. However, to ensure the order property, the messages are not funneled through a coordinator, instead a two-phase protocol is used. The protocol maintains  a set of priority queues for each member, one for each stream of  messages, in which  it buffers messages before placing  them  on the delivery  queue. When a message is received by a member, it temporarily assigns this message an integer priority value larger than the priority value of any message that was placed in the priority  queue corresponding  to the message's  stream.  Each member sends back this priority value to the sender. The sender collects all the replies and computes the maximum value of all the priorities received. It sends this value back to the recipients which assign this priority to the new message and place it on the priority queue. The messages are then transferred from  the priority  queue to the delivery queue in order of increasing priority.  This guarantees order. However, the sender has to reliably communicate with the  11 members twice before the message is delivered. All the above works make the same assumptions as outlined in Section 1.3. In addition to these, they also assume that the underlying network never partitions. We  do not make such an assumption. In our scheme if the network  partitions resulting in subgroups of sites, communication remains  possible. When  the networks  remerge  within these subgroups  again, the proposed mechanism  merges these subgroups to form a single group.  1.6  Outline  of the Thesis  The rest of the thesis is organized as follows. 
Chapter Two examines the properties of a reliable group communication mechanism. Issues related to reliability, namely full delivery, order, atomicity and survivability, as applicable to our group communication mechanism, are also discussed. In Chapter Three we describe our group communication mechanism in detail and discuss how the scheme works in the presence of failures. Chapter Four describes the implementation details and the performance of the proposed mechanism. Chapter Five concludes this work.

Chapter Two
Properties of a Reliable Group Communication Mechanism

An important property of group communication mechanisms is reliability. Many researchers use the term reliability to mean full delivery of the message, i.e., assuming the sender does not fail in the middle of transmission, messages are delivered to all members of the group [4,7]. However, in this thesis, we consider a group communication mechanism reliable only if it satisfies two aspects of reliability: full delivery and correctness. Issues related to correctness are order, atomicity and survivability. The order property guarantees that messages sent from all the senders are delivered in the same order to all operational members of the group. Atomicity ensures that every message transmitted by a sender is either delivered to all operational members of the group or to none of them. Survivability is a measure of how well the mechanism is able to tolerate and recover from failures. The following section briefly discusses the concept of full delivery. In Section 2.2, issues related to the correctness of a group communication mechanism running on a cluster of diskless workstations in a distributed system are discussed. Section 2.3 outlines the group send primitives provided by the proposed group communication mechanism which satisfy the above properties. Section 2.4 concludes this chapter.
2.1 Full Delivery

Full delivery ensures that a message sent to a group will be delivered to all operational members provided the sender does not fail in the middle of the transmission. If the underlying network supports broadcast or multicast facilities, then one way to implement full delivery is for the sender to first broadcast the message to the group in a datagram fashion and later transmit the message individually, using one-to-one IPC, to the members which did not receive the message the first time. However, if the underlying network supports only unicast, then the sender may adopt the brute-force method of sending the message to each member individually using one-to-one IPC. Therefore, as long as the underlying system supports one-to-one IPC which guarantees reliable delivery of a message to its destination, the group communication mechanism can ensure the full delivery property.

2.2 Correctness

In addition to full delivery, we also attempt to ensure the correctness of the proposed group communication mechanism. In this section we discuss the issues related to the correctness property.

2.2.1 Order

In a distributed system where processes coordinate their actions by sending messages to one another and do not use a global clock for synchronization, events can only be partially ordered in terms of the happened before relation [13]. If we assume that sending or receiving a message is an event in a process, then we can define the happened before relation, denoted by ->, as follows:

1. If p and q are events in the same process and p occurs before q, then p->q.
2. If event p corresponds to sending a message by one process and event q corresponds to receiving the same message by another process, then p->q.
3. If p->q and q->r, then p->r.

Two distinct events p and q are concurrent if neither p->q nor q->p.
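The three rules above can be phrased as reachability in a small event graph: rules 1 and 2 supply the direct edges, and rule 3 is the transitive closure. A sketch under invented names (the event tuples and function names are illustrative only):

```python
# Happened-before as graph reachability. Process P sends a message at event
# ("P", 1) that process Q receives at event ("Q", 1); each process also has
# one later local event.

edges = {
    (("P", 1), ("P", 2)),  # rule 1: same process, earlier event first
    (("Q", 1), ("Q", 2)),
    (("P", 1), ("Q", 1)),  # rule 2: send happens before receive
}

def happened_before(a, b):
    """Rule 3: -> is the transitive closure of the direct edges."""
    frontier, seen = {a}, set()
    while frontier:
        x = frontier.pop()
        seen.add(x)
        for (u, v) in edges:
            if u == x and v not in seen:
                frontier.add(v)
    return b in seen and b != a

def concurrent(a, b):
    """Distinct events with no path in either direction are concurrent."""
    return a != b and not happened_before(a, b) and not happened_before(b, a)
```

For example, ("P", 1) happened before ("Q", 2) through the send/receive edge, while ("P", 2) and ("Q", 1) are concurrent: the relation is only a partial order.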
Another way of viewing the definition of happened before is to say that p->q means that it is possible for event p to causally affect event q [13]. Consider two events b and b', where b corresponds to sending a message B to a group G and b' corresponds to sending a message B' to the same group G. If the two events are initiated by the same process and b occurs before b', then b->b'. Therefore all the members of the group will receive the messages B and B' in the same order: B first and B' second.

However, unless the events b and b' are initiated by the same source, one cannot causally relate the two events corresponding to transmitting the messages. Thus, in the general case, the sources send their messages to a single receiver R, which then transmits them to the group members. In the above example, assume that c is the event of R transmitting message B to the members after R has received it from source S. The happened before relation denoting this action is given by b->c. Similarly, if c' is the event of R transmitting message B' to the members after R has received it from source S', then b'->c'. Since events c and c' occur in the same process R, and R can only

Figure 2.1: Happened before relation for ordered delivery
For  some  applications it is not sufficent  that messages from  senders are received in the same order but it is also necessary  different  that this order  be the same as some predetermined one. Birman [1] gives a following example of such a condition. Consider  a process  p which instructs a group of devices  with the message "place wine bottles under taps" and process q that orders the same group of devices with the message "open taps". Clearly, it is important that the first message be delivered to all  members of the group before the  second one. One way this can be implemented in a distributed system that does not use a global clock for synchronization is to require the process p to send a message to process q after the wine bottles have indeed been placed under the taps. This message causally relates the group message from p to that from q. Our  proposed group communication mechanism does not provide this facility which  is left to the application programs. 2.2.2  Atomicity  Atomicity ensures that every message sent to a group is either delivered to all operational members of the group or to none of them. It is important to distinguish  the difference between  full  delivery  and  atomicity.  Full  delivery  ensures the delivery of messages to all members as long as the sender does not fail during the message transmission. However if the sender fails in the middle of the message transmission, it is possible that some of the members have not  16  received the message resulting in a partial delivery. Partial delivery is harmful in many applications. Consider an application using the group communication mechanism to implement a replicated file service. Here, all the file servers will belong  to  a group and updates  are  sent to  this  group using  the group  communication mechanism. If an update is not delivered to any of the file servers,  the files at the  servers  will still be consistent with one another.  
However, if ah update is delivered only to some file servers then some files will be updated while others  are not, resulting in inconsistencies.  Therefore if a  message is delivered to at least one operational member of a group then the group communication mechanism must make sure that this message will be delivered to the rest of the operational members as well. Atomicity property guarantees such an action. 2.2.3 Survivability  Survivability  guarantees  continuous  operation  despite  failures.  In  the  proposed group communication mechanism, the failure of the primary manager in the middle of a message transmission will result in a partial delivery. In order to survive such failures, the primary manager is replicated in all the member sites. In case the primary manager fails, a new primary manager is selected from among these replicas using some selection mechanism. The new primary manager must finish any incomplete message transmission initiated by the failed primary manager before resuming normal operation. Failures may occur during the selection of new primary manager, and the network may partition. The survivability property must ensure that the group communication mechanism will survive any such failures  and still provide order and atomicity to message  transmission. 2.3 Outline of the Group Send Primitives  17 In  this section, we first summarize the properties of the proposed group  communication  mechanism  and  outline  the two  primitives  provided  by the  proposed mchanism. Our  reliable  group  communication  mechanism  satisfies  the  following  properties. 1.  A message sent to a group must be delivered to all operational members of the group or to none of them.  2.  If message B is sent to a group before message B' by the same sender, then if B' is received, B is also received.  3.  
If two messages B and B' are sent by the same sender to the same group, then the messages are received by the members of the group in the same order as they were initiated.

4. If two messages B and B' are sent by two different senders to the same group, then the messages are received by all the members of the group in the same order: either B first and then B', or B' first and then B.

Although the last property is essential in many applications, some applications do not require an order to be enforced between two messages, as the outcome of one may not causally affect the other. For example, consider a computation which updates copies of two different variables V1 and V2 maintained by the members of a group. Assume message B broadcast from a source in the computation is responsible for updating V1, and B' from another source in the same computation is responsible for updating V2. In such a scenario it is not necessary that both messages be received by the members of the group in the same order, as updating the variable V1 does not have any effect on V2 and vice versa. The only requirement here is that all members must receive the updates or none of them should receive the updates.

Since the overhead in enforcing the order property is non-trivial, our group communication mechanism provides two primitives. One guarantees delivery of the messages in the same order to all members of a group, and the other guarantees only atomicity, but messages may be delivered in some arbitrary order. The former type of message transmission is known as OGSEND (Ordered Group Send). OGSEND messages will be delivered in the same order to all members of a group or to none of them. OGSEND message transmission is initiated by invoking ogsend(msg, gid, msgtype), where msg is a pointer to the message to be transmitted and gid is the identifier of the group to which the members belong.
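The ordering guarantee (properties 3 and 4 above) can be illustrated with a small sketch. This is not the thesis's implementation, which is built on the V Kernel; it is a hypothetical Python model (all names are mine) in which a primary manager stamps each message with a sequence number and members deliver messages in stamp order, so all members observe the same order even when the network reorders messages in transit.

```python
# Hypothetical sketch: a funnel process assigns sequence numbers; members
# deliver in stamp order regardless of network arrival order.
import heapq

class PrimaryManager:
    def __init__(self):
        self.next_seq = 0          # sequence number for the next message

    def stamp(self, msg):
        stamped = (self.next_seq, msg)
        self.next_seq += 1
        return stamped

class Member:
    def __init__(self):
        self.pending = []          # min-heap of (seq, msg) not yet deliverable
        self.next_expected = 0
        self.delivered = []

    def receive(self, stamped):
        heapq.heappush(self.pending, stamped)
        # Deliver every message whose stamp is the next expected one.
        while self.pending and self.pending[0][0] == self.next_expected:
            _, msg = heapq.heappop(self.pending)
            self.delivered.append(msg)
            self.next_expected += 1

pm = PrimaryManager()
a, b = Member(), Member()
m1 = pm.stamp("place wine bottles under taps")
m2 = pm.stamp("open taps")
a.receive(m1); a.receive(m2)       # member a sees the original order
b.receive(m2); b.receive(m1)       # member b sees them reordered in transit
assert a.delivered == b.delivered == ["place wine bottles under taps",
                                      "open taps"]
```

Because delivery is driven by the stamp rather than the arrival order, both members end up with identical delivery sequences, which is exactly what funneling through a single sequencing process buys.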
The second type of transmission, UGSEND (Unordered Group Send), does not guarantee ordered delivery but ensures atomicity. UGSEND message transmission is initiated by invoking ugsend(msg, gid, msgtype). In both primitives, if the IMMEDIATE_REPLY bit is not set in msgtype, then the process which invokes these primitives will be unblocked only after the message is delivered to all the operational members of the group. Otherwise, the sending process may be unblocked before the message is indeed delivered to the members of the group, as explained in Section 3.2.

2.4 Chapter Summary

The properties of a reliable group communication mechanism include full delivery and correctness. Full delivery ensures that a message sent to a group will be delivered to all operational members provided the sender does not fail in the middle of the transmission. Issues related to correctness are order, atomicity and survivability. In the proposed mechanism, ordering is achieved by funneling the messages through a single process. Atomicity guarantees that if a message is delivered to at least one operational member of a group, then it will be delivered to the rest of the operational members as well. Survivability ensures continuous operation despite host, process and network failures. The proposed mechanism provides two group send primitives, ogsend and ugsend, with the above properties.

Chapter Three
Design of the Proposed Group Communication Mechanism

This chapter describes the design philosophy of the proposed group communication mechanism. We have seen in Chapter Two that a reliable group communication mechanism requires some form of coordination to ensure the full delivery and correctness properties. Thus, in the proposed group communication mechanism, each group has a primary manager process which maintains the membership list of the group and also acts as a funnel process for the messages transmitted to the members of the group.

In order to ensure survivability in case of primary manager failure, the primary manager is replicated at all the member sites. We call these replicas secondary managers. These secondary managers do not take part in any group management activities, which are carried out only by the primary manager. Secondary managers act as backups, so that in case of primary manager failure, one of the secondary managers will take over as the new primary manager. Section 3.1 describes the activities of the primary and the secondary managers as well as the group state information maintained by them. In Section 3.2, the design of the group send primitives outlined in Section 2.3 is discussed. Section 3.3 describes the failure detection and recovery procedures for group members, the primary manager and the secondary managers. Obviously, failure of the primary manager is more serious than the failure of a secondary manager. A new primary manager must be selected from among the secondary managers. There are several schemes proposed in the literature to select a leader in an environment such as ours. Section 3.4 gives an overview of the selection schemes and presents a new scheme, based on a finite state model, used in our proposed group communication mechanism. Section 3.5 deals with a different kind of failure, i.e., failure due to network partition. Section 3.6 concludes this chapter.

3.1 Primary and Secondary Group Managers

Each group has a primary manager and zero or more secondary managers. When a new group is created, a primary manager for this group is also created on the same host by the underlying group mechanism.
When a member from a different host joins the group, and a secondary manager for this group does not already exist on the joining member's host, a secondary manager for this group will also be created on that host. The primary as well as the secondary managers maintain the process identifiers (pids) of the members of the group local to their respective hosts in the local group member list. Also, the primary and secondary managers maintain the pids of all the managers for the group in their manager member lists. When a new secondary manager is created, the primary manager's manager member list is copied into the new secondary manager's manager member list. The primary manager then updates its manager member list with the pid of the new secondary manager and informs all the secondary managers of the group to update their lists as well. When a member joins the group from a host where the primary or a secondary manager for this group already exists, the pid of the new member is simply added to the local group member list.

Although group membership information is distributed across all the hosts, the primary and secondary managers maintain information only about those members of the group executing locally (i.e., local members). Thus, when a message is sent to a group, the group communication mechanism must make sure that the message is delivered to all the group managers, each of which will then deliver the message to its local members. This requires less space, less network traffic and reduced code complexity compared to the case of replicating the entire membership information in the primary manager and in all the secondary managers.

3.2 Design of the Group Send Primitives

This section presents the design details of the group send primitives ogsend and ugsend (outlined in Section 2.3), which make use of the primary manager and the secondary managers to provide the reliability properties discussed in Chapter Two.
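The membership bookkeeping of Section 3.1 can be sketched as follows. This is a simplified, hypothetical Python model (the class and function names are mine, not the thesis's): each manager keeps a local group member list for members on its own host, while the manager member list is replicated at the primary and every secondary manager and updated whenever a new secondary manager is created.

```python
# Hypothetical sketch of the two lists maintained by group managers.
class Manager:
    def __init__(self, pid, host):
        self.pid = pid
        self.host = host
        self.local_members = []    # pids of members on this host only
        self.manager_list = []     # pids of all managers, replicated

def join(group_managers, member_pid, host):
    """Add a member; create a secondary manager if the host has none."""
    for mgr in group_managers:
        if mgr.host == host:
            mgr.local_members.append(member_pid)
            return
    # No manager on this host: create a secondary manager there.
    new_mgr = Manager(pid=f"mgr-{host}", host=host)
    new_mgr.local_members.append(member_pid)
    # Copy the primary's view, then add the newcomer itself.
    new_mgr.manager_list = [m.pid for m in group_managers] + [new_mgr.pid]
    for mgr in group_managers:         # primary tells every manager to update
        mgr.manager_list.append(new_mgr.pid)
    group_managers.append(new_mgr)

primary = Manager(pid="mgr-h1", host="h1")
primary.manager_list = [primary.pid]
managers = [primary]
join(managers, "p1", "h1")   # same host: added to the local member list
join(managers, "p2", "h2")   # new host: a secondary manager is created
assert primary.local_members == ["p1"]
assert all(m.manager_list == ["mgr-h1", "mgr-h2"] for m in managers)
```

The point of the sketch is the space trade-off described above: full membership is never replicated; only the (small) manager member list is.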
3.2.1 Ordered Group Send (OGSEND) Primitive

Messages sent to a group by invoking the ogsend primitive are delivered in the same order to all members of the group. OGSEND messages to a group are first received by the primary manager for that group, which will then sequence the messages in the order they were received and send them to its local members and to the secondary managers of the group. When the secondary managers receive the message, they deliver it to their local members. After ensuring that the message is received by all the secondary managers (i.e., after all the secondary managers acknowledge the receipt of the message), the primary manager unblocks the sender, which has remained blocked since invoking ogsend. A high level description of the message transmission activity of the primary manager is given in Figure 3.1. The primary manager will receive a new message for transmission only after it has completed the delivery of the previous message. Messages arriving at the primary manager's host while it is not ready to receive will be queued first-in-first-out (FIFO) in the primary manager's message queue by the underlying system. When the primary manager is ready to receive a message, the underlying system will deliver the first message in the message queue (if any) to it. Since the message queue is FIFO, messages will be delivered to the primary manager in the order they arrive.

Type : group manager
Task : sending a group message

FOREVER DO
Begin
  Receive (from source)
  Send (to all members)
  Reply (to source)
End

Figure 3.1 Group manager's message transmission

The method of transmitting the message from the primary manager to the secondary managers depends on the functionality of the underlying network architecture. If the network supports only a unicast facility, then the primary manager can send the message to the individual secondary managers using one-to-one IPC. However, if the network also supports a broadcast facility, then the message can first be multicast to the group's secondary managers in a datagram fashion. The primary manager then waits for positive acknowledgements. If the acknowledgements are not received from some secondary managers at the expiration of a specific time interval, the primary manager resends the message to those secondary managers using one-to-one IPC. This allows the primary manager to exploit the acknowledgement and retransmission properties of the one-to-one IPC for reliable delivery and to determine the failures described in Section 3.3.

Let us assume that it takes an average of T1 seconds for a message from the primary manager to be delivered to the secondary managers, an average of T2 seconds for this message to be processed by a secondary manager, and an average of T3 seconds for an acknowledgement from a secondary manager to be received and processed by the primary manager. Thus, if the group has n secondary managers, it will take an average of (T1 + T2 + nT3) seconds for all the secondary managers to acknowledge the primary manager's message (assuming no resends). If T2 includes the time required by a secondary manager to deliver the message to its local members and to receive their acknowledgements (i.e., one-to-one IPC), then T2 is an application dependent quantity. Thus, if some local members take a long time to process the sender's message, then their secondary manager cannot immediately send an acknowledgement to the primary manager, which in turn cannot unblock the sender. However, if the sender wishes that all the members should receive the message before the primary manager unblocks it, then there is no alternative but to wait for the secondary managers to acknowledge after guaranteeing that the message has been received by all of their local members.

On the other hand, it may be acceptable to unblock the sender after the secondary managers have received the message, without waiting for acknowledgements from all the group members. In this case, the secondary managers can send acknowledgements to the primary manager without waiting to deliver the message to their local members. The primary manager then unblocks the sender. The secondary managers will queue the messages in the delivery queue for the local members, and when the local members are ready to receive, they can obtain the messages from the delivery queue. The implementation provides flexibility for the applications to specify which scheme they prefer using the IMMEDIATE_REPLY bit of the msgtype parameter in the ogsend and ugsend primitives.

The mechanism described so far is inadequate to guard against duplicate messages. For example, acknowledgements for the datagram from some secondary managers may not be received by the primary manager within the specific time period if the message or the acknowledgement is lost. Thus, when the primary manager resends a message, the secondary managers which did not receive the message the first time around will be receiving the right message. However, those secondary managers whose acknowledgements were lost will be receiving a duplicate message. Therefore, the group communication mechanism should incorporate some duplicate detection scheme. In our proposed mechanism, the primary and secondary managers use transaction identifiers to detect and discard duplicate messages. Transaction identifiers are simply integer values. The primary manager maintains a variable called ogsend-send-seq-no (ossno) which keeps track of the next OGSEND message's transaction identifier. When an OGSEND message transmission is initiated, the primary manager assigns the ossno to the message and transmits it to the secondary managers. The ossno is then incremented by one, ready to be used with the next OGSEND message. On the receiving end, the secondary managers maintain a variable called ogsend-receive-seq-no (orsno) to keep track of the transaction identifier of the next incoming OGSEND message. When an OGSEND message is received by a secondary manager, the ossno and orsno are compared and, depending on their values, the message is either delivered to the local members or discarded.

The OGSEND primitive therefore guarantees the delivery of the messages to all operational members of the group in the same order. Failure detection and recovery procedures for OGSEND message transmission in the event of primary manager failure will be discussed in Section 3.3.

3.2.2 Unordered Group Send (UGSEND) Primitive

Although many applications require that messages be delivered to all the members of a group in the same order, some applications do not require such strict ordering with its attendant overhead. For these applications, the proposed group communication mechanism provides a primitive called ugsend. Messages transmitted by invoking ugsend are guaranteed to be delivered to all the members of the group, but in some arbitrary order.

Unlike ordered delivery, unordered delivery does not require that the messages be funneled through a single receiver. Thus, we have multiple senders and multiple receivers. If each sender maintains a list of all the receivers' pids, then every sender can participate in the message transmission activity.
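The transaction-identifier check described above for OGSEND (the same comparison is used for UGSEND with ussno and ursno) can be sketched as follows. This is a hypothetical Python rendition, not the thesis's implementation: a retransmission carries the original transaction identifier, so a receiver whose counter has already moved past it recognizes and discards the duplicate while still acknowledging it.

```python
# Hypothetical sketch of duplicate detection with integer transaction ids.
class SecondaryManager:
    def __init__(self):
        self.orsno = 0            # transaction id expected next
        self.delivered = []

    def receive(self, ossno, msg):
        if ossno == self.orsno:   # the message we were waiting for
            self.delivered.append(msg)
            self.orsno += 1
        # ossno < orsno: duplicate of an already-delivered message; discard
        # it silently, but still acknowledge so the sender stops resending.
        return "ack"

sm = SecondaryManager()
sm.receive(0, "update-A")
sm.receive(0, "update-A")         # resent after a lost acknowledgement
sm.receive(1, "update-B")
assert sm.delivered == ["update-A", "update-B"]
```

Note that acknowledging even a discarded duplicate is essential; otherwise the primary manager would resend forever.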
Even though the messages from senders can be guaranteed to be delivered to all the members, they may not be delivered in the same order. In the proposed group communication mechanism, each secondary manager has information about the primary manager as well as all the secondary managers (the manager member list). Thus, every secondary manager can initiate a message transmission similar to the primary manager's OGSEND transmission. For example, assume that c is the event corresponding to secondary manager C transmitting the message M to the members of the group after it has received it from sender S. Similarly, c' is the event of secondary manager C' transmitting the message M' to the members of the group after C' has received it from sender S'. Since events c and c' occur at different processes, it is possible that both events may occur at the same time. Under such circumstances, message M will be delivered before message M' to some members of the group and in the reverse order to the rest of the members.

Thus, when applications invoke the ugsend primitive to send messages to a group, the group communication mechanism first checks to see whether there is a manager for this group available on the sender's host. If there is, the message will first be sent to it, which will then transmit it to the rest of the managers, each of which in turn delivers the message to their local members. However, if a local manager does not exist, then the message is sent to the primary manager for the group, which then transmits it to its local members and to the secondary managers for the group.

Similar to OGSEND transmission, UGSEND transmission needs a mechanism to detect duplicates. In our scheme, when the primary or a secondary manager for a group transmits a UGSEND message, a ugsend-send-seq-no (ussno) is assigned to the message. The recipients will have the corresponding ugsend-receive-seq-no (ursno).
When a manager for the group receives a UGSEND message, the ussno and ursno are compared and, depending on their values, the message will either be delivered to the local members or discarded.

3.3 Failure Detection and Recovery Procedures

To ensure that the group communication mechanism provides reliable service, the survivability property must be guaranteed. Survivability is a measure of how well a system can tolerate and recover from failures. Our discussion in this section will focus on two aspects of failures: process failures and host failures. We have assumed that application processes such as group members may fail, but operating system processes such as primary or secondary managers, which are used to implement the group communication mechanism, are well debugged and do not fail unless the host machine itself fails. When a host fails, all the processes in it fail. Therefore host failures are more serious than process failures. Suppose a host fails while a primary or a secondary manager executing in it is in the middle of a message transmission; it is possible that some of the members will not receive the message, resulting in a partial delivery. The following section briefly describes failures of group members. In Section 3.3.2, failure detection and completion of any incomplete UGSEND message transmission in the event of secondary manager failure will be discussed. The case of primary manager failure is described in Section 3.3.3.

3.3.1 Group Member Failure

Failure of a group member does not affect the group communication activities of the other operational group members. Our group communication mechanism guarantees that all the operational group members will receive the messages sent to them. The failure of a member is detected when a primary or a secondary manager tries to deliver a message to the failed member using a one-to-one IPC (refer to Section 3.2.1).
On detecting the failure of one of its local members, the primary or the secondary manager simply removes the failed member's pid from its local group member list. After the removal, if the local group member list maintained by a secondary manager becomes empty, then this manager ceases execution. On the other hand, if the primary manager's list becomes empty and its manager member list is also empty, then the primary manager ceases execution and the group is considered nonexistent.

3.3.2 Secondary Manager Host Failure

If a secondary manager fails, the primary manager has to detect this and finish any incomplete UGSEND message transmission initiated by the failed secondary manager. To detect the failure of secondary managers, the primary manager has several options. All these schemes exploit the positive acknowledgement property of one-to-one IPC to determine process failure.

In one scheme, the primary manager will detect the secondary manager failure during the next OGSEND or UGSEND message transmission, or during the transmission of a group view update such as the creation of a new secondary manager for the group. When the primary manager tries to send a message to the individual secondary managers using one-to-one IPC, if a secondary manager has failed then the underlying system will inform the primary manager that it is trying to send a message to a nonexistent process.

Another scheme is to create a vulture process to look for the failure of a secondary manager. A vulture is a lightweight process created by the primary manager to detect the failure of a secondary manager sm(i) and notify the primary manager about it (see Appendix A). The vulture is blocked on the ReceiveSpecific primitive provided by the V Kernel, on top of which the proposed mechanism is built (see Appendix A). As long as sm(i) does not fail, the vulture remains blocked, since sm(i) does not send any messages to it. However, if sm(i) fails, the kernel at the vulture's host informs the vulture that it is trying to receive a message from a nonexistent process; the vulture then informs the primary manager about the failure of sm(i). A high level description of the vulture process is shown in Figure 3.2. For this scheme to detect the failures of all the secondary managers, the primary manager may create a vulture process for each secondary manager in the group.

Type : vulture
Task : detect failure of primary manager (pm)

Begin
  ReceiveSpecific (from pm)
  Send ("pm failed" to secondary manager)
End

Figure 3.2 Vulture process

Rather than creating a vulture for each secondary manager, the primary manager may create a single process called a prober to check the liveliness of its secondary managers. The prober periodically sends the probe message ARE_YOU_ALIVE to each secondary manager. The probe messages are sent using one-to-one IPCs, to which the secondary managers reply with I_AM_ALIVE messages. If the prober fails to receive a reply from a secondary manager, then the prober will notify the primary manager that the secondary manager has failed. A high level description of the prober process is given in Figure 3.3.

The first scheme may take longer to detect a secondary manager failure. This is due to the fact that the primary manager depends on its next message transmission, which may not happen for a long time. The second scheme is expensive because a separate vulture process has to be created for each secondary manager. The third option is less expensive than the second scheme because a single prober process detects the failures of all the secondary managers. It is also faster than the first scheme because it does not depend on message transmission. Our prototype implementation uses the third scheme.

Type : prober
Task : detect failure of secondary managers

FOREVER DO
Begin
  For i = 1 to n Do
    Send (ARE_YOU_ALIVE to sm(i))
    If reply != I_AM_ALIVE Then
      Send (sm(i) failed to primary manager)
  Sleep (for a specified time period)
End

n = number of secondary managers

Figure 3.3 Prober process

Once the failure of a secondary manager is detected, the primary manager has to delete the failed secondary manager's pid from its manager member list and inform the rest of the operational secondary managers to do the same, in order to maintain a consistent group view. However, before doing this, the primary manager must finish any incomplete UGSEND message transmission initiated by the failed secondary manager. The primary manager requests all the secondary managers to send to it the last UGSEND message they received. If the returned messages as well as the last message received by the primary manager have the same transaction identifier value, then the failed secondary manager has either successfully completed its last UGSEND message transmission or no member has received its last UGSEND message. Either of these outcomes assures atomicity.
However, if there is a discrepancy among the transaction identifier values†, then the primary manager takes the message with the highest transaction identifier and transmits it to the secondary managers. Those secondary managers that have already received the message simply discard the duplicates, but the others receive the message and deliver it to their local members. Once this message retransmission activity is completed, the primary manager deletes the failed secondary manager's pid from its manager member list and informs the rest of the operational secondary managers about the failure.

† The transaction identifier values will differ by at most one.

3.3.3 Primary Manager Host Failure

The primary group manager fails when the host on which it is executing fails. Primary manager failure is more serious than secondary manager failure. If the primary manager fails, OGSEND activities cannot be carried out and new members cannot join the group from a host where a secondary manager for this group does not reside. Also, failures of secondary managers cannot be detected, and incomplete message transmission activities initiated by the failed managers cannot be finished. Even though the operational secondary managers may be able to participate in UGSEND message transmission, one cannot guarantee atomic delivery. Thus, a group cannot exist without a primary manager and function correctly for any extended length of time. In order to provide a continuous group communication mechanism, secondary managers must employ a scheme to detect the failure of the primary manager and select a new primary manager from among themselves. One possible scheme is the death-will scheme proposed by Ravindran [14].
In this case, we assume that the underlying kernel supports facilities for a process to create aliases that may reside in different address spaces (on the same or a different machine) to perform functions related to failure detection and notification on behalf of their creator. Another scheme is that if the prober method is used by the primary manager to detect secondary manager failures, the lack of probes for an extended period of time will indicate primary manager failure. However, this scheme requires each secondary manager to create a timer. Yet another possible scheme is that each secondary manager may create a vulture process to look for the failure of the primary manager. Since the underlying system on which the prototype of the proposed mechanism is built supports such an abstraction, this last scheme has been implemented.

Every secondary manager is a potential candidate to become the next primary manager, due to the fact that each of them has the same global view of the group and each of them has the capability of detecting the primary manager's failure. The scheme to select a new primary manager must deal with several issues which may arise. For example, there may be inconsistencies due to two or more secondary managers attempting to become the new primary manager. Failures may even occur during the selection of the new primary manager itself. Therefore, the scheme must guarantee that when the selection is over, the group is left with only one primary manager and all the secondary managers know the identity of the new primary manager. The following section gives an overview of several possible selection schemes, and in Section 3.4.1 a finite state model of the selection scheme used in the proposed group communication mechanism is presented.

3.4 New Primary Manager Selection Scheme : An Overview

Let us first examine some of the existing selection schemes, which include token passing [3] and election [10].
The token passing scheme is suitable in an environment where the leadership is rotated among the members even when there are no failures, as in Chang's [3] reliable broadcast protocol implementation. The election scheme is suitable when a new leader is selected only if the old leader has failed. In the election scheme, each potential candidate does not have any knowledge about the other candidates and depends on random timeouts before proclaiming itself the new leader. Such a scheme is used in electing a leader in TEMPO, a distributed clock synchronizer running on UNIX 4.3BSD systems [10].

To elect a primary manager we propose a scheme called the succession list scheme, which is simpler than the election scheme. In this scheme each potential candidate has information about the other candidates. Normally this information is in the form of an ordered list. For example, the list may be ordered by increasing value of the candidates' pids, or ordered by the time at which the candidates were created. All the candidates agree to select the first (or the last) candidate in the list as the successor when the leader fails.

In our group communication mechanism, each secondary manager has its manager member list ordered by the time the managers joined the group. Therefore, the pid of the first secondary manager to join the group will be first after the primary manager's pid in the manager member list, and the pid of the last secondary manager to join will be last in the list. In case of primary manager failure, the first operational secondary manager in the list will become the next primary manager. Based on this, one could propose a very simple succession list scheme where the younger secondary managers wait until the oldest secondary manager informs them about the new leadership, without running an agreement protocol among themselves. However, this will not work under certain circumstances.
Suppose the oldest secondary manager also fails immediately after the primary manager has failed; the secondary manager's failure will not be notified to any of the operational secondary managers. In this case, if the younger secondary managers just wait to hear from the oldest secondary manager without probing it, they may wait forever. Therefore, it is necessary to run an agreement protocol among all the operational secondary managers before a new primary manager is selected. In the next section we use a finite state model to explain the details of our selection scheme.

3.4.1 Succession List Selection Scheme : A Finite State Model

During their lifetime, secondary managers can be in one of a finite number of states. Transition from one state to another is caused by the arrival of a message. A state transition may cause a secondary manager to transmit a message which triggers subsequent transitions in other secondary managers. It is important to clarify that in explaining this scheme we focus only on the state of one secondary manager, say sm(i), and not on the state of the entire distributed program. Figure 3.4 represents the state diagram for a secondary manager sm(i). Circles represent states; arrows represent transitions. A transition occurs upon the arrival of a message or the occurrence of an event. The event which causes the transition is shown on the upper part of the label associated with the arrow.

[Figure 3.4 State transition diagram of primary manager selection scheme]

The lower part of the label shows the message that is sent at the time of the transition. An asterisk by a message indicates that it is a broadcast (i.e., the message is sent to all the secondary managers). A null label indicates that no message is sent or received.
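The transitions of Figure 3.4 can be approximated by a small table-driven sketch. This is a hypothetical and much-simplified Python rendition (the event names are mine, only the transitions discussed in the text are modelled, and message sending is reduced to a returned action string), not the thesis's implementation.

```python
# Hypothetical state table for one secondary manager sm(i), after Figure 3.4.
# Each entry maps (state, event) to (next_state, action-message or None).
TRANSITIONS = {
    ("active",    "pm_failed"):        ("suspended", None),
    ("suspended", "first_in_list"):    ("candidate", "broadcast CANDIDATE"),
    ("suspended", "got_CANDIDATE"):    ("relax",     "send ACCEPT_CANDIDATE"),
    ("active",    "got_CANDIDATE"):    ("relax",     "send ACCEPT_CANDIDATE"),
    ("relax",     "candidate_failed"): ("suspended", None),
}

def step(state, event):
    """Apply one transition; raises KeyError on an unmodelled pair."""
    return TRANSITIONS[(state, event)]

# sm(i) learns of the primary's failure and finds itself first in the list:
state, action = step("active", "pm_failed")
state, action = step(state, "first_in_list")
assert state == "candidate" and action == "broadcast CANDIDATE"

# Another secondary instead receives the CANDIDATE broadcast while active:
state, action = step("active", "got_CANDIDATE")
assert state == "relax" and action == "send ACCEPT_CANDIDATE"
```

A table-driven encoding like this makes the safety argument easy to inspect: every reachable (state, event) pair and its emitted message is listed in one place.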
3.4.1.1 Description of the States

active: Normally the primary and the secondary managers of a group will be in the active state.

suspended: This is a transition state which is reached when secondary manager sm(i) learns about the failure of the primary manager. In this state, sm(i) checks to see if there are secondary managers ahead of it in the manager member list. If so, it sends a probe message INFORM_STATUS to sm(j), the first among the secondary managers in the list ahead of sm(i). If the underlying system informs sm(i) that sm(j) is not operational, sm(i) probes the next secondary manager down the list. However, if sm(j) replies to the probe message with the message CANDIDATE, then sm(i) enters the relax state. On the other hand, if sm(i) finds out that it is the first operational secondary manager in the manager member list, it broadcasts the message CANDIDATE to all the secondary managers and enters the candidate state.

relax: We have already seen how a secondary manager can reach this state from the suspended state. A secondary manager can also reach this state from the active state. For example, if there is a delay in learning about the primary manager's failure, sm(i) will still be in the active state. If it now receives the message CANDIDATE from sm(j), which is in front of sm(i) in the list, then sm(i) will enter the relax state after sending the reply message ACCEPT_CANDIDATE to sm(j). On entering the relax state, sm(i) notes down the candidate's identifier in its last candidate field. It also informs its vulture process to detect the failure of the primary manager candidate.

While sm(i) is in the relax state, the primary manager candidate may fail. When sm(i) learns about this failure from its vulture, it returns to the suspended state.
However, if there is a delay in learning about this failure, sm(i) may receive either the message INFORM_STATUS or CANDIDATE from some other secondary manager. For example, if sm(i) is the next operational secondary manager in the list after the failed candidate, other secondary managers will probe it with INFORM_STATUS. On the other hand, the CANDIDATE message may be received from a secondary manager sm(k) which is the first operational secondary manager in the manager member list. Sm(i) in the relax state may also receive the CANDIDATE message in other circumstances. For instance, it may have entered the relax state from the suspended state because secondary manager sm(j), ahead of sm(i) in the manager member list, replied with the message CANDIDATE to sm(i)'s INFORM_STATUS probe. If this happened before sm(j) broadcast the message CANDIDATE, then sm(i) will receive the message CANDIDATE from sm(j) one more time.

If sm(i) receives the CANDIDATE message while in the relax state, it will compare the pids of the last candidate and the sender of the message. If they are different, then a new candidate has initiated the election. Sm(i) instructs its vulture to look for the failure of the new candidate, notes down the new candidate's pid in its last candidate field, and replies to the new candidate with the ACCEPT_CANDIDATE message. However, if the pids are the same, then sm(i) is already monitoring the right candidate and therefore it simply replies with the ACCEPT_CANDIDATE message.

candidate: This is the state in which a secondary manager declares its intention to contend for the primary manager position. As explained under the suspended state, the first operational secondary manager in the manager member list will reach this state after broadcasting the CANDIDATE message to all the secondary managers.
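The message-driven transitions described so far can be collected into a small table. The sketch below is illustrative only: the lowercase event names are descriptive stand-ins (they are not messages defined by the mechanism), and "*" marks a broadcast reply, following the convention of Figure 3.4:

```python
# Compact sketch of part of the transition structure of Figure 3.4 for one
# secondary manager sm(i).  (state, event) -> (next state, reply message).

TRANSITIONS = {
    ("active",    "primary_failed"):    ("suspended", None),
    ("active",    "CANDIDATE"):         ("relax",     "ACCEPT_CANDIDATE"),
    ("suspended", "probe_acked"):       ("relax",     None),
    ("suspended", "first_operational"): ("candidate", "CANDIDATE*"),
    ("relax",     "candidate_failed"):  ("suspended", None),
    ("relax",     "CANDIDATE"):         ("relax",     "ACCEPT_CANDIDATE"),
    ("candidate", "INFORM_STATUS"):     ("candidate", "CANDIDATE"),
}

def step(state, event):
    """Apply one transition; events not in the table leave state unchanged."""
    return TRANSITIONS.get((state, event), (state, None))
```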
Secondary manager sm(i) can also reach this state from the active state. For example, if there is a delay in learning about the failure of the primary manager, sm(i) will still be in the active state. While in the active state, if sm(i) receives an INFORM_STATUS message from secondary manager sm(k), which is behind it in the manager member list, sm(i) will reply to sm(k) with the CANDIDATE message. It will then broadcast the CANDIDATE message to all the secondary managers and enter the candidate state. Secondary manager sm(i) can also enter the candidate state from the relax state, as explained under relax.

If sm(i) receives an INFORM_STATUS message while in the candidate state, it will reply with the message CANDIDATE. After validating that all the secondary managers have received its CANDIDATE message and have reached the relax state (i.e., after receiving the ACCEPT_CANDIDATE message from all the operational secondary managers), the primary manager candidate will finish any incomplete OGSEND or group view message transmissions initiated by the failed primary manager. The new primary manager will then broadcast the message PMGR_ACTIVE to all the secondary managers. On receiving that message, the secondary managers reply with the message PMGR_ACCEPT, update the manager member list, enter the active state and resume normal operations. After the new primary manager has received the PMGR_ACCEPT message from all the operational secondary managers, it creates a prober process and enters the active state to resume normal operation.

3.5 Network Partition

When the network partitions, the group divides into two or more subgroups of hosts; all but one subgroup will be left without a primary manager. In our scheme, the secondary managers within these subgroups will select a primary manager and continue to function. The proposed mechanism assures that the communication within the subgroups will remain reliable and atomic. The more difficult problem is what happens when the partitions merge again. For example, consider a group
with four secondary managers, as shown in Figure 3.5(a). The primary manager M is identified by the number 1 and the secondary managers by the numbers 2 through 5. Assume that the group is assigned the multicast address X. The group view maintained by the primary and the secondary managers is then [multicast address X; primary manager 1; manager member list 1,2,3,4,5].

[Figure 3.5(a): Group view before network partition — multicast address X; manager member list 1, 2, 3, 4, 5; primary manager M]

Consider now a network partition which separates hosts 4 and 5 from the rest of the group. As a result, there will be two subgroups: subgroup A, consisting of the primary manager and the secondary managers 2 and 3, and subgroup B, with the secondary managers 4 and 5. Since the underlying system in
A s s u m e that secondary manager  then is  [multicast  illustrated  secondary  manager  M'  address X ; in  Figure  of  within  these  discussed i n C h a p t e r  subgroups reliable  do  period  of  time,  not  merge  to  form  cannot  multicast address manager  member  0  the a  be  and  5  the  will  partitioned  network  2 X;  the  primary  a  primary  4 has been selected in  member  partitioned  subgroups  address  select  view  have  single  group  Consider  multicast address 1,2,3  \  with  X  manager member list 4, 5  primary manager M  Figure 3.5(b)  manager  h a n d , secondary  that  will  4; manager  guaranteed.  X list  as  failed.  subgroup list  is  4,5]. T h i s  network the  B  does  same  not  reliable  Two.  some  communication  long  4  vultures  B . The group  manager  As  properties  After  subgroup  3.5(b).  communication  their  managers  primary  remerge,  by  other  manager  primary  notified  the  [multicast  has  and  be  from  host  a n d secondary m a n a g e r s  manager  the  will  list  A  M  in  failed  B  in  manager  4 and 5  managers  as  subgroup  secondary managers  the  group 1;  and  prober t h a t secondary managers 4 and 5 h a v e  removal  maintained  partition  o O( o primary manager M'  Group view after  network partition  may one a  remerge.  primary UGSEND  If  the  manager, activity  41  initiated b y secondary m a n a g e r 5 i n subgroup B after First,  this  will  resend  message will  the message  the first  be received  message and  secondary m a n a g e r  sent  time by  multicasts  to  those  around.  This  managers  address  h a s remerged.  the message to the address X . T h e n subgroup  will  a l l the members  to the multicast  the secondary  in  the n e t w o r k  B  guarantee  in  subgroup  X  will  i n subgroup  that  failed  that  to  receive  the U G S E N D  B. 
However, the message sent to the multicast address X will also be received by the primary and the secondary managers in subgroup A, as they are also listening on the same multicast address.

The other problem is that there are multiple primary managers listening on the same multicast address, thus confusing those wishing to interact with the group. It is therefore necessary that the subgroups formed during the network partition merge to form a single group with one primary manager. Also, the primary as well as all the secondary managers of this single group should have the same group view after the merge, as illustrated in Figure 3.5(c).

[Figure 3.5(c): Group view after network remerge]

In the following section we describe how our scheme handles the messages initiated by the primary or the secondary managers of one subgroup but received by the managers belonging to another subgroup. Section 3.5.2 describes how to merge multiple subgroups after the partitioned network has remerged.

3.5.1 Discarding Messages from Different Subgroups

After the network has partitioned, the communication within the subgroups will have the reliable properties discussed in Chapter Two. After the network has remerged, these subgroups will be merged together to form a single group. In the meantime, however, a message transmission initiated from one subgroup may be received by the managers belonging to a different subgroup. This kind of reception should either be avoided or, if it cannot be avoided, the received messages should be detected and discarded.

A simple scheme to attempt to avoid receiving these messages would be for each subgroup which
selects a new primary manager to choose a new multicast address. When the primary manager is selected, it chooses a new multicast address and informs its secondary managers about it. For example, in Figure 3.5(b), subgroup B, which selects a new primary manager, chooses a new multicast address X'. Thus the group view maintained by the primary manager M' and its secondary manager 5 will be [multicast address X'; primary manager 4; manager member list 4,5]. If the network remerges, the message transmissions initiated by the primary or the secondary managers in subgroup B will never be received by the managers in subgroup A, as they will be listening on multicast address X. One disadvantage of this scheme is that a new multicast address has to be selected whenever a new primary manager is chosen. This may happen even when the network is not partitioned (i.e., just a primary manager failure). Also, there is no guarantee that independently chosen multicast addresses on different partitions are distinct. This will cause problems when the partitions remerge.

In the proposed mechanism, we use another scheme, in which the messages received by the primary or secondary managers of a subgroup will be discarded if they were not initiated from within their subgroup. In this scheme, when a manager transmits a message, it specifies its primary manager's identifier in the message header. Whenever a manager receives an OGSEND or UGSEND message, it compares its primary manager's pid against the primary manager's pid specified in the message header. If they are different, the message will be discarded.
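This discard rule amounts to a single comparison against the message header. A sketch (the header field name is hypothetical, not from the thesis):

```python
# Sketch of the subgroup discard rule: a manager drops any OGSEND/UGSEND
# message whose sender reports a different primary manager, i.e. a message
# that originated in another subgroup.

def should_discard(msg_header, my_primary_pid):
    return msg_header["primary_pid"] != my_primary_pid
```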
For example, if secondary manager 5 transmits a message after the merger of the partitioned network (but before the merger of the subgroups) to the multicast address X, then even if the primary or a secondary manager of subgroup A receives this message, it will be discarded, as the primary manager M of subgroup A differs from the primary manager M' specified in the message header.†

† Here, we are assuming either that the hardware address is used as the identifier or that the new primary manager will be chosen only from those secondary managers that already existed at the time of the network partition. This is the most likely scenario and guarantees that the primary managers in different subgroups will have different identifiers.

3.5.2 Merging Subgroups

When a partitioned network merges, the subgroups formed due to the partition have to be merged. The primary managers of the different subgroups listening on the same multicast address must detect that more than one primary manager exists for the same group and reach an agreement as to which should become the leader.

3.5.2.1 Detection of Subgroups

In the proposed mechanism, the prober process periodically broadcasts a RESOLVE message to the group's multicast address. This message contains the primary manager's pid and is discarded by the secondary managers. Only primary managers respond to the RESOLVE message. To explain this scheme, consider Figure 3.5(a). When the network is not partitioned, the prober of primary manager M will be broadcasting the RESOLVE message, which will be discarded by secondary managers 2 through 5. After the network partition (Figure 3.5(b)) there are two subgroups, one with primary manager M and the other with primary manager M'. The RESOLVE messages broadcast by the probers of primary managers M and M' will be discarded by the secondary managers 2 and 3 in subgroup A and by secondary manager 5 in subgroup B.
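The RESOLVE handling just described can be sketched as a small dispatch rule. The role names and return values below are illustrative, not part of the mechanism's defined interface:

```python
# Sketch of RESOLVE handling: secondary managers always discard the
# periodic RESOLVE broadcast, and a primary manager that receives a
# RESOLVE carrying a pid other than its own has detected a remerged
# partition (and learns the other primary manager's identity).

def on_resolve(role, my_pid, resolve_pid):
    if role == "secondary" or resolve_pid == my_pid:
        return "discard"
    return ("partition_detected", resolve_pid)
```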
After the network remerges, the multicast address of the two subgroups will still be X. Thus, the RESOLVE message of primary manager M may be received by primary manager M', and vice versa. When a primary manager receives a RESOLVE message, it knows that the network has been partitioned and remerged. It also knows the identity of the other primary manager.

3.5.2.2 Resolving the Leadership

Once a primary manager detects that there exist other primary managers with the same multicast address, all except one of them have to renounce the leadership. We will make use of the finite state model shown in Figure 3.6 to explain the scheme used to resolve the leadership. The scheme works with a pair of primary managers at a time. In explaining this scheme we focus only on the state of one primary manager, say pm(i). In the figure, circles represent states, arrows represent transitions. A transition occurs upon the arrival of a message or the occurrence of an event. The event which causes the transition is shown on the upper part of the label associated with the arrow.

[Figure 3.6: State transition diagram of primary managers resolving leadership upon network remergence. As soon as a primary manager enters the merging state, it initiates the procedure to merge with the chosen primary manager and becomes a secondary manager when the merge is complete (see Section 3.5.2.2).]

The lower part of the label shows the message that is sent at the time of the transition. A null label indicates that no message is sent or received.

Description of the States

active: Normally pm(i) will be in the active state. If it receives a RESOLVE message
The RENOUNCE  field and  message also includes the number  of secondary managers in the manager member list of pm(i) and indicates the intention of the sender to contend for leadership. Sometimes it is possible that pm(i) in the a c t i v e state may receive a RENOUNCE message from pm(j) before it receives the RESOLVE message. When a primary manager  receives a RENOUNCE  message, whether it gives up  it's  leadership or not depends on the number of secondary managers in it's manager member number  list.  In the proposed  scheme, the primary  manager  with  the most  of secondary managers will become the leader. If both contenders have  equal number  of secondary  managers, then the one with the larger  pid will  assume the leadership. Thus, if pm(i) has less number of secondary managers (or equal number of secondary managers but smaller pid), then pm(i) will reply with an  ACCEPT  RENOUNCE  REQ  message  and  enters  the  merging  state.  However, if pm(i) has more number of secondary managers (or equal number of secondary REJECT  managers RENOUNCE  While another  but  larger  pid), then  it  will  reply  with  a  REQ message and remains in the a c t i v e state.  in the a c t i v e state, pm(i) may  receive a MERGE  manager. Pm(i) simply joins the merging  manager  request from  in it's group (see  m e r g i n g state). resolving :  We have seen under a c t i v e how a primary manager enters the r e s o l v i n g state from  the a c t i v e  state  after  sending a RENOUNCE  message  to pm(j). If it  receives an ACCEPT_RENOUNCE_REQ message from pm(j) in response to it's RENOUNCE message, pm(i) will return to the a c t i v e state. Pm(i) also returns to  47  the  active  state  if pm(j) fails.  On  the other  hand  if pm(i) receives a  REJECT_RENOUNCE_REQ message from pm(j), it enters the m e r g i n g state. Pm(i) which has entered the r e s o l v i n g state after sending a RENOUNCE message to pm(j) may receive a RENOUNCE  message. 
This message may be  from the primary manager pm(j)t or from a third primary manager, say pm(k). Let us first consider the case of receiving this message from determines  that  pm(j) is the eligible  contender,  then  pm(j). If pm(i)  it replies  with an  ACCEPT_RENOUNCE_REQ message and enters the m e r g i n g state. However, if pm(i) finds that REJECT  pm(j) is not the eligible  contender  then  it replies  with a  REQ message and returns to the a c t i v e state.  RENOUNCE  If the RENOUNCE message was sent by pm(k), pm(i) will detect this by comparing it's contender pid with the pid of the process sending the RENOUNCE message. This may happen if the network partition had divided the group into more than two subgroups which subsequently remerged. In this case pm(i) replies with  a  TRY AGAIN  message  meaning  that  pm(i) is busy  resolving the  leadership with another contender and therefore pm(k) should wait and try again later. merging :  The primary manager pm(i) enters this state from the a c t i v e state or from the r e s o l v i n g state as explained earlier. While in the m e r g i n g state, pm(i) sends a MERGE secondary  request to it's new primary manager who will accept pm(i) as a new manager  in it's group  and exchange  with  it new  group  view  information (same procedure as the case of a new secondary manager joining the group  - see Section 3.1). It is possible that pm(i) may receive a RENOUNCE  t It may appear as if there will be a communication dead lock, but this can be overcome as explained in Section 4 . 6 .  48  while in the m e r g i n g  message I AM  NOT  state. In such  an event, it replies  with a  CONTENDER message which includes the pid of the new primary  manager. The sender of the RENOUNCE message will then try to resolve the leadership with it's new contender. 
Also, it is possible that pm(i) may receive a MERGE request while in the merging  state for which it replies with a I AM  message includes the pid of the new  primary  NOT  PMGR message. This  manager  so that  the merging  manager may request to merge with the new primary manager.  When pm(i) sends a MERGE request to a primary manager which is in the r e s o l v i n g state, the latter replies with a TRY  AGAIN  message and pm(i)  will retransmit the MERGE request after some specified time period. When the new primary manager receives a MERGE request from a manager it accepts the merging the  manager as a new  group  view  secondary manager in it's group, i.e., it transfers  information to the merging  manager, updates  it's manager  member list with the pid of the merging manager and informs all the secondary managers of the group to update their lists as well. 3.6  Chapter  Summary  The group communication mechanism requires some form of coordination to realize  the reliable properties. In the proposed  primary  manager  to  coordinate  the  group  mechanism each management  and  activities. In order to ensure survivability in case of primary  group  has a  communication  manager failure,  the primary manager is replicated in all the member sites and a new primary manager  is selected  from  among  these  secondary  managers. The  managers do not take part in any group management  secondary  activities, but may  take  part in communication activities when ordering is not a requirement. The prober  49  process executing in the primary manager host detects any secondary manager failure and notifies the primary manager. The primary manager must finish up any incomplete message transmission initiated by the failed secondary manager. A vulture process executing in each of the secondary manager hosts detects primary manager failure then  and notifies it's secondary  select a new  Network  partition  address.  
The proposed mechanism ensures that communication within these subgroups will continue to exhibit the reliable properties. When the partitioned networks remerge, the proposed mechanism detects the different subgroups and merges them to form a single group.

Chapter Four
Implementation Details of the Proposed Group Communication Mechanism

This Chapter describes the implementation details of the proposed group communication mechanism and its performance. One has basically two choices in implementing the proposed mechanism: either to implement it as part of the kernel of a distributed system, or to implement it on top of an existing, well tested kernel. Because of time constraints, and since the primary objective is to test the feasibility of the proposed mechanism rather than its performance, we have chosen the second approach. The proposed mechanism is built on top of the V Distributed System running on a cluster of workstations in our Distributed System Research Laboratory.

In implementing the proposed mechanism, three major issues have to be dealt with: group management, group communication, and failure detection and recovery procedures. Group management addresses such issues as group creation and processes joining or leaving a group. Group communication deals with the issue of transferring a message from a source to all the members of the group with the reliable properties discussed in Chapter Two. Failure detection and recovery is essential for the proposed mechanism to provide continuous service despite host failures and network partitioning.

The proposed mechanism is structured as a set of cooperating processes; the primary and secondary managers are examples of such processes.
In addition to these management processes, there are worker processes to help the manager processes achieve the desired reliable properties. The vulture and the prober are two examples of worker processes.

This Chapter describes the implementation details of the above aspects and is divided into the following sections. Section 4.1 describes the implementation details of group management. In order to maintain and manage a group, its manager processes keep some group management information. The manager member list, in which the pids of the primary as well as the secondary managers of the group are maintained, is an important part of the group management information. Section 4.2 describes the organization of the manager member list. In Section 4.3, the implementation details of the group send primitives ogsend() and ugsend() are described. Section 4.4 describes the details of the worker processes. Failure detection and recovery procedures in the event of host failures and network partitioning are explained in Sections 4.5 and 4.6 respectively. A performance evaluation of the group send primitives provided by the proposed mechanism is given in Section 4.7, and Section 4.8 concludes the Chapter.

4.1 Group Management

The proposed group communication mechanism provides facilities to transfer a message from a source to a set of processes called a group. Thus, in addition to communication, the mechanism should provide facilities for processes to create, join, and leave a group. In order to provide these functionalities, the hosts which support the group communication mechanism run a process called the group server. Processes which wish to create or to join a group invoke the group management stub routines, which in turn send appropriate requests to the group server.
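The stub-routine pattern can be sketched generically: each stub packages its arguments into a request and forwards it to the group server. This is an illustration only; "send_request" stands in for the V-kernel IPC actually used by the implementation, and the helper name is hypothetical:

```python
# Sketch of the stub-routine pattern used by the group management calls.

def make_stub(request_name, send_request):
    def stub(*args):
        # Package the operation name and arguments into one request and
        # forward it to the group server via the supplied transport.
        return send_request({"op": request_name, "args": args})
    return stub

# The group management calls would then be built along the lines of:
#   creategroup = make_stub("CREATE_GROUP", send_to_group_server)
#   joingroup   = make_stub("JOIN_GROUP",   send_to_group_server)
```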
The following section describes the creategroup() routine, which is invoked by a process wishing to create a group. Once a group is created, it must be associated with a logical name (i.e., a group name) so that processes will be able to interact with the group using this logical name. Section 4.1.2 describes the details of registering a group with the name service. Sections 4.1.3 and 4.1.4 describe the joingroup() and leavegroup() routines invoked by processes wishing to join or leave a group, respectively.

4.1.1 Creating a Group

A new group is dynamically created when a process invokes the creategroup() routine. This is a stub routine which sends a CREATE_GROUP request along with the invoker's (initial member's) pid to the group server. The group server creates a primary manager for the group and sends this request to it. The primary manager simply adds the initial member's pid to its local group member list. The group server then returns the primary manager's pid to the invoker of the creategroup routine.

4.1.2 Registering a Group

In order to make the primary manager available to the processes wishing to interact with the group, its pid should be associated with a logical id. Since the prototype of our mechanism is built on top of the V Kernel, we make use of the name service facility provided by the underlying system. In the V Kernel, when a process wants to associate its pid with a logical id, it invokes SetPid(logical id, pid, scope). If the specified scope is LOCAL, then the pid is registered locally so that only processes executing in the same host can obtain the pid from the name service. On the other hand, if the specified scope is ANY, then the pid is registered globally so that processes executing in any host in the network will be able to obtain this pid from the name service.
When a process wants to find the pid associated with a logical id, it invokes GetPid(logical id, scope), which returns the pid of the process registered in the name service using the SetPid routine. If the specified scope is LOCAL, then the name service returns the pid of a process registered locally on the invoker's host. However, if the scope is ANY, then the name service first looks for a locally registered process. If one is not found, it broadcasts a request to the other hosts in the network, requesting them to send it the pid associated with the logical id, if there is any.

In order to associate the primary manager of the group with a group name, the creator of the group invokes registergroup(groupname, pmgr-pid, type). If the specified type is LOCAL, then only processes residing on the same host as the primary manager will be able to obtain the primary manager's pid from the name service. However, if the type is GLOBAL, then processes on any host in the network will be able to obtain the primary manager's pid from the name service. Registergroup is a stub routine which sends a REGISTER_GROUP request to the primary manager whose pid is specified by pmgr-pid. The primary manager invokes the SetPid routine to register the group name in the name service. Normally the primary manager will register with type GLOBAL unless the group is meant to be a local group only. As described in the next section, secondary managers also register with the registergroup routine, but the type is always LOCAL.

4.1.3 Joining a Group

Processes wishing to join a group first find the group id associated with the group name and then invoke joingroup(group id). The group id returned by the name service may be the pid of the primary or of a secondary manager for that group, depending on the location of the joining process, as explained below.
Let us consider the case where a member joins the group from a host where neither the primary manager nor a secondary manager resides. In this case, the group id returned by the name service will be the primary manager's pid (assuming that the group has been registered as GLOBAL). After obtaining the group id, the joining member invokes the joingroup routine to join the group. This is a stub routine which sends a JOIN_GROUP request to the group server along with the group id and the joining process' pid. The group server creates a secondary manager for this group in the joining process' host and informs the primary manager (group id) about the new secondary manager. The primary manager then transfers its group management information, which includes the manager member list and the group name, to the new secondary manager. The primary manager then updates its manager member list with the new secondary manager's pid and informs all the other secondary managers of the group to update their lists as well.

When a secondary manager receives the group management information, it associates its pid with the received group name and registers in the name service with LOCAL scope. Thus, later on, when a process residing in the secondary manager's host requests the name service for the pid associated with the group name, it will obtain the secondary manager's pid. After transferring the group management information to the newly created secondary manager, the primary manager notifies the group server. The group server then sends a JOIN_MEMBER request to the secondary manager along with the joining process' pid. The new secondary manager simply adds this pid to its local group member list.

When a member joins the group from a host where a secondary manager for this group already exists, the group id specified in the joingroup routine will be the local secondary manager's pid.
Thus, it is not necessary to create a new secondary manager in the joining process' host. The group server simply sends a JOIN_MEMBER request to the local secondary manager along with the joining process' pid. The local secondary manager adds the new member to its local group member list.

4.1.4 Leaving a Group

A process wishing to leave a group invokes the leavegroup(group id) routine. This is a stub routine which sends a LEAVE_GROUP request to the process specified by the group id. This group id corresponds to the pid of either the primary or a secondary manager for the group, depending on the location of the exiting process, as explained in Section 4.1.3. When the primary or a secondary manager receives the LEAVE_GROUP request along with the exiting process' pid, it simply deletes the member's pid from its local group member list. If the exiting process is the only member in the secondary manager's local group member list, then the secondary manager sends a NO_LOCAL_MEMBERS message to the primary manager, which deletes this secondary manager from its manager member list and informs all other secondary managers of the group to do the same. Finally, the primary manager sends a COMMIT_SUICIDE message to the memberless secondary manager, which deletes its pid from the name service and ceases execution. However, if the leaving process is the only member in the primary manager's local group member list, the primary manager has to make sure that its manager member list is empty before ceasing execution. If the manager member list is not empty, then the primary manager simply deletes the leaving member's pid from its local group member list and continues to function. When the primary manager ceases execution, the group becomes nonexistent.
4.2 Organization of the Manager Member List

The group management information transferred from the primary manager to a newly created secondary manager includes the pids of the primary as well as all the secondary managers. These pids are maintained in a table called the manager identifier table. To find a particular manager's identifier, one can use that manager's manager index to index into the manager identifier table. The manager indices are assigned by the primary manager. These indices are also part of the group management information transferred to a newly created secondary manager.

Initially the primary manager is assigned manager index 0. The first secondary manager to join the group will be assigned manager index 1, the following secondary manager to join will be assigned manager index 2, and so on. Suppose the secondary manager with manager index 1 fails; the primary manager makes manager index 1 invalid and informs the other secondary managers to do the same. Later, manager index 1 may be assigned to a new secondary manager by the primary manager.

[Figure 4.1: the manager member list, consisting of a manager index list and a manager identifier table whose entries are manager pids or "invalid".]

Manager indices are maintained in a manager index list in such a way that the primary manager's index will always be first in the list and the index of the last secondary manager to join will be at the end of the list. This ordering is essential for the selection scheme to choose a primary manager in case the old primary manager fails, as described in Section 3.4.1. The manager identifier table and the manager index list are together called the manager member list, as illustrated in Figure 4.1.

4.3 Group Communication

This section describes the implementation details of the ogsend and ugsend primitives used by the applications to transfer a message from a source to the members of a group.
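The two-level organization described in Section 4.2 (a join-ordered index list over an identifier table with reusable slots) can be sketched as follows. This is a minimal illustration; the class and method names are assumptions, not the thesis code.

```python
INVALID = None

class ManagerMemberList:
    """Manager index list (join order, primary first) plus a manager
    identifier table mapping manager index -> manager pid."""

    def __init__(self, primary_pid):
        self.identifier_table = [primary_pid]  # index 0 is the primary
        self.index_list = [0]                  # ordered by join time

    def add_secondary(self, pid):
        # Reuse an invalidated slot if one exists, else append a new one.
        try:
            idx = self.identifier_table.index(INVALID)
            self.identifier_table[idx] = pid
        except ValueError:
            idx = len(self.identifier_table)
            self.identifier_table.append(pid)
        self.index_list.append(idx)            # newest joiner goes last
        return idx

    def invalidate(self, idx):
        # A failed manager's index becomes invalid and may be reassigned.
        self.identifier_table[idx] = INVALID
        self.index_list.remove(idx)

    def oldest_secondary(self):
        # Succession candidate: earliest surviving secondary in join order.
        return self.identifier_table[self.index_list[1]]
```

Because every manager holds the same join-ordered list, all of them agree on which secondary manager is oldest, which is what the selection scheme of Section 3.4.1 relies on.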
4.3.1 Ugsend Implementation

Processes send UGSEND type messages to the members of the group whose group name is associated with the specified group id by invoking ugsend(msg, group id, msgtype). If the IMMEDIATE REPLY bit is set in msgtype, then the sender may be unblocked by the group communication mechanism before the message is delivered to the members of the group. Otherwise the sender will be unblocked only after the members have received and acknowledged the message. The specified group id may be either the primary or a secondary manager's pid depending on the sender's location. Ugsend is a stub routine which sends a UGSEND MSG request embedded with the message to the specified group id. The primary or the secondary manager which receives this request has two ways to transmit the message to the other managers depending on the underlying network architecture. If it supports only a unicast facility, then the messages are sent on a one-to-one basis to the individual managers. However, if the underlying network also supports a broadcast facility, then the messages can be first multicast to the managers in a datagram fashion and later resent on a one-to-one basis to those who fail to receive the message the first time around. Since the underlying network architecture of our environment supports the broadcast facility, we use the second scheme. Thus, when the primary or the secondary manager receives UGSEND MSG, it first multicasts the message to the rest of the managers. It then waits for a specified period of time to receive acknowledgements from the recipients. If acknowledgements are not received from some managers at the expiration of the time interval, the message is resent to them using one-to-one IPCs.
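The multicast-then-retransmit scheme on the sending side can be sketched as follows. This is a simulation under assumed names, not real network code: the transport callbacks and the acknowledgement check are stand-ins for the datagram multicast, the one-to-one IPC, and the timeout described above.

```python
def group_send(message, managers, multicast, unicast, acked):
    """Send `message` to every manager: one multicast datagram first,
    then one-to-one retransmission to those that did not acknowledge.

    `multicast`/`unicast` are transport callbacks; `acked()` returns the
    set of managers that acknowledged within the timeout period."""
    multicast(message, managers)          # best-effort datagram multicast
    missing = set(managers) - acked()     # who failed to acknowledge?
    for m in sorted(missing):
        unicast(message, m)               # reliable one-to-one resend
    return missing
```

Simulating one lost datagram shows the behaviour: the manager that missed the multicast is the only one contacted again one-to-one.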
On the receiving side, the recipient managers can send acknowledgements back to the sending manager either immediately after receiving the message or only after delivering the message to their local members, depending on whether or not the IMMEDIATE REPLY BIT is set in the opcode specified in the message header.

4.3.2 Ogsend Implementation

Messages sent from different sources by invoking the ogsend routine will be delivered in the same order to all the members of the group. We have seen in Chapter Two that this ordering can be easily achieved by funneling the OGSEND messages through a single process. In the proposed mechanism the primary manager acts as the funnel process. Processes wanting to send OGSEND messages invoke ogsend(msg, group id, msgtype). The specified group id may be the primary or a secondary manager's pid depending on the sender's location. Therefore, the stub routine ogsend has to first find the primary manager's pid and then send the message to it for transmission. Thus, it sends a GET PMGR PID request to the process specified by group id. Since all the managers of the group know the pid of the primary manager, this information is available whatever the specified group id. Once the primary manager's pid is obtained, the stub routine sends the OGSEND message to the primary manager, which transmits this message to all the secondary managers in a fashion similar to that explained for UGSEND transmission.

4.3.3 Detection of Duplicates

In the UGSEND or OGSEND message transmission, if the message is transmitted to the rest of the managers using one-to-one IPC, then the recipients will not receive duplicate messages. However, if the message is multicast first in datagram fashion and later resent on a one-to-one basis, then some recipients may receive duplicate messages, as explained in Section 3.2. This section describes the implementation details of the duplicate detection scheme.
We will describe the scheme from the UGSEND message transmission point of view. A similar technique is used in OGSEND message transmission.

The primary and the secondary managers maintain an integer variable ussno, which is the sequence number of the next UGSEND message. They also maintain receive buffers where the last UGSEND message sent by each of the other managers can be stored. The receive buffer corresponding to a particular manager is indexed by its manager index. Each receive buffer has two fields: a message buffer to store the last UGSEND message and an integer variable, ursno, to keep track of the transaction identifier of the next incoming UGSEND message sent by the manager whose manager index indexes into this receive buffer, as illustrated in Figure 4.2. When the primary or a secondary manager transmits a UGSEND message, the message header contains the sender's manager index, ussno, and the pid of the primary manager for that group. When an UGSEND message is received by a manager, it first checks the primary manager pid specified in the message header against its primary manager's pid. If it is different, then this message must have been transmitted from a manager

[Figure 4.2: Receive buffers, one per manager index, each holding an ursno and a message buffer; an additional buffer indexed by OGSEND INDEX holds an orsno and a message buffer.]

belonging to another group having the same multicast address. This may have happened as a result of the underlying network partitioning and remerging. However, if the message transmission was initiated from within the same group, the receiving manager compares the ussno specified in the message header against the ursno in the receive buffer indexed by the sending manager's manager index. If this is the expected message it will be accepted; else it will be discarded. Once the receiving manager accepts the message, it can reply to the sending manager immediately if the IMMEDIATE REPLY_BIT is set in the opcode specified in the message header. Otherwise it replies only after the message has been delivered to and acknowledged by its local members.
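The acceptance test at the heart of this duplicate detection scheme can be sketched as follows. This is illustrative only; a real receive buffer would also carry out the primary manager pid check described above and be selected by the sender's manager index.

```python
class ReceiveBuffer:
    """Per-sender receive state used to detect duplicate UGSEND messages:
    a message is accepted only when its sequence number matches the
    expected ursno; retransmitted copies are discarded."""

    def __init__(self):
        self.ursno = 0        # sequence number of the next expected message
        self.last_msg = None  # last accepted UGSEND message

    def accept(self, ussno, msg):
        if ussno != self.ursno:
            return False      # duplicate (or stale) copy - discard it
        self.last_msg = msg   # keep the last message for failure recovery
        self.ursno += 1
        return True
```

Keeping the last accepted message around is what later lets the primary manager collect copies during failure recovery (Section 4.5.1).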
4.4 Worker Processes

The primary and secondary managers normally employ worker processes to carry out some of their activities pertaining to group management and group communication. This improves concurrency and thus the performance of the mechanism. The couriers, probers and vultures described in Section 3.3.1 are some examples of such worker processes. In addition to these processes, the primary managers employ resolver processes to resolve the leadership in situations such as when a partitioned network remerges, and the secondary managers employ aide processes to carry out the message transmission activities needed to choose a new primary manager in case of primary manager failure. This section describes the implementation details of these worker processes. The worker processes share the same address space as their managers and thus have read access to the group management information maintained by their managers.

4.4.1 Courier

Couriers are responsible for carrying out the UGSEND and OGSEND message transmission activities on behalf of their managers. A high level description of the courier process is illustrated in Figure 4.3. Couriers are created by the primary and secondary managers when they are initialized.
Type : courier
Task : sending a group message
FOREVER DO
Begin
    ReceiveSpecific (from primary manager)
    Send (to all managers)   { group send if broadcast available }
    Reply (to primary manager)
End

Figure 4.3 Courier process

Since the primary as well as the secondary managers can take part in the UGSEND activity, each manager has a UGSEND courier. Also, both primary and secondary managers have a local courier to help them deliver the messages to their local members. In addition to the UGSEND activity, the primary manager is responsible for transmitting OGSEND messages and group management messages. Thus, the primary manager has an additional OGSEND courier to assist in transmitting these types of messages.

When the primary or a secondary manager receives a message for transmission, it first checks if the courier appropriate for handling the transmission is free. If the courier's status indicates that it is FREE, then the message is handed over to it for transmission to all the managers of the group. Once the message has been handed over to the courier, its status is set to BUSY. After the message has been delivered to and acknowledged by all the operational managers for the group, the courier notifies its manager, which then sets the courier's status to FREE. If the primary or secondary manager receives a message for transmission while the appropriate courier is busy handling the previous message, then the new message is queued first-in-first-out in the courier's message queue. After it completes the message transmission, the courier picks up the next message from its message queue if it is nonempty.

4.4.2 Prober

The prober process is created by the primary manager to probe the secondary managers of the group to determine if they are still operational.
The prober periodically (every 30 seconds in our implementation) sends a probe message ARE YOU ALIVE to all the secondary managers in the manager member list using one-to-one IPC. The operational secondary managers reply with an I AM ALIVE message to this probe. If a secondary manager has failed, then the underlying system will notify the prober that it is trying to send a message to a nonexistent process. When the prober learns about this failure, it sends a SMGR FAILED message to the primary manager along with the failed secondary manager's pid. In addition to this probing, the prober is also responsible for multicasting a RESOLVE message periodically (every two minutes in our implementation). This message is necessary to detect whether there are any other groups listening on the same multicast address, which may happen if the underlying network partitions and remerges, as explained in Section 3.5.

4.4.3 Vulture

Each secondary manager for a group employs a vulture to detect the failure of the primary manager for that group. A high level description of the vulture process is shown in Figure 3.2. The vulture process makes use of the ReceiveSpecific primitive provided by the underlying V Kernel to detect the failure of the primary manager. Basically the vulture is simply receive blocked on the primary manager. It will be unblocked only if the primary manager sends a message to it or if the primary manager fails. In the latter case, the underlying system informs the vulture that it is trying to receive a message from a nonexistent process. When the vulture learns about the primary manager's failure, it sends a PMGR FAILED message to its secondary manager, which then takes part in the selection of a new primary manager. After a new primary manager has been selected, the secondary managers inform their vultures to look for the failure of the newly selected primary manager.
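The FREE/BUSY hand-over and first-in-first-out queueing described for couriers in Section 4.4.1 can be sketched as follows. The names are hypothetical, and the real couriers run as separate V processes communicating by Send/Receive/Reply rather than by method calls.

```python
from collections import deque

class Courier:
    """Worker that transmits one group message at a time; messages that
    arrive while it is BUSY are queued first-in-first-out."""

    def __init__(self, transmit):
        self.transmit = transmit   # callback performing the group send
        self.busy = False
        self.queue = deque()

    def hand_over(self, msg):
        # Manager hands a message to the courier for transmission.
        if self.busy:
            self.queue.append(msg)
        else:
            self.busy = True
            self.transmit(msg)

    def done(self):
        # All operational managers have acknowledged the last message:
        # pick up the next queued message, or go FREE if there is none.
        if self.queue:
            self.transmit(self.queue.popleft())
        else:
            self.busy = False
```

The manager only ever interacts with the courier through hand-over and completion notifications, which is what lets it keep servicing other requests while a transmission is in progress.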
4.5 Failure Detection and Recovery

This section describes the implementation details of the failure detection and recovery procedures in case of secondary manager failure (4.5.1) and primary manager failure (4.5.2).

4.5.1 Secondary Manager Failure

We have described in Section 4.4.2 how the prober detects the failure of a secondary manager. If a secondary manager fails in the middle of a UGSEND message transmission, some members may not receive the message. Thus, when the prober informs the primary manager that a particular secondary manager has failed, the primary manager sends a SEND LAST MSG request to all the other operational secondary managers. This message contains the failed secondary manager's index. The operational secondary managers which receive this request send the last UGSEND message received from the failed secondary manager. This message is stored in the receive buffer of each operational secondary manager indexed by the failed secondary's manager index. If the returned messages as well as the last UGSEND message received by the primary manager from the failed secondary manager all have the same ussno, then the failed secondary manager has either successfully completed its last UGSEND message transmission activity or no member has received its last UGSEND message. Either of these outcomes assures atomicity. However, if there is a discrepancy among the ussnos, then the primary manager takes the message with the highest ussno and retransmits it to the secondary managers. Those secondary managers that have already received this message simply discard the duplicates, as their ursnos will not match the ussno of the retransmitted message. However, those that have not received the message from the failed secondary manager receive this message and deliver it to their local members.
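The decision rule the primary manager applies to the returned copies can be sketched as follows. This is an illustration under the assumption that each returned copy is a (ussno, message) pair; the function name is invented here.

```python
def recover_ugsend(last_copies):
    """Given the last UGSEND copy (ussno, msg) that each operational
    manager holds for the failed manager, decide whether a
    retransmission is needed to restore atomicity."""
    ussnos = {u for u, _ in last_copies}
    if len(ussnos) <= 1:
        # All copies agree: the last transmission either completed
        # everywhere or reached no one. Nothing to finish.
        return None
    # Discrepancy: retransmit the copy with the highest ussno. Managers
    # that already hold it discard the duplicate via their ursno check.
    return max(last_copies, key=lambda c: c[0])
```

Note how this recovery step and the duplicate detection of Section 4.3.3 fit together: retransmission is always safe precisely because duplicates are filtered by sequence number at every receiver.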
After finishing the incomplete UGSEND message transmission, the primary manager deletes the failed secondary manager's pid from its manager identifier table and invalidates that secondary's manager index. This information is also sent to all the operational secondary managers for them to update their lists.

4.5.2 Primary Manager Failure

A group cannot function without a primary manager. Thus, when the primary manager of a group fails, the secondary managers for that group must detect this failure and select a new primary manager from among themselves. We have described in Section 4.4.3 how the secondary managers detect the failure of the primary manager through their vultures, and in Section 3.5 the selection scheme used to choose the new primary manager. In order to avoid possible communication deadlocks, each secondary manager participating in the election creates an aide process which takes part in the message transmission activities (pertaining to the election) on behalf of its secondary manager. These aide processes destroy themselves once their task is complete. Once a new primary manager is chosen, it must finish any incomplete transmission of OGSEND messages, or of messages initiated by the failed primary manager to inform the secondary managers of changes in the group view, such as a secondary manager having joined the group or having failed.

When the secondary managers receive these messages, they store them in a fixed receive buffer indexed by a value called OGSEND INDEX, common to all managers. This index does not change with primary managers. Thus, when a new primary manager is chosen, it sends the SEND LAST MSG request to all the operational secondary managers. This request contains the fixed OGSEND INDEX instead of a manager index. The operational secondary managers then return the last message stored in their receive buffers indexed by OGSEND INDEX.
If the returned messages as well as the last message received by the new primary manager have the same ossnos, then this message has either been successfully delivered to all the secondary managers or to none of them. However, if there is a discrepancy, then the primary manager takes the message with the highest ossno and resends it to the secondary managers in a fashion similar to that explained for secondary manager failure. Once this is completed, the new primary manager reregisters its pid with the name service with type GLOBAL so that processes executing on other hosts will be able to obtain its pid from the name service. The new primary manager then creates a prober and an OGSEND courier process, sets its ossno (to be assigned to the next OGSEND message) to the value of the last ossno plus one, and resumes normal operation.

4.6 Network Partition

We have seen in Section 3.4 how network partition creates subgroups with the same multicast address. This causes problems when the partitioned network remerges. We have seen in Section 3.5.1 how the primary managers detect that there is more than one primary manager listening on the same multicast address, and in Section 3.5.2 the leadership resolution scheme used to resolve the leadership. In order to avoid possible communication deadlocks, each primary manager participating in the leadership resolution creates a resolver process which takes part in the message transmission activities (pertaining to leadership resolution) on behalf of its primary manager.

The primary manager which gives up its leadership changes its type to SECONDARY and informs the secondary managers in its manager member list about their new primary manager. Each of the secondary managers then changes its state to MERGING and sends a MERGE request to the new primary manager.
When a primary manager receives a MERGE request from a secondary manager, it transfers the group view information to the merging manager and adds its pid to its manager member list. Also, it informs the secondary managers that are already in its manager member list to update their lists as well. Once a secondary manager has merged, it changes its state to ACTIVE and resumes normal operation.

4.7 Performance of the Group Send Primitives

We have made some preliminary measurements of the elapsed time for the group send primitives ogsend and ugsend. Elapsed time is the length of time during which a sender remains blocked after invoking an ugsend or ogsend routine. Elapsed time for these primitives depends on a number of factors, including the underlying system's workload, the number of secondary managers for the group, and whether the IMMEDIATE REPLY bit (explained in Section 4.3.1) is set in msgtype. The elapsed time is also dependent on the speed of the processor and the type of network interface. For our measurements we used four 16 MHz 68020 based SUN workstations, each connected to a 10 Mb/s Ethernet interface with 32 receive buffers.

The measurements were made by performing OGSEND and UGSEND message transmissions N times and dividing the total elapsed time by N to obtain a reasonably accurate estimate for a single operation. Table 4.1 gives the elapsed time for the UGSEND and OGSEND message transmissions as a function of the number of remote group members. In this case the process which invokes the ogsend or ugsend primitives resides in the host where the primary manager of the group executes. Table 4.2 is similar to Table 4.1, except that the process which invokes the primitives resides in a host where a secondary manager for that group executes.
In both cases the receiving managers acknowledge a message only after the message has been delivered to and acknowledged by their local members (i.e., the IMMEDIATE REPLY bit is off). N is chosen to be 30,000 for both measurements.

Table 4.1  Elapsed time (milliseconds) for ugsend and ogsend.
           Sending process in the same host as the primary manager.

    No. of members    ogsend    ugsend
          1             9.5       8.8
          2            19.2      19.9
          3            21.2      21.8
          4            23.7      24.0

Table 4.2  Elapsed time (milliseconds) for ugsend and ogsend.
           Sending process in the same host as a secondary manager.

    No. of members    ogsend    ugsend
          1              -         -
          2            19.2      21.2
          3            21.7      22.4
          4            23.2      25.1

The first observation from these figures is that the elapsed time for both primitives doubles when a remote member is added to the group. However, the increase in the elapsed time for additional remote members is not very significant. This behaviour is understandable, because the underlying network is a broadcast network and the time to transmit a group message to one remote site or multiple remote sites is the same, assuming that the probability of a packet loss is negligible. This may be a valid assumption since the network interface has 32 receive buffers, which considerably reduces the chances of losing a packet.

The second observation is that the elapsed time for the UGSEND message transmission is less than that for the OGSEND transmission. However, this difference is not very significant when the process which invokes these primitives resides in the same host as the primary manager. The reason is that the UGSEND message transmission is carried out by either the primary or a secondary manager for the group executing locally, but the OGSEND message is sent to the primary manager, which may be executing in a host different from that of the sending process.
4.8 Chapter Summary

The proposed mechanism is structured as a set of cooperating processes. Each host runs a group server process. Processes wishing to create or to join a group invoke the appropriate group management stub routines, which in turn send appropriate requests to the group server. When a group is created, a primary manager for that group is created in the initial member's host. When a process joins the group from a host where neither the primary nor a secondary manager for this group executes, a secondary manager for this group is created in that host. Both the primary as well as the secondary managers maintain group management information necessary to coordinate group management and group communication activities. When a message is sent to a group, the proposed mechanism makes sure that the message is delivered to all the managers of the group, each of which will then deliver the message to the local members of the group. In order to improve the concurrency of the manager processes, each of them employs some worker processes such as couriers, probers and vultures. Any incomplete message transmission resulting from a primary or secondary manager failure will be detected and completed by the proposed mechanism. Network partition results in subgroups. Communication within these subgroups will be ordered and atomic. When the partitioned network remerges, the mechanism detects this and merges the subgroups to form a single group. Measurements of the elapsed time for ogsend and ugsend indicate that the ordered group send has some overhead compared to the unordered group send.

Chapter Five
Conclusions

This work describes the design and implementation details of a reliable group communication mechanism. A group communication mechanism is reliable if it has the two aspects of reliability: full delivery and correctness.
In order to ensure full delivery, the sender must know the identities of the members of the group. If the underlying network supports multicast or broadcast, then the message can be first multicast in a datagram fashion to the recipients and then retransmitted on a one-to-one basis to those that failed to receive the message the first time around. However, if the underlying network supports only unicast, then the message can be sent on a one-to-one basis to the recipients. Issues related to correctness are atomicity, order and survivability. Atomicity ensures that every message sent to a group will be delivered to all operational members or to none of them. Order guarantees that messages will be delivered in the same sequence to all the operational members. Survivability assures continuous operation despite host or network failures.

Full delivery does not necessarily guarantee correctness. Partial delivery may occur if the sender fails in the middle of a transmission. Also, if the group membership is dynamic, it is difficult for the sending process to maintain up-to-date membership information. Furthermore, in a system with multiple senders and multiple receivers, a message sent from one sender may arrive at a destination before the arrival of a message from another sender; however, this order may be reversed at another destination. This violates the order property.

In order to provide the reliable properties transparently to the application processes, some form of coordination is necessary in the underlying group communication mechanism. If the underlying mechanism provides a process which
knows the identities of the group members, the messages from multiple senders can be funneled through this process, thus ensuring full delivery as well as order. In order to ensure atomicity and survivability in case of the failure of the funnel process, this process may be replicated at different sites. In the proposed mechanism, each group has a primary manager (funnel process) which is replicated at all member sites (secondary managers). If the primary manager fails, a new primary manager is selected from among the secondary managers. The new primary manager will finish any incomplete message transmission initiated by the failed primary manager. This guarantees atomicity and survivability. The ordered message transmission is called OGSEND.

Both primary and secondary managers maintain manager member lists which contain the pids of all the managers. The manager member list is updated only by the primary manager. Whenever a secondary manager joins or leaves the group, the primary manager is notified. The primary manager updates its manager member list and informs the secondary managers to update their lists as well. Each manager (primary as well as secondary) also maintains a local group member list which contains the pids of the members of the group local to its respective host. Each manager is responsible for updating its local group member list. When the primary manager receives a message, it sends the message to all the secondary managers, each of which in turn delivers the message to its local members. This requires less space, less network traffic and reduced code complexity compared with replicating the entire membership information in the primary and all the secondary managers.

Some applications do not require an ordered delivery but only atomic delivery. The unordered message transmission is called UGSEND. These messages need not be funneled through the primary manager. Since all the secondary
Since all the secondary  73 managers know the pids of all the managers of the group, a UGSEND message is first sent to a secondary process' host which  manager  for the group  residing in the sending  then transmits this message to all other managers. If a  secondary manager for the group is not executing in the sending process' host then the message is sent to the primary manager which then transmits it to the  rest of the managers. If a secondary manager fails, the primary manager  detects this and finishes any incomplete UGSEND message transmission initiated by  the failed  secondary  manager.  Also,  the primary  manager  deletes  this  secondary manager's pid from it's manager member list and informs the other operational secondary managers to update their lists as well. As mentioned  earlier, when the primary manager fails, a new  manager is selected from use  among the secondary managers. Secondary  a succession list selection scheme to select the new  primary managers  primary manager. In  this scheme the oldest secondary manager forces the younger ones into accepting it  as the new  primary  manager. This  can be easily  done in the proposed  mechanism since all the secondary managers have the same manager memberlist ordered by the time they joined the group. Network partition creates subgroups with the same multicast address. This causes problems  when the network remerges. However, the proposed mechanism  detects this and merges these subgroups together to form a single group. Some performance  measurements were made on the group send primitives  ogsend and ugsend provided by the proposed group communication  mechanism.  From the measurements, one observes that the ordered group send takes more time to complete than the unordered group send when the sending process does not  reside in the same host as the primary manager. 
This overhead is due to the fact that the message has to be funneled through the primary manager residing in a different host, whereas UGSEND messages are sent to the local secondary manager, which then transmits them to the rest of the managers of the group.

The mechanism has been implemented on the V Kernel running on four SUN-3/50 workstations interconnected by an Ethernet. The system works as expected and some performance data have been reported in the thesis.

Bibliography

[1]. K.P.Birman and T.A.Joseph, Reliable Communication in the Presence of Failures. ACM Transactions on Computer Systems, Volume 5, Number 1, February 1987.
[2]. J.M.Chang and N.F.Maxemchuk, Reliable Broadcast Protocols. ACM Transactions on Computer Systems, Volume 3, Number 1, February 1985.
[3]. J.M.Chang and N.F.Maxemchuk, A Broadcast Protocol for Broadcast Networks. Proceedings of GLOBECOM, December 1983.
[4]. S.T.Chanson and K.Ravindran, A Distributed Kernel Model for Reliable Group Communication. Proceedings of the IEEE-CS Symposium on Realtime Systems, New Orleans, December 1986.
[5]. D.R.Cheriton, The Thoth System: Multi-Process Structuring and Portability. American Elsevier, NY, 1982.
[6]. D.R.Cheriton, The V Kernel: A Software Base for Distributed Systems. IEEE Software, Volume 1, Number 2, April 1984.
[7]. D.R.Cheriton and W.Zwaenepoel, Distributed Process Groups in the V Kernel. ACM Transactions on Computer Systems, Volume 3, Number 2, May 1985.
[8]. F.Cristian, H.Aghili, R.Strong and D.Dolev, Atomic Broadcast: From Simple Message Diffusion to Byzantine Agreement. Technical Report RJ 4540(48668), IBM, October 1984.
[9]. A.Frank, L.D.Wittie and A.J.Bernstein, Group Communication on Netcomputers. Proceedings of the 4th International Conference on Distributed Computing Systems, San Francisco, CA, May 1984.
[10]. H.Garcia-Molina, Elections in a Distributed Computing System.
IEEE Transactions on Computers, Volume C-31, January 1982.

[11]. H.Garcia-Molina and A.K.Abbott, Reliable Distributed Database Management. Technical Report CS-TR-047-86, Department of Computer Science, Princeton University, August 1986.

[12]. R.Gusella and S.Zatti, An Election Algorithm for a Distributed Clock Synchronization Program. Proceedings of the 6th IEEE-CS International Conference on Distributed Computing Systems, Cambridge, MA, May 1986.

[13]. L.Lamport, Time, Clocks and the Ordering of Events in a Distributed System. Communications of the ACM, Volume 21, Number 7, July 1978.

[14]. K.Ravindran and S.T.Chanson, Process Alias Based Structuring Techniques for Distributed Computing Systems. Proceedings of the 6th IEEE-CS International Conference on Distributed Computing Systems, Cambridge, MA, May 1986.

[15]. F.Schneider, D.Gries and R.Schlichting, Reliable Broadcast Protocols. Science of Computer Programming, Volume 3, Number 2, March 1984.

[16]. D.Skeen, Determining the Last Process to Fail. ACM Transactions on Computer Systems, Volume 3, Number 1, February 1985.

Appendix A

IPC Primitives of V Kernel

This appendix describes the IPC operations provided by the V Kernel, which is used as the underlying communication mechanism to build the proposed group communication mechanism described in the thesis. The V Kernel evolved from two previous systems, Thoth [5] and Verex, and is referred to as a distributed kernel system because its facilities are available uniformly and transparently across multiple machines connected by a local network. Thus, for the most part, it successfully provides the appearance of a single machine, masking the existence of multiple machines. A connected set of machines that provides a single program environment, communication interface, and name space is called a V Domain [6], as depicted in Figure A.1.
Figure A.1  V domain of local network-connected machines

The major facilities provided by the V Kernel are processes and inter-process communication (IPC) primitives. In such an environment, services are offered by server processes, while client processes communicate with servers to negotiate and receive services using the IPC primitives. The sender and receiver of an IPC activity are specified by their process identifiers
T h e sender's  transmits  a  SEND  inter-kernel  packet  w i t h the  message embedded  Mocked  TIME  message transaction  Figure A.2  Send - receive - reply  message transaction  to  the  79 receiver's kernel. The but  the  one  underlying hardware assures that all the network interfaces  at the  packet is received by  desired destination discard the  message. When  a  SEND  the receiver's kernel, it first checks for the existence of  the receiver. If the receiver does not exist, the receiver's kernel replies to the sender's kernel with a NON_EXISTENT_PROCESS message. The  sender's kernel  then unblocks the sender and informs it of the outcome. However, if the receiver is alive, then the receiver's kernel queues the message first-in-first-out in the receiver's message queue.  If the  receiver does not  reply to the  sender  within  a  specified  time  period, the sender's kernel retransmits the SEND packet to the receiver's kernel. If the receiver's kernel finds that there is already queued up  an  identical SEND packet  in the receiver's message queue, it discards the incoming packet and  replies with a BREATH  OF  LIFE message to the sender's kernel which resets  it's retransmission count. If after a number of retransmissions the sender's kernel does not receive any receiver's  site has  response from the receiver's kernel, it assumes that the  failed  or the  therefore unblocks the sender and Unfortunately network  the  partition  network has  this  sender's kernel  informs it that the Send operation has failed.  sender's kernel cannot unless  partitioned. The  distinguish between  information  is  available  in  site the  failure  and  underlying  communication medium such as in some ring-type networks. Figure A.3  depicts  the Send operation in various scenarios. By receiver delivered  invoking the Send primitive, one can be assured that as long as the is alive to  and  the  network  it's destination. 
Also  acknowledgement  and  retransmission  process, host or network failure.  is not it allows for  partitioned, the the  reliable  sender  message to  delivery or  will  be  exploit positive determination  of  80  RECEIVER  SENDER S«nd invoked Transmit SEND Packet  SEND Packet queued  Timeout, Retransmit SEND Packet  Previous SEND Packet already in Receiver's queue. Reply with BREATHjDFUFE  Reset timer & rexmit count  Receive invoked  Reply invoked Time out Retransmit  Transmit REPLY Packet and save a copy of it  Reply received  Already replied, retransmit the saved copy  Unblock Sender  Discard REPLY Packet  Discard the saved copy  Figure A.3  A.2  Receive  The  V  Send operation in V  Operation  Kernel  provides  two  Receive  and R e c e i v e S p e c i f i c . W h e n  32  message a t  byte  from  any  receiver  process until  a  ReceiveSpecific(msg, whose  the in  the  pid)  suspends  a  message the  invoker.  is  It  at  The  arrives  from  used  to  a first  by  some a  32  process there is  may  kernel  sender. byte  simply the  message invokes the a  messages; to  receive  On  (sender), already  receiving  Receive(msg)  m s g , it  a process (receiver)  if  for  invokes  receiver's  obtain  particular checks  primitives  process  pointed  pid. W h e n  from  a  domain.  message  id is specified b y  receive  location  different  obtain  a  message  suspends other  from  a  a  the hand,  process  ReceiveSpecific to receiver's  SEND  packet  kernel queued  81  up  in the receiver's message queue from the specified sender. If there is none,  the receiver's kernel sends a RECEIVE inter-kernel packet to the sender's kernel. When the RECEIVE packet is received by the sender's kernel, it checks whether the  sender  exists or not. If the sender  NON_EXISTENT_PROCESS  message  does  not exist, it replies  to the receiver's kernel  which  with  a  in turn  unblocks the receiver and informs it of the outcome. 
On the other hand, if the sender exists, its kernel replies with a BREATH_OF_LIFE message. On receiving this message, the receiver's kernel resets its retransmission count. If the receiver's kernel does not receive a SEND packet from the sender within a specified time period, it repeats the ReceiveSpecific procedure. In case the sender's site fails or the network partitions, the BREATH_OF_LIFE message from the sender's kernel will not be received by the receiver's kernel. Therefore, after a maximum number of retransmissions, the receiver's kernel unblocks the receiver and reports that the ReceiveSpecific operation has failed. As in the Send operation, the receiver's kernel cannot distinguish between a network partition and a host failure. Figure A.4 illustrates the ReceiveSpecific operation.

By invoking the ReceiveSpecific primitive, a process can detect the failure of another process or of the site on which that process resides. Thus, using the ReceiveSpecific primitive, the vulture scheme described in Section 3.3.2 can be easily implemented.

A.3  Reply Operation

A process may only reply to a process from which it has received a SEND packet. This is necessary to maintain the tight synchronization of the Send-Receive-Reply activity.
Figure A.4  ReceiveSpecific operation in V

When a process (replier) invokes Reply(msg, pid) to reply with a 32 byte message, pointed at by msg, to the process specified by pid, its kernel first checks whether the replier has already received a SEND packet from this process. If the replier did not receive a SEND packet from the specified process, the Reply operation fails. Otherwise, a REPLY inter-kernel packet is sent to the specified process' kernel. We have already seen, under the Send operation, how the sender's kernel keeps retransmitting the SEND packet until a REPLY packet is received or a timeout occurs, whichever comes first. After the REPLY packet has been sent, a copy of it is kept in the replier's kernel for some time. The replier's kernel retransmits this saved copy in response to retransmitted SEND packets; the copy is discarded after a specified time period. The Reply operation is illustrated in Figure A.5.

Figure A.5  Reply operation in V
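The retransmission and duplicate-suppression behaviour described in this appendix (the sender's kernel retransmitting the SEND packet until a REPLY arrives, the receiver's kernel answering a duplicate SEND with BREATH_OF_LIFE, and the replier's kernel answering late retransmissions from a saved REPLY copy) can be sketched as follows. This is an illustrative Python model, not the V Kernel's implementation; the class and method names are invented for the sketch.

```python
# Illustrative model (invented names): the receiver-side kernel state
# for one Send-Receive-Reply transaction. A saved copy of the REPLY
# lets the kernel answer retransmitted SEND packets without delivering
# the message to the application a second time.

class ReceiverKernel:
    def __init__(self):
        self.queue = []        # queued SEND packets awaiting Receive
        self.saved = {}        # (sender, seq) -> saved REPLY copy
        self.delivered = []    # messages handed to the application

    def on_send_packet(self, sender, seq, msg):
        """Handle an incoming (possibly retransmitted) SEND packet."""
        key = (sender, seq)
        if key in self.saved:
            # already replied: retransmit the saved copy
            return ("REPLY", self.saved[key])
        if any((s, q) == key for s, q, _ in self.queue):
            # identical SEND already queued: tell the sender's kernel
            # to reset its retransmission count
            return ("BREATH_OF_LIFE", None)
        self.queue.append((sender, seq, msg))   # first arrival: queue it
        return (None, None)

    def receive_and_reply(self, make_reply):
        """The application performs a Receive followed by a Reply."""
        sender, seq, msg = self.queue.pop(0)
        self.delivered.append(msg)
        reply = make_reply(msg)
        self.saved[(sender, seq)] = reply       # keep a copy for duplicates
        return reply
```

In this model, a retransmitted SEND that arrives before the Reply draws a BREATH_OF_LIFE, while one that arrives after it draws the saved REPLY, so the message reaches the application exactly once.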
