UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Group communication in distributed system T-Shoshin See, Helen Lim 1988

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


831-UBC_1988_A6_7 S43.pdf [ 3.31MB ]
JSON: 831-1.0051942.json
JSON-LD: 831-1.0051942-ld.json
RDF/XML (Pretty): 831-1.0051942-rdf.xml
RDF/JSON: 831-1.0051942-rdf.json
Turtle: 831-1.0051942-turtle.txt
N-Triples: 831-1.0051942-rdf-ntriples.txt
Original Record: 831-1.0051942-source.json
Full Text

Full Text

G R O U P C O M M U N I C A T I O N I N D I S T R I B U T E D S Y S T E M T-SHOSHIN By H E L E N LLM SEE B . S c , University of British Columbia, 1984 A THESIS S U B M I T T E D I N P A R T I A L F U L F I L L M E N T O F THE REQUIREMENTS F O R THE D E G R E E OF M A S T E R OF SCIENCE in T H E F A C U L T Y O F G R A D U A T E STUDIES ( D E P A R T M E N T OF C O M P U T E R SCIENCE)  We accept this thesis as conforming to the required standard  T H E UNIVERSITY OF BRITISH C O L U M B I A April 1988 © Helen L . See, 1988  In  presenting  degree freely  at  this  the  available  copying  of  department publication  of  in  partial  fulfilment  University  of  British  Columbia,  for  this or  thesis  reference  thesis by  this  for  his thesis  and  study.  scholarly  or  her  for  purposes  gain  COMPUTE  The University of British 1956 Main Mall Vancouver, Canada V6T 1Y3 Date  DE-6G/81)  APRIL  '2 ,  SCl£S)C£  Columbia  W  requirements  shall  that  agree  may  representatives.  financial  the  I agree  I further  permission.  Department of  of  be  It not  is  that  the  Library  permission  granted  by  understood be  for  allowed  an  advanced  shall for  the that  without  make  it  extensive  head  of  my  copying  or  my  written  A b s t r a c t  T-Shoshiri is a distributed operating system for developing and testing distributed software. It provides a flexible and extensive one-to-one (1-1) interprocess communication (IPC) facility. However a number of applications, such as notification and query, require the use of group communication in which a process sends a message simultaneously to a group of processes. This thesis discusses how the 1-1 I P C facility on T-Shoshin was extended to provide a group interprocess communication service. The notion of groups will first be described, focusing on how they are referenced, formed and maintained. Finally a semantic model for group communication is presented, that is simple and general enough to accommodate most forms of applications that require group communication. Then an implementation model is discussed to show how the semantic model was implemented, followed by a performance evaluation of the implementation.  ii  Contents  Abstract  ii  List of Tables  v  L i s t of Figures  vi  Acknowledgement 1  2  3  4  vii  Introduction  1  1.1 1.2  1 6  Background and Motivations Thesis Outline  Process G r o u p s  8  2.1 2.2  8 9  Process Group Definition Group Management Operations  Semantic M o d e l of G r o u p C o m m u n i c a t i o n  13  3.1 3.2 3.3  13 16 21 21 23  Overview of Model Group Communication Operations Semantics for Failure and Abnormal Conditions 3.3.1 Semantics for Failure 3.3.2 Abnormal Conditions  Implementation  29  4.1 4.2  29 33 33 34 36  4.3  Ethernet Interface Group Management 4.2.1 Group Identifiers 4.2.2 Group Membership Group I P C Operations  iii  4.3.1 4.3.2 4.3.3 4.3.4 4.3.5 4.3.6 5  6  Group Communications Manager, Ethernet Manager and Communications Manager Group Send Group Receive Group Reply Termination of a Group Communication Certain Implementation Issues  36 40 41 43 44 46  Results a n d A n a l y s i s  47  5.1 5.2 5.3  48 49 58  Measurement Method and Environment Performance Measurements Analysis of Implementation  Concluding Remarks  63  6.1 6.2  63 65  Thesis Summary Future Work  Bibliography  66  iv  List of Tables 3.1  Set of Abort Status and Their Interpretation  22  5.1 5.2 5.3 5.4 5.5  Elapsed Time for 1-1 < b s e n d , b r e c a n y > pair Elapsed Time for Group bsend - A l l Remote Members Elapsed Time for Group bsend - A l l Local Members . . Elapsed Time for Group bsend - Empty Group Group request with One Remote Member  50 51 52 55 56  v  List of Figures 1.1  Network Configuration of T-Shoshin  2  2.1  T-Shoshin Group Ids  9  3.1 3.2  Group Inter-Process Communication Model How Abnormal Conditions are Handled under Various Stages of a Group Communication  14  4.1  Access to the Ethernet Resource by various Communications Managers  38  5.1  I P C Timing Method  62  vi  28  Acknowledgement  I would like to thank my supervisor, Dr. Son Vuong, for his patience, guidance and advice on this thesis and Dr. Sam Chanson for reading the final draft. Special thanks to a number of people whom I have consulted on more than a number of occasions. They are Barry Brachman, Audivox M a , Rick Sample, and Frank Pronk. Thanks also to many of my fellow graduate students with whom I have had a very pleasant learning experience together.  vii  Chapter  1  Introduction  1.1  Background and Motivations The T-Shoshin distributed operating system was designed for developing, experi-  menting, and testing distributed software. It was originally developed at the University of British Columbia and is currently running on a collection of Sun Workstations in1  terconnected by a 10 Mbps Ethernet  2  [ACTON85,WANG86] [Fig. 1.1]. T-Shoshin  is an enhanced version of Shoshin[TOKU83a,TOKU84] which was developed at the University of Waterloo. T-Shoshin supports process abstraction whereby a process is a logical component of a distributed program possessing its own code segment, data segment, and stack space and executes independently of other processes.  T-Shoshin also supports one-to-one  (1-1) direct communication among these processes through a set of flexible interpro1  SUN Workstation is a trademark of Sun Microsystems Inc.  Ethernet is a trademark of Xerox Corporation  2  1  CHAPTER  1.  SUN  VAX  10  VAX  2  INTRODUCTION  750  Mbps  780  SUN  Ethernet  SUN  Figure 1.1: Network Configuration of T-Shoshin cess communication (IPC) primitives[ACTON86j. By using the 1-1 I P C operations, any pair of processes can synchronize or communicate with each another via message passing. The only requirement is that they know each other's process ids in advance. The process abstraction and 1-1 I P C were all part of Tokuda's original design for Shoshin in building an environment for distributed computation[TOKU84]. Tokuda's definition of a distributed program was that of "a group of relatively small processes that often exhibit a dynamic pattern of process cooperation, including dynamic creation and destruction of processes". Moreover, "cooperation among processes occurs without the use of shared variables" and "all processes are executed in mutually disjoint address spaces and communicate using message passing without any shared variables". In other words he intended to make the I P C facility to be the main if not the only  CHAPTER  1.  INTRODUCTION  3  means by which processes would pass information. The I P C facility of T-shoshin is one of its basic tools for achieving well-coordinated distributed computation. However, when a group of processes synchronize and communicate through I P C , a large amount of overhead is incurred through context switching and byte copying from one address space to another. T-Shoshin eliminated part of this problem by adding Cheriton's concept of team [CHER79a]. A team is defined as a group of processes that share the same code segment, global data structures and address space but each process having its own stack space for execution. In T-Shoshin, a group of processes can be collectively created as a team so that they are all in the same context and address space. This way context switching among team members is totally eliminated and byte copy overhead is reduced because byte copying within the same context is cheaper. It turns out that team members can also synchronize and communicate with each other via shared global variables. Thus in T-Shoshin, the team abstraction and the 1-1 I P C provide two simple and alternate means by which processes can communicate and work together to accomplish certain tasks or computation. Information is passed around through shared global data structures or through message passing. Synchronization can be achieved likewise. Nonetheless, there are certain computations which require yet another type of interprocess communication. This type amounts to multiple-send, that is being able to send a message simultaneously to several processes with a single send operation.  CHAPTER  1.  INTRODUCTION  4  There are a number of higher level applications which would require such a basic facility for achieving distributed computation. Such applications include querying a group of servers, notifying a group of processes of computational updates, or synchronizing among a number of processes. As an example, a distributed system may have a group of printer servers providing printing service. To determine which printers are available for printing a job, a client process may want to check with all printer servers to determine which one is most suitable for taking on the job. As another example, consider the scenario where there are x number of processes that have to synchronize with each other and any one of them can be the trigger for starting the synchronization. W i t h the existing facility this has to be done indirectly by either having all the processes be team members and wait on some global variable, or having them know each other's process id a priori and do multiple 1-1 sends. If a group communication facility is in place it can be done with a single multiple-send. The sending process would issue a single multiple-send operation to a previously defined group of processes upon which the system would automatically deliver the message to each process. There are a number of reasons why a multiple-send or group send should not be simply realized by a set of 1-1 I P C operations. First, the sender must know the ids of all the members of the group. This is impractical for a lot of application users. For example, to print a job a user needs to have at hand the identities of all the printer servers available for printing its job. If done with group communication, a set of static  CHAPTER  1.  INTRODUCTION  5  well-known group ids could be allocated to allow reference to certain system servers as a logical entity via a group i d . Second, not all the receivers have an equal chance of receiving the message first. It is always biased by the process order in which the sender sends out the message. Third, the message is not sent out concurrently to all the members. It can only be done serially. Finally, when the group contains members across several machines, more network traffic is incurred when a single multiple-send is replaced by a series of 1-1 sends. The lower level implementation uses broadcasting to deliver the message to all the machines on the network. In cases where there is a widespread of members across the network, broadcasting the message once onto the network rather than singly to each host can significantly reduce the amount of network traffic. There can be further reduction in overhead if the lower implementation uses multicast hardware. Based on all the above reasons, equipping T-Shoshin with a group interprocess communication facility will strongly enhance its usage as an operating system that supports distributed computation.  The implementation of a group communication  facility on T-Shoshin will open up a number of areas where further work can be done. First, there can be closer examination of the client-server model which T-Shoshin ideally supports. With group operations and group IPCs in place a server can easily make its services available by joining a well-known server group. It can just as easily withdraw its services by leaving the group. On the other hand clients need not know  CHAPTER  1.  INTRODUCTION  6  in advance the identities of the server (s). Neither do they need to keep up-to-date as to the status of the server (s). A l l they need to do is to send their requests to the well-known server group id and the messages will automatically be delivered to all the server instances that are ready to provide the service. Second, certain system functions such as load balancing and process migration become easier to implement. Each machine can broadcast or multicast its status to all the machines on the network or to a specific set of machines. The choice of grouping is done entirely through the group maintenance operations.  Third, application programs that required a group  communication capability can now be much readily realized. These could range from distributed programs that need to pass computational updates to each other and to a database program that will update its database only when all the components have simultaneously committed to a transaction operation. There is a further advantage from the programmer's point of view. Using the group send will require less coding and maintenance effort since a sequence of 1-1 sends is replaced with a single multiple-send.  1.2  Thesis  Outline  Excluding the introduction, the thesis consists of five chapters as follows. Chapter 2 provides a description of process groups, group operations, and group management. Chapter 3 presents the semantic model of group communication and also defines the  CHAPTER  1.  INTRODUCTION  7  set of I P G primitives used for group communication. Chapter 4 discusses how group management and group communication was implemented. In Chapter 5 performance measurements and analysis of the implementation are presented. Finally Chapter 6 provides some concluding remarks and offers suggestions for future work.  Chapter 2 Process  Groups  Since the goal of the thesis is to provide a group communication facility on TShoshin, it is important to first have a clear notion of process groups. In this chapter, the definition and representation of process groups will be provided, followed by a description of group management operations.  2.1  Process Group  Definition  Groups in T-Shoshin are very similar to V process oroups[CHER85]. A group is a set of zero or more processes that are collectively referenced by a groupJd.  A  groupJd is similar in syntax to T-Shoshin process_ids but different in semantics. Each group J d has two parts, the first part of which refers to a class of related groups and the second part to a specific group within that class. For instance, one might have a set of statically allocated group_ids all belonging to the class system [Fig. 2.1]. Processes within a group may span a machine or any number of machines. A l l the members  8  CHAPTER 2. PROCESS  GROUPS  9  of a group are deemed to be equal and every member can potentially receive all the messages sent to the group. Parts of a G r o u p Id class  group  i  g r o u p or p r o c e s s  Examples  •  id bit  of g r o u p ids:  system printer-servers system : file-servers system : disk-servers U B C : grad-students U B C : faculty-members U B C : staff  Figure 2.1: T-Shoshin Group Ids  2.2  Group  Management  Operations  A set of group management operations was set up such that group manipulation is simple and has minimal restrictions on grouping rules. It has some degree of control over who joins the group, and hence who receives messages sent to the group.  It  however, does not have control over who send messages to the group. Given the objective as such, the rules are: • •  any process can create a group and supply it with a control password any process can join, leave or destroy a group provided it knows the control password  CHAPTER  2. PROCESS  GROUPS  •  there is no fixed limit on the size of a group  •  members of a group will receive all messages sent to the group  •  10  any process can send messages to a group. There is no control over which processes can send messages to a group. The following operations are used for processes to create, join, leave, or destroy  groups. A new group is created by group J d = create_grp (control password). The system looks for a unique and unused groupJd and returns it as the new groupjd.  A control password must be supplied as a means of control over which  processes can join the group or destroy the group. The control password is a string of characters up to 64 characters long. If a password is deemed unnecessary it can be bypassed by supplying a null string as password. A newly created group has zero members in it. After a group has been created any process can try to join by issuing the status = j o i n ^ g r p (groupjd, password) primitive. The password parameter is used for matching the control password. Status returned by join_grp can be O K , G R O U P _ D O E S _ N O T _ E X I S T , W R O N G - P A S S W O R D , or some other error code. When a process joins a group it will start having messages sent to the group delivered to it. After a process joins the group, it can leave the group by issuing another primitive, status = leave_grp (group-id, password).  CHAPTER  2. PROCESS  GROUPS  11  If the process leaving the group is already a member of groupJd and the password parameter matches the control password, then it is removed as a member of the group. Status returned by leave^grp can be O K , G R O U P _ D O E S J M O T _ E X I S T , W R O N G JP ASS W O R D , N O T _ M E M B E R , or some other error code. Once a process leaves a particular group it stops receiving messages sent to that group delivered to it. Note that a process J d wasn't supplied as a parameter to both j o i n and leave primitives. This prevents processes from making other processes join a group or removing other processes from a group. A group is not automatically destroyed when it becomes empty.  It has to be  explicitly removed by the destroy primitive. status = destroy_grp (groupJd, password). As before the supplied password must match the control password. When the destroy primitive is invoked all the members of the group are removed and the group J d is deallocated. The return status will be O K is everything went fine. Otherwise it can say G R O U P J D O E S J M O T _ E X I S T or some other error code. This primitive has not been implemented yet. There are a number of criticisms that can be legitimately raised with regards to the way groups are formed and managed. There is no protection against bogus senders. Groups may become too large. There is also no elaboration on recovery of group information in the event of a crash or mishap. It is dangerous in to allow any process  CHAPTER  2.  PROCESS  GROUPS  12  to destroy a group as long as it is able to acquire the control password. The focus of this thesis is to concentrate effort in building a group communication facility in T-Shoshin. The issues of group membership can be addressed in the future when it is shown that group communication is viable and effective in T-Shoshin. A t the moment we only need minimal rules to get groupings to work.  Chapter  3  Semantic M o d e l of G r o u p Communication  3.1  O v e r v i e w of  Model  Group communication in T-Shoshin is done through message-passing. When a process initiates a group send it is simultaneously in communication with all the members in the group as illustrated in Figure 3.1. A sender starts up a new group communication by addressing a message to a group. It then blocks until a copy of the message is delivered to the "message queue" of all reachable members of the group. Reachable usually means those existing members whose host is still alive and resides in the network partition as the sender at the time the message went out for delivery. The total count of the reachable members is returned to the sender as it is unblocked. The sender must set the number of replies it wants back. As replies come in, they will be queued up to the number of replies wanted. The rest are discarded to avoid taking up unnecessary buffer space. The sender can obtain these replies one by one through a 13  CHAPTER 3. SEMANTIC MODEL OF GROUP COMMUNICATION  14  new primitive provided. After the group messages get queued up in the message queue, group members receive and reply to group messages as if they were 1-1 messages.  Figure 3.1: Group Inter-Process Communication Model  Group communication was done through message-passing because it seemed very compatible to extend the 1-1 I P C . The set of 1-1 I P C primitives is almost complete to support most types of applications. The only primitive that is missing is the non-blocking send.  For certain argued reasons, the non-blocking send wasn't  implemented[ACTON85]. The extension of 1-1 I P C to include the semantics for group communication does not seem awkwardly contrived or unnatural. In fact it seems to fit appropriately. Instead of sending a message to one process, the same I P C operation can be used to send to a group. Likewise, instead of waiting a message from a specific process, the receiver can wait for a message directed to a specific group that it is a  CHAPTER 3. SEMANTIC MODEL OF GROUP COMMUNICATION  15  member of. When a group communication starts up, an "attempt" is made by the system to deliver the message to every member in the group. Drawing an analogy to datagram protocol, it is a best effort attempt with no guarantee of success. In other words there is minimal reliability. For lack of better terminology, it is termed as a "broadcast datagram". In a client-server paradigm, such degree of reliability may be sufficient. Consider sending a job out to be printed. The client is really interested in communicating only with those printer servers that are available, that is, those that are alive and those that can be reached. In many applications not knowing the identity and not being able to have contact with the remaining printer members is acceptable since those printer servers can't do the job anyways. Moreover the degree and kind of reliability is usually application dependent and best left to upper communication and/or application layers to determine what suitable reliability measures to incorporate[CHER85]. In contrast to V send, which is 1-reliable, the senders get unblocked when a copy of the message is delivered to the "message queue" of every reachable member in the group. A t the time of unblocking, the count of all reachable members is returned. This way the sender can have the flexibility of performing other computations and block to get a reply when it is available. The count returned by the group send is not exact but is very useful in providing an estimate for the minimum number of replies that can potentially come back. The count is an indication of all those members which got  CHAPTER  3. SEMANTIC  MODEL  OF GROUP  COMMUNICATION  16  a delivery of the message and whose acknowledgement of that delivery reached the sender. There may be other members who got the message but their acknowledgement of the delivery never reached the sender (see discussion in next section for the definition of delivered). With this count, the sender can set an upper bound to the number of replies it wants back. This saves the sender from waiting needlessly for replies which may not be available.  3.2  Group Communication Operations  In designing the I P C primitives for group communication, care was taken to ensure that they are simple, clear and have straightforward and intuitive semantics. The approach taken was to redefine the 1-1 I P C primitives so that the same semantics can be used for group communication rather than introducing an entirely new set of primitives. The primitives for group communication consist of the ones used for direct 1-1 I P C plus two additional ones. The semantics of the 1-1 I P C have remained exactly the same. Only some syntactic changes were made to make it consistent for use with group communication. In the discussion that follows, a distinction is made between a message that is "delivered" and one that is "received". A message that is delivered means that it has been "dropped into the receiver's mailbox". In our implementation, this means queued up in the receiver's message queue. A message that is received by a member means that the member has "executed a receive primitive to obtain the  CHAPTER  SEMANTIC  3.  MODEL  OF  GROUP  COMMUNICATION  17  message". The following is a list of the I P C primitives when used for group communication.  cnt  =  bsend(togid,&msg,m,mtag)  (pid,nr)  =  brec(fromgid,&buf,n)  (pid,nr)  =  nrec(fromgid,&buf,n,mtag)  (pid,nr)  =  brecany(&buf,n,mtag)  (pid,nr)  =  nrecany(&buf,n,mtag)  ns  =  reply (pid,&msg,m)  (pid,nr)  =  request (togid,&msg,m,&buf,n,mtag)  np  =  totreply(gid,p)  (pid,nr)  =  getreply(gid,&buf,n,timeout)  "togid" stands for the destination group address while "fromgid" stands for the source group address, "gid" is group J d and usually refers to the current group communication taking place, "pid" is process j d . "&msg" and "&buf" specify, respectively, the addresses of the message to be sent and the buffer area to place a received message in. The parameter "m" indicates the size, in bytes, of the message to send and "n" specifies the size of the buffer area for the received message, "mtag" is used for matching selective receives and sends, "p" is used for setting the number of replies the sender wants back and "timeout" is used for setting the timeout value in waiting for a reply to be received.  CHAPTER 3. SEMANTIC MODEL OF GROUP COMMUNICATION  18  A l l the I P C primitives are defined as functions and they return some value relevant to the function,  "nr" refers to the actual number of bytes received. If "nr" is zero  or less than zero then it becomes a status indication. The same interpretation goes for "ns", the actual number of bytes sent, "cnt" refers to the minimum number of members that got the message when its value is nonnegative. Otherwise it becomes a status indication, "np" is the maximum number of replies the user can get back. If it's zero then the group communication terminates. If it's negative then it is a status indication. There is a restriction set on the maximum length of group messages. This restriction is imposed for reasons of efficiency. The maximum data size that can be sent is 1418 bytes = 15H(standard packet) - lA(Ethernet header) - 2Q(IP header) - 8(UDP header) - 54(Shoshin header). Having a long message sent to a group results in the message using up a lot of buffer resource at each host. B s e n d is used for sending out the group message to a group by referencing the groups's id. The request primitive can also be used to send a group message but it blocks until a reply comes back and then the communication terminates. A member process can receive a message by executing one of the four receive primitives. B r e c a n y and nrecany are blocking and non-blocking receive any and used to receive messages from any processes sending to any group the receiver is a member of. B r e c a n d nrec are more explicit. B r e c is a blocking receive for a specific group while nrec  CHAPTER  3. SEMANTIC  MODEL  OF GROUP  COMMUNICATION  19  is a non-blocking receive for a specific group. A l l of the receive functions return the process J d of the sender process. R e p l y generates a reply back to the sender process. The reply for a group send/request is different from a 1-1 reply in that the member gets blocked until the sender executes a getreply (covered below) and is ready to accept this particular reply. This implies two things. First the member will be blocked until the sender is ready to receive its reply. Second the member can get blocked forever if the sender never issues a getreply. In the latter case, the member will unblock if and when the group communication terminates. The reply operation will fail with a status indicating that the group communication has terminated. Two new primitives were added for managing replies. The first is totreply which sets the total number of replies expected back. B y default the total number of replies that come in and are queued is initially set to zero. It remains zero until set (or reset) by totreply. To end a group communication immediately totreply can be used with a value of 0. This allows the user to send a message to all the members without expecting any reply back, although there is no guarantee that all members will receive the message. The other new primitive is getreply which gets back a reply for the sender one by one as needed. It allows a timeout value to be set to prevent the sender from blocking forever. A group communication is terminated in one of four ways. First, when the sender explicitly sets the totreply value to zero. Second, when all the expected replies for the  CHAPTER 3. SEMANTIC MODEL OF GROUP  COMMUNICATION  20  group communication have been satisfied. Third, when the sender dies. This is equivalent to the situation where the host of the sender crashes. Lastly, a communication is implicitly terminated when it is superceded by another one. Superceded means that the sender initiated a new group communication to exactly the same group while the previous one was still going on. When a communication terminates, the members can no longer receive the messages for that communication. Neither can they reply to it. If they do the I P C will fail returning with it a status indicating that the communication has terminated. A process can initiate more than one group communication at a time but these outstanding communications has to be addressed to different groups. They are kept separate by the group j d s . If a sender initiates a new group communication to the same group when the previous one hasn't terminated, the previous one will be overridden. The new message will be delivered to all the members but kept distinct from the previous one. However for efficiency reasons the old message is not removed from the member's mailbox. The member will get a group communication termination status instead of the message when it tries to receive the old message. Replies made to the old messages are detected and they fail as well.  CHAPTER  3.3  3.  SEMANTIC  MODEL  OF GROUP  COMMUNICATION  21  Semantics for Failure a n d A b n o r m a l Conditions  The previous section described what would transpire in a group send if everything went smoothly in the communication process. However there are always occasions where something would go wrong. A significant part of the semantics goes into defining what would happen under these abnormal conditions and the kind of status indication the primitives return. They are just as important as the semantics defined for normal conditions as the application users have to know and anticipate them. This is more so in the current implementation because the group communication only provides minimum reliability and it is up to the upper layers to build the required amount of reliability that is desired. 3.3.1  Semantics  for  Failure  The failure semantics are used by upper layers for identifying as closely as possible the nature of the problem so that they may devise an appropriate mechanism for handling the failure.  A communication  abort concept was used by Tokuda and  Manning[TOKU83b] to simplify the detection of unsuccessful or illegal communication and to avoid communication deadlocks at the same time. Whenever one of these conditions are detected, the communication is aborted and the I P C returns a negative value indicating the cause of the abort. The original set of status was extended to include new ones required to handle  CHAPTER  3.  SEMANTIC  MODEL  OF GROUP  COMMUNICATION  22  certain status specific only for group communication. As well, some of the original status changes slightly in meaning when interpreted for group communication. The set of abort status that can be returned are shown in Table 3.1.  NO.COMM: ADDR_ERROR: HARDWARE-FAIL: TIMED_OUT: NOVEMBERS: CM_ERROR: HOST_DEAD: PROODEAD: NO_BUFF: CM-DEADLOCK: CM-MISMATCH:  group communication has terminated illegal message address hardware or network failure IPC timed out empty group unknown error host at other end is dead process at other end is dead no buffer communication deadlock between sender and receiver mismatch of I P C between sender and receiver  Table 3.1: Set of Abort Status and Their Interpretation  The first status is N O . C O M M and it is used as follows. When a group communication terminates (either normally or abnormally) not all the involved process may be informed about it. Subsequently a process may want to receive a message, send or receive a reply associated with this terminated communication. Under these circumstances, the N O . C O M M status is returned. When an address or bus error as a result of supplying a wrong address occur, A D D R - E R R O R is returned. H A R D W A R E - F A I L signals some sort of failure that can be determined to stem from the hardware or network failure. T I M E D _ O U T is used to indicate that an I P C has timed out. This is used  CHAPTER  3.  SEMANTIC  MODEL  OF GROUP  COMMUNICATION  23  by g e t _ r e p l y to avoid being blocked forever. When a group send/request is made to a group that is empty, the NOJVTEMBERS status is returned. This status is made distinct from a status of 0 to emphasize the fact that the group communication is being aborted. C M - E R R O R indicates unknown errors which the communications manager cannot figure out. It is usually a result of an inappropriate I P C being issued. The other five statuses enumerated are used only in the context of 1-1 interprocess communication. Both the H O S T J D E A D and P R O C J D E A D condition are determined through probe timeouts. The timeout value is a system parameter that can be tuned for optimum performance.  The N O _ B U F F status indicates that either the sender  or receiver's host ran out of buffers to contain system parameters and/or messages. C M _ D E A D L O C K attempts to inform the invoker that a deadlock was detected between the sender and receiver. Both process' I P C are aborted. C M _ M I S M A T C H on the other hand detects a reply was issued to another process which does not have an outstanding communication. It is interesting to note that further studies could be done on the failure semantics to determine whether they are "rich" enough for most if not all types of application. 3.3.2  A b n o r m a l  Conditions  In the current implementation some of abnormal conditions considered are: •  sender or member dies  •  a host in the network crashes  1  CHAPTER  •  3.  SEMANTIC  MODEL  OF GROUP  COMMUNICATION  24  the network is partitioned  [Fig. 3.2] shows what happens to the sender and the members under various abnormal conditions at various stages of the group send. The first stage is "send initiated". This refers to the stage where the sender has just initiated a group send and it is being broadcasted onto the network. The second stage is "message delivered". This refers approximately to the time where the remote hosts pick up the broadcast message and queues it up to each member's message queue. The third stage is "message received". At this stage some of the members have already picked up the message by executing a receive primitive. The next stage is "member replies". Some of the members start replying to the sender. The fifth and final stage is "communication being terminated". To track what is going on in a group communication is not simple because there are several open lines of communication between the sender and its members. However there are a number of general rules which prevail. Firstly, the group communication is in the form of a broadcast datagram.  When the group message is broadcasted  through the network there is no telling who will get it and who won't. It all depends on what the conditions of the entire system is at the time of the group send. In effect not all the members will get a delivery of the message and out of those that got the message not all of them will be counted in. Secondly, when a member executes a receive primitive to accept a group message, the acceptance will be successful only if the group communication is still ongoing. The same thing goes for a group reply.  CHAPTER 3. SEMANTIC MODEL OF GROUP COMMUNICATION  25  Thirdly, when a message is being terminated the system broadcasts a T E R M I N A T E message to all the members on the network. This T E R M I N A T E message may not be picked up by certain members. However eventually these members will be properly informed in one of several ways. Subsequently they could try to receive the message or reply to it and find that it has already terminated. Another way that a member can tell that a group communication has already terminated is through probing. The member will continually probe the sender to determine whether it is alive and whether the communication is still going on. If the group communication has already terminated then the probe will fail after a maximum number of tries. (The probe is done in the background by the system on behalf of the user.) Referring to the figure, in the first stage of the group communication a group send is initiated. When the sender dies or its host crashes, it is unpredictable how far the group communication will proceed. Some or all of the members may get a delivery of the message and receive it as well. But this is as far as it goes. None of the member will be able to reply to it. It will terminate eventually because the member will probe and find out that the sender or the sender's host is dead. A number of abnormal conditions may cause a member not to be counted in at the time of the group send.  These  conditions include the member dying off, the member's host crashing or the member's host being temporarily severed from the network or the member's host residing in a partition that doesn't contain the sender's host.  CHAPTER 3. SEMANTIC MODEL OF GROUP COMMUNICATION  26  The next stage of the group communication is identified as the member getting a copy of the message in its message queue. A t this point the member can potentially receive the message as long as the group communication is deemed to be still alive when the member execute the receive primitive. If conditions occur that cause the member to perceive that the communication terminated then the member will never get a chance to receive the message. Similarly if the member dies off or its host crashed, the sender will never be able to obtain back a reply. The next stage is naturally the member executing a receive primitive to accept the message.  A t this stage the member is allowed to reply and the hence most of  the considerations regarding what may occur centers around the reply. If the sender becomes permanently unreachable, the reply simply fails. O n the other hand if all the member becomes permanently unreachable getjreply fails. If the abnormal condition is temporary such as short severance from the network the reply may or may not be successful. The next stage is when the member replies. Under most abnormal conditions it is bound to fail. It is successful only if the condition is temporary like a host being temporarily severed from the network. Lastly, when the group communication is terminated a broadcast T E R M I N A T E message is sent out. If everybody gets the termination message then the communication terminates consistently. Otherwise it could be inconsistent in the sense that  CHAPTER 3. SEMANTIC MODEL OF GROUP  COMMUNICATION  27  some of the members may not know about it. This problem is solved by probes from those members. One way or the other the communication will eventually terminate consistently.  CHAPTER 3. SEMANTIC MODEL OF GROUP COMMUNICATION  ABNORMAL CONDITIONS  1  •STAGES OF GROUP COMMUNICATION send initiated  group comm eventually sender dies terminates, unpredictable how far it aoes sender's host group comm eventually crashes permanently terminates, unpredictable or for a how far it very 'long' goes time  message delivered  message received  member may receive the message, group comm eventually terminates member may receive the message, group comm eventually terminates  member can't reply to message, group comm eventually terminates member can't reply to message, group comm eventually terminates  member replies  reply fails in the middle, group comm eventually terminates  OK  reply fails in the middle, group comm eventually terminates  group comm maybe in an inconsistent state but eventually terminates  sender never sender never gets back a gets back a reply reply  reply fails in the middle  OK  member's host crashes member not permanently counted in or for a very 'long' time  sender never sender never gets back a gets back a reply reply  reply fails in the middle  OK  message may be recroadcasted or group comm terminated  group comm on remote host may terminate  member's reply may fail  member's host temporarily severed from the network  some members maynot be included in the count  group comm on remote host may terminate  member's reply may fail  network partitioned  only those members on same partition as sender gets counted  group comm not on sender's partition ma) terminate  can't obtain reply from members who are on different partition  reply may or maynot reach the sender  •  communication being terminated  member not member dies counted in  sender's host temporarily severed from the network  28  group comm maybe in an inconsistent state but eventually terminates  group comm maybe in an inconsistent state but eventually terminates group comm reply from maybe in an members not inconsistent on sender's state but partition are eventually missed terminates reply may or maynot reach the sender  Figure 3.2: How Abnormal Conditions are Handled under Various Stages of a Group Communication  Chapter  4  Implementation  The implementation of group communication on T-Shoshin was on a collection of Sun 2/50s. These workstations had a different Ethernet controller from the older Sun 2 machines. The first section describes the I n t l 82586 Ethernet controller device e  driver. Subsequent sections discuss how the group implementation was done.  4.1  Ethernet  Interface  T-Shoshin was reinterfaced with another Ethernet controller to run on a collection of Sun 2/50s which were available for implementing group communication. The previous version of T-Shoshin ran on Sun 2/l20s which have a 3 C O M Ethernet Controller while the newer models, Sun 2/50s have an I n t l 82586 e  1  Ethernet controller. The  controllers were substantially different from each other. Between the two controllers, the 3 C O M had simpler features and was easier to op1  Int l is a registered trademark of Intel Corporation e  29  CHAPTER 4.  IMPLEMENTATION  30  erate. It has two fixed-length receive buffers and one fixed-length transmit buffer. Byte transfer to and from the Ethernet were directly taken out of these buffers by the 3 C O M controller. Coupled with these buffers are a number of registers used for controlling the Ethernet by setting interrupt modes and for reading the status of a frame transmission/reception. For each frame that comes in or goes out, the C P U is immediately notified through an interrupt signal. On the reception of a frame (or frames) the C P U must place high priority in processing the frame (or frames) before they get overwritten by a subsequent frame arrival or before an incoming frame is lost because the 3 C O M was suspended in a " N O - R E C E I V E " mode. The buffers and registers are all located in fixed system memory addresses. To avoid copying the contents of these buffers and registers from system address space to the Communications Manager's address space every time a transmission or reception took place, memory blocks within the Communications Manager's address space were allocated and memory-mapped into the system address space at initialization time. This way the Communications Manager can access the buffers and registers through its own address space while the 3 C O M controller can likewise access it through the system address space without any need for copying. The I n t l 82586 is a more intelligent Ethernet controller having far more features e  than the 3 C O M . Although these features made the controller more flexible and powerful, they also made it more complicated and more involved to operate. The I n t l e  82586 can support several types of network configuration through parameterization.  CHAPTER  4.  IMPLEMENTATION  31  The default configuration at power-up is the I E E E 802.3 10base5 standard. Even with the default configuration in place, the I n t l 82586 still has to be programmed to make it e  work properly with the C P U . Programming the I n t l 82586 is similar to programming a e  specialized hardware using some form of assembly language. The device driver written for the I n t l 82586 must initialize the controller by establishing a predefined interface e  between the host C P U and the controller. This interface is termed "shared memory structure" and it contains different kinds of command blocks and buffer descriptions that the host C P U would use to operate the controller and that the controller would use to inform the C P U of its operation status. Programming the I n t l 82586 consists of setting up chained command blocks ine  side the "shared memory structure" which contain command codes and parameters. These command blocks are then passed to the I n t l 82586 for execution. Whenever e  the I n t l finishes execution it would generate an interrupt and pass back status and/or e  received frames to the C P U . A l l these status and frames are likewise contained within the "shared memory structure". One of the commands is to start the controller for transmission of frames onto the Ethernet. The device driver has to tell the controller the source address of the transmission frames. In particular it must pass system addresses to the controller. The scheme used in the original version of T-Shoshin for allocating buffers inside the Communications  Manager and memory-mapping to the  system address space is also used here. There are two things to note though. First  CHAPTER  4.  IMPLEMENTATION  32  the system address space used for the buffers is not fixed and can therefore be chosen to be anywhere within the system address space that is not utilized for anything else. Second, the device driver must keep a mapping of buffer address that map from the Communications Manager to the system address. One of the more useful feature that the I n t l 82586 has is to allow for a variable e  number of receive buffers to be set up by the host C P U . In addition each of these "buffers" used for storing a single frame is made up of a chain of linked buffers. A t initialization time the device driver allocates a set of arbitrarily-sized buffers and chain them together onto a free list. Whenever a frame arrives the I n t l 82586 would grab an e  appropriate number of buffers off the free list to store the frame. Afterward the I n t l e  82586 would interrupt the C P U . The controller can also be set up to interrupt after every frame arrival or after x number of frame arrivals. This feature has a number of advantages. First the host C P U can allocate more receive buffers as necessary to minimize loosing frames during heavy traffic. Also by allocating more receive buffers there is less chance of frames being overwritten when the C P U doesn't respond fast enough. Second, space inefficiency can be reduced by allocating smaller buffers. Third, the number of receive buffers allocated can be catered to specific system configuration and load. The I n t l controller can function a lot on its own. It can manage transmission e  and reception, relink buffers, gather statistics, and arbitrate collision all without the  CHAPTER  4.  IMPLEMENTATION  33  intervention of the host C P U . In other words, a large range of operations goes on within the I n t l 82586 beyond the control of the C P U . Sometimes it becomes difficult e  to ascertain the nature of certain problems when something is amiss. As an example, at one point in the implementation certain frames were lost in the Ethernet and others weren't. The controller was reporting that everything is fine. It turned out that frames less than the minimum configuration length were being dropped by the controller automatically without any indication. Only through several trial runs using all sorts of packets and a very careful reading of the manual was the problem determined. The complexity of operating the controller as well as potential problems such as the one reported above encourages the use of only basic I n t l 82586 features. e  4.2 4.2.1  Group G r o u p  Management Identifiers  A group J d is a 32 bit unique identifier that has a syntax similar to process j d s . It is differentiated from processJd in that the group bit is set to 0 rather than 1. Like the V group Jds, the group bit allows the kernel and communications manager to distinguish between process and group Jds. It also ensures that the process and group J d s ' space are disjoint. A group j d has two parts. The first part designates the class, that is a category of groups which are related in some fashion. For example system servers such as printers servers, file servers and disk servers can all belong to the class "system".  CHAPTER  4.  IMPLEMENTATION  34  The second part designates a specific group within that class. There is a single server process in the system (group-id s e r v e r ) which looks after allocating system-wide  unique  group Jds. Requests for creating new groups would be  directed to the group-id server which would come up with a new group j d and associate it with the control password for the group. Any host in the system which is not familiar with a group J d and needs information about it would inquire from the group-id server. This central group-id server dispenses the need for a distributed agreement protocol to come up with a unique and unused group j d . When a group is destroyed, its group j d is deallocated by the group-id server. The group-id server maintains a table of <group j d , control password> pair. A group J d is considered allocated if it is in the table and not allocated if it is not listed in the table. 4.2.2  G r o u p  M e m b e r s h i p  The membership information of a group is not centrally stored but distributed across all the hosts which has members belonging to the group. Each host maintains information only about local processes and these information are stored in a table of <group J d , process J d > pair. A process p is a member of the group g if there is a  <p,g> pair in the group database of the local host. When a process wishes to join a group, the database managed by the local group server is first examined to see if there are existing members in the group already. If there is then the new member is added  CHAPTER  4.  IMPLEMENTATION  35  in. If not then the local group server obtains information from the group-id server. When a member p leaves a group g, then the <p,g> pair is deleted from the local group database. A message is delivered only to those processes which are members of the group at the time the group communication was initiated. Members which join the group after a group message was sent may or may not receive the message. This section describes how the destroy group could be implemented. Since all the information about a group is scattered across the system, it is very difficult to make sure that all the membership information regarding a group is removed when a group gets destroyed. When a request for group destroy is made, it is directed to the group-id server. The group-id server broadcasts that message to every local group server to delete its local information. The group-id server then waits for an acknowledgement from all the local group servers. If all the acknowledgements come back after the first broadcast then the operation is successful. Otherwise the broadcast is repeated. After a maximum number of repeats it is assumed that either a network partition has occurred or certain hosts have crashed. For a host crash it is assumed that the host can only be crash-fail. When a crash host restarts, it looses all its previous information. However fragments of a group may still exist if one of two conditions arise while a group destruction is taking place. First the network partitions. Second some host temporarily cannot send or receive messages. In either of these conditions occur there  CHAPTER 4.  IMPLEMENTATION  36  may be some hosts that fail to receive the group destroy message and hence membership information kept by those hosts are not deleted. In other words the group id remains allocated. The implication of having group fragments remain in existence is seen to be tolerable. Processes which wishes to join the group won't be able to do so unless they belong to one of those hosts that still keep fragments of the group. Messages which are send to this defunct group will be received by the remaining members but not by the other ones. A problem comes up when the group j d server wishes to allocate the id of this group as a new i d . A means to solve this problem is to broadcast another destroy message just before the allocation to further ensure that no duplication of group id exists. For the same group J d to be allocated to different groups it implies that the deallocation and reallocation must have occurred when a host was temporarily severed from the network or when the network is partitioned. There is a chance of this occurring but the chance is minimal.  4.3  Group I P C  4.3.1  G r o u p a n d  Operations  C o m m u n i c a t i o n s  C o m m u n i c a t i o n s  M a n a g e r ,  E t h e r n e t  M a n a g e r  M a n a g e r  The group I P C operations are implemented through a server process called the Group Communications Manager.  The Group Communications Manager (GCM) is  a distributed server that embodies all the tasks necessary to carry out the group IPC  CHAPTER  4.  IMPLEMENTATION  37  operations. Each host's G C M is responsible for all the group communications that any of its local processes are involved in. It keeps track of each outstanding group communication that takes place from the time that a group send has been initiated and the message broadcasted, to the time that replies get back to the sender. It obtains information from the local group server regarding membership information that it needs. It also shares the network interface to the ethernet and some communication state structures with the Communications Manager. A l l the local G C M s uses a mutually agreed on protocol to coordinate and manage the group I P C . In the original version of T-Shoshin it is the Communications Manager (CM) that manages all the remote 1-1 I P C as well as the ethernet resource. A l l the packets that needs to be transmitted and received to and from the network are handled solely by the C M . As such all the interrupts from the ethernet are also sent from a kernel interrupt process to the C M . In extending T-Shoshin to include a group I P C facility the G C M also needs to have access to the ethernet resource and be informed by the kernel interrupt process that packets have finished transmitting or that packets have arrived. To accommodate the needs of the G C M another manager was created to mediate the access of two communications manager for the same resource. This manager is called the Ethernet Manager (EM) [Fig. 4.1]. The E M gets all the interrupts from the kernel interrupt process through message passing. It then separates out transmit from receive interrupts. If  CHAPTER  4.  IMPLEMENTATION  38  the interrupt indicates that a packet's transmission has been finished the E M accesses the ethernet buffers and frees them. It then informs the appropriate communications manager by sending it a message. If the interrupt indicates that a packet has just been received the E M determines which communications manager should be handling the packet and informs that manager of the arrival through a message send. Subsequently when the message is picked by the C M or G C M it is processed accordingly by reading the ethernet buffers directly. Amidst all of these, whenever a packet needs to be transmitted by either communications manager the packet is formed and queued up for transmission onto the ethernet.  Ethernet (a)  Old  Configuration  (b)  New  Resource Configuration  Figure 4.1: Access to the Ethernet Resource by various Communications Managers  In order for the three managers to be able to access the same physical resource as separate processes, they all have to be created as part of a team. It makes no  CHAPTER  4.  IMPLEMENTATION  39  difference which manager is designated as the team root. In this implementation the C M is the team root and at boot time both the G C M  and the E M is created by  the C M . The ethernet resource structures such as buffers and queues are declared global to all the three team processes. The problem of inconsistency arising from several processes/modules accessing the same data structures is not present in T Shoshin because mutual exclusion among these processes is provided implicitly by the team implementation. Members of a team do not preempt each other during execution. A team member continues to execute until it gives up its control voluntarily. The only problem is that a team member may execute forever. Caution has to be taken that this does not occur. Right after its inception the G C M creates a timer process for its own timeouts. Then it blocks forever to wait for messages. The messages that it receives are the results of externally generated events. There are three externally generated events which the G C M must deal with.  1. A message from a clock process indicating that a fixed amount of time has expired. 2. A message issued by the kernel on behalf of a user process which indicates that a user process wishes to perform a group I P C . 3. A message from the Ethernet Manager indicating that a packet has been transmitted or has arrived.  CHAPTER  4.3.2  4.  G r o u p  IMPLEMENTATION  40  S e n d  When a group send is initiated by a user, it gets to the Group Communications Manager through a kernel message. The G C M then sets up an outstanding communication structure ( O U T C M ) for it. This structure is the key to identifying the group communication every time it's needed. It contains the vital pieces of information for managing an ongoing group communication. Such information include source process and destination group, state of the communication, number of replies wanted back, and so on. After assembling the message into a packet, the packet is queued for broadcast onto the network. When the packet gets onto the network the G C M waits for acknowledgements from all the other hosts indicating how many members in the group are on the remote hosts. When acknowledgements from remote hosts come in, the sender host checks them against the network configuration that it has. As soon as all acknowledgements come in the sender is unblocked. Otherwise it waits until the timer expires and the message is rebroadcasted. After a maximum number of rebroadcasts those hosts that didn't acknowledged are considered dead or temporarily down and their acknowledgements are ignored. The sender is unblocked with the accumulated count as the return parameter. Presumably after the sender unblocks it is waiting for replies from members. Every time a reply comes back the total number of replies expected back is reduced by one.  CHAPTER 4.  IMPLEMENTATION  41  While it is waiting the G C M does not do any probing whatsoever to see if the members are alive or whether they have received the messages or whether they are in the process of replying. The G C M only holds the necessary information such that when a reply comes in it can be properly recognized. The G C M on the other hand can and will be probed by remote G C M s as to the continued existence of this group communication. Furthermore the G C M checks on the sender process occasionally to see if it is still alive. This prevents the G C M from maintaining the O U T C M forever when in fact the sender is dead already. 4.3.3  G r o u p  Receive  Since broadcasting was used to transmit messages, all the hosts on the network pick up all broadcast packets that come by. When the broadcast packet contains a group message the G C M on the remote host sets up an O U T C M that is almost identical to the one on the sender's host. These remote O U T C M s acts as the "extension" of the original O U T C M on each remote host. The presence of a remote OUTCM oh a remote host signifies that the group communication is still ongoing as far as the remote host is concerned and its absence signifies the reverse. If the group message is an override of an existing one then the old O U T C M is simply updated (in particular the session number). After setting up a remote O U T C M the G C M queries the local group server for a  CHAPTER 4.  IMPLEMENTATION  42  list of members belonging to the group. W i t h the list it tries to queue up a copy of the message to each member on the list. Only those that have successfully been delivered a copy of the message are counted and the total count is acknowledged to the sender's host. A new message overidding an old one would also result in the new message being queued up without the old one being removed (assuming it hasn't been received). If by any chance the member executed a receive primitive to get the old message it will get an indication instead that the message is outdated and therefore cannot be received. The remote O U T C M would also keep a reverse list of all the members involved in that group communication as a means of tracing the members' activity. When a member gets a delivery of the message it's state is set to D E L I V E R E D . When messages queued up on the message queue of individual members gets picked up by the member when it executes one of the receive primitives the state of the member on the reverse list on the remote O U T C M is updated to R E C E I V E D . For each remote O U T C M that it is maintaining the G C M sends a probe to the sender's host regularly to verify that the group communication is still ongoing. This is to avoid the situation where the group communication has already terminated but some of the remote hosts aren't properly notified.  CHAPTER 4.  4.3.4  IMPLEMENTATION  G r o u p  43  R e p l y  When a member process executes a reply the I P C is set up such that the Communications Manager gets it instead of the Group Communications Manager.  The  idea behind it is that a reply from a member to a sender process is basically a 1-1 IPC exchange and hence should be handled by the 1-1 CM. Having the GCM handle it results in duplicated efforts. Close cooperation though is maintained between the two communications manager such that the reply will be directed properly to the sender and that the replies expected back is also reduced accordingly by the G C M . When the C M gets the reply message, it does not have the necessary information regarding the group communication. It seeks the information from the G C M . What the G C M does is check that it has a remote O U T C M corresponding to the group communication. Furthermore the member must be on the list and its state should be R E C E I V E D . If all of these conditions are satisfied the reply proceeds. A copy of the remote O U T C M is made from the G C M and moved to the C M ' s domain. From here the reply message is treated as a 1-1 exchange between the member and the sender. The 1-1 protocol for message exchange is applied. This happens to be a reliable protocol that mediates a reliable 1-1 exchange between two entities. On the other hand if not all of the conditions are satisfied the reply is treated as having failed and the replier is unblocked with the proper indication status. In delivering the reply back to the sender, the first thing that the C M does is send  CHAPTER  4.  IMPLEMENTATION  44  a control packet requesting to send a reply message. A t this point the sender may or may not be blocked waiting for a reply to come back. If it is blocked, then the reply message is immediately transferred to the sender. If the sender is not blocked an occasional probe will be sent out to check whether the sender is alive, whether the communication still exists, and whether the sender has already executed a get_reply. There is a possibility that the replying process would block forever waiting for its reply to be received by the sender. This problem is similar to using the 1-1 blocking send primitive. Both suffer from the same set of because both are blocking.  4.3.5  Termination of a Group Communication  A group communication terminates on one of the following conditions: 1. The total replies wanted back has been satisfied. 2. The total replies wanted back was explicitly set to zero. 3. The ongoing communication was succeeded by a new one. 4. The sender process died. A group communication is terminated when the user has set a number for the replies it wants back and all of them have arrived. A second way is when the sender forces the group communication to be terminated immediately by setting the replies count to zero. For both of the above the G C M will remove all traces of the group  CHAPTER  4.  45  IMPLEMENTATION  communication by removing the O U T C M associated with the group communication. It will also broadcast a T E R M I N A T E packet to all the hosts on the network to inform them that they should removed their remote O U T C M s as well.  Those that aren't  properly notified will carry on until it sends out a probe to the sender's host.  Note  that the replies wanted back can be set a number of times as long as the subsequent settings is less t h a n or equal to the previous setting. Otherwise it remains unchanged. A third way to terminate the group communication is b y issuing a group send to the same group. T h e new send overrides the old one. A l l the replies to the old one will fail as they are no longer applicable. T o avoid matching replies to the wrong group send a session number is associated with each group communication. E a c h time a process starts a new group communication it is assigned a new session number. E a c h reply that comes back must carry with it the correct session number to match the group send. A final way that a group communication terminates is when the sender dies a n d all the communication it is involved in becomes nonfunctional. T h e remote O U T C M s that reside on the remote hosts will eventually find out that the sender has died when it probes the sender's host. It will get an acknowledgement packet indicating that the sender has died. T h e O U T C M on the sender's host will also probe it's own sender to see if the sender is still alive. If it finds out that the sender has died unexpectedly it will initiate a group communication termination procedure.  CHAPTER 4.  4.3.6  IMPLEMENTATION  46  Certain Implementation Issues  A n upper limit has been set over the size of group messages such that it can fit in one network packet. This was done to minimize huge messages from being broadcasted and acknowledged by every host in the network. This is sufficient assuming that most applications would need to sent relatively small messages to groups of processes. Broadcasting or multicasting were the two alternative ways to get a message to all hosts on the network. V uses multicasting because it is taking advantage of the hardware multicast facility. The implementation here used broadcasting because not all ethernets have a multicasting facility. In particular the S U N Workstations on which the original T-Shoshin was ported to doesn't have any hardware multicasting facility.  Chapter 5 Results and Analysis The group communication model in T-Shoshin was formulated with simplicity and consistency as two of its major goals. The model is simple in that it is straightforward to track what occurs during the process of a group communication. This is also true in the case of abnormal conditions arising in the middle of an ongoing communication. It is consistent in that the delivery of messages to both local and remote members are done in exactly the same fashion. Changes made to the mechanism of delivery would be applicable both to local and remote sites. After the implementation a performance evaluation was undertaken. Measurements were obtained on the performance of the I P C primitives when used for group communication and the results reported here. These results show how good our model is under different parametric conditions. Through a careful examination of the results it is possible to determine where possible compromises could be made to the model to improve the efficiency of the I P C primitives. The results could also point out weaknesses in the  47  CHAPTER 5. RESULTS AND ANALYSIS  48  model which suggest possible improvements to the model itself.  5.1  Measurement M e t h o d and Environment The measurements were taken using five Sun Workstations connected through a 10  Mbps Ethernet. No attempt was made to take the measurements at "anti-social" times because such figures would give an indication of how fast the IPCs would perform at a situation which does not occur in reality. Instead the results show the performance of the primitives under a reasonable amount of network traffic. The measurements for all the group IPCs were taken following the outline of the two code fragments given in Figure 5.1. The first part corresponds to the sender process which continually produces group I P C messages. The bottom part corresponds to the member process which continually consumes the messages. Reply messages may be produced by the member process if required by the setup. The elapsed time calculated for an individual group send operation represents an average figure over the 1,000 sends. The overhead involved in making the two g e t t i m e system calls, the for loop control, and the assignment statements have not been deducted from the reported figures. The average overhead for the g e t t i m e call is 1.07 milliseconds which is insignificant when it is amortized over 1,000 I P C sends. The time for the loop control and the assignment statements are also insignificant compared to the time it takes to execute an I P C send.  CHAPTER  5.2  5.  RESULTS  AND  ANALYSIS  4 9  Performance Measurements  Measurements were taken for the 1-1 IPC primitives to see how the performance of the 1-1 remote communication is after the reinterface to an Int l Ethernet controller. e  Table 5.1 shows the measurements for various message sizes using <bsend,brecany> pair. Elapsed time refers to the time between the sending of a message by one process and the receiving of the message by another process. There is approximately a 3036 fold increase in the transmission time for 1-1 IPC send compared to the original T-Shoshin. Special attention is given to the case where the message size is 0. In the new T-Shoshin it takes 1.172 seconds to send a message size of 0. Acton made an analysis on the various execution costs of sending a packet. The possible cause in this case could range anywhere from processing overhead to operate and run the controller to costs of sending messages to coordinate various aspects of the transmission. Due to time limitation, no further analysis was done to determine the exact cause of such huge transmission time. However the next section contains a discussion on the analysis of the implementation. Some of the causes that give rise to higher elapsed time for 1-1 communication would also give rise to a higher elapsed time for group communication. With these causes in mind, it is expected that the transmission time for group communication would be fairly high as well. Measurements were carried out on the two group IPC primitives, bsend and re-  CHAPTER  5. RESULTS  AND  ANALYSIS  50  Message Size (bytes)  Elapsed Time (ms)  0 16 32 512 1024 2048 8192  1172 1173 1173 1179 1182 1730 3905  Table 5.1: Elapsed Time for 1-1 <bsend,brecany> pair quest. Most of the measurements were done on the bsend because it embodies the features of a broadcast datagram without the extra delays which are irrelevant to the actual sending of a group message. In contrast the elapsed time measurement of a request primitive is comprised of three components: the broadcasting of the sender's message, the waiting period that the sender must undergo for a reply to come back, and the time it takes for a member to send back a reply. The first and last components are the ones of interest. The first component is almost exactly similar to what the b s e n d does. The last component is a 1-1 reply which uses the 1-1 communication mechanism. The measurements for bsend are given in Tables 5.2 to 5.4. Table 5.2 reports the elapsed time figures when the number of remote members in a group is varied. In this table elapsed time refers to the time that the sender sends out a group message to the  CHAPTERS.  RESULTS  AND ANALYSIS  51  time that the sender is unblocked with the delivery count, x remote members means Number of Remote Members  32 bytes message elapsed time (ms)  1418 bytes message elapsed time (ms)  1 2 3  813.2 814 813  813.7 813 813  Table 5.2: Elapsed Time for Group bsend - A l l Remote Members  that there is a member in each of x number of hosts. There are two sets of figures, one for a message size of 32 bytes and a second set for a message size of 1418 bytes (the maximum allowable group message size). The figures show almost no variance but it is lower than that for 1-1  bsend.  The 1-1  bsend  is a reliable protocol and  involves more packet exchanges to transmit a message. It involves the sending of four packets between the sender and receiver host in the following order: R T S (Request to Send), C T S (Clear to Send), Data, Ack (Acknowledgement). In group communication, the sender's host broadcasts the message and all the hosts pick it up simultaneously. Subsequently they all send back an acknowledgement at roughly the same time. It remains for the sender's host to pick up and process these acknowledgements. Table 5.3 reports the elapsed time figures for different number of local members in a group. The elapsed time figure is similar to the one for Table 5.2. It is the time that the sender send out a group message to the time that the sender is unblocked with a  CHAPTER  5.  RESULTS  AND  52  ANALYSIS  delivery count. Since the implementation for local and remote members are exactly the same, the elapsed time is not expected to be any lower than that for sending group messages to remote members. The group message still has to go out on the Ethernet. It is then picked up by the sending host and processed just as if the sending host was receiving broadcast messages from other hosts. For this measurement the total number Number of Local Members  32 bytes message elapsed time (ms)  1418 bytes message elapsed time (ms)  1 3 5 7 9 11 13 15  819 828 834 868 854 888 872 888  827 844 869 872 881 922 940 935  Table 5.3: Elapsed Time for Group  bsend  - A l l Local Members  of remote members are kept at zero. Two sets of figures are also presented here, one for a message size of 32 bytes and another for a message size of 1418 bytes. The figures for both sets show an increase in the elapsed time as the number of local members increases. Furthermore the set of figures for a message size of 1418 bytes is higher than those for a message size of 32 bytes. The figures reflect how group communication is implemented. When a group com-  CHAPTER  5. RESULTS  AND  ANALYSIS  53  munication is initiated the message is broadcasted onto the network and the Group Communications Manager waits until an acknowledgement is received from every host in the network. As each host (the local and all the remotes) picks up the message it queues the message onto the "mailbox" of each member. This explains why the elapsed time increases as the number of local members increase. The more members there are the more processing time it takes to do the queuing before an acknowledgement can be sent back. It also explains why the processing time stays roughly the same when each remote host has only one member [Table 5.3]. The G C M on the sender's host has to wait until the slowest host sends back its acknowledgement. When there is only one remote member, the remote host containing that member is the last to send back an acknowledgement. When there are two or more remote members, the remote hosts containing these members are the last to send back their acknowledgements. A l l of them send it back at roughly the same time. The elapsed time is longer when the message is longer because it takes some time to copy the messages from the G C M ' s buffer to the receive buffer of each member process. This same delay is not seen in Table 5.3 because the members are on remote hosts and do not affect the sender's host. Several average values were calculated for Table 5.3. First, the average increase in time caused by having an additional local member in a group is 4.9 msecs if the message size is 32 bytes and 7.7 msecs if the message size is 1418 bytes. Second, transferring a message of 1418 bytes consistently takes a longer time than transferring a message of  CHAPTER  5.  RESULTS  AND  54  ANALYSIS  32 bytes. The average difference is 29.9 msecs. The results of Tables 5.2 and 5.3 together seem to imply that there is some delay involved in waiting for an acknowledgement to come back from each host even if there is not much overhead incurred in processing the acknowledgements. This leads to the question of whether it is worthwhile for the sender to wait for an acknowledgement from each host. The main reason for these acknowledgements is that they contain the number of remote members on a remote host. When all these acknowledgements return to the sender's host, the total number of remote members are added up and the totaled value is passed to the sender. This value is a very useful feature to include in the model. As the sender unblocks it will have an indication of how many members could potentially send back a reply. Based on this knowledge it could judiciously select an appropriate number of reply messages.  To determine the delay incurred by the  acknowledgements, a further experiment could be carried out which eliminates the need for each host to send back an acknowledgement and have the sender unblock as soon as the message is broadcasted onto the network. Table 5.4 reports the elapsed time figures when the group is empty, that is there are no local and no remote members. These figures are surprisingly high. This is attributable to the way the group send was implemented. When all the host acknowledgements arrive and it was found that the membership is zero, the G C M proceeds to terminate this communication. It broadcasts a T E R M I N A T E message to all the hosts.  CHAPTER  5.  RESULTS  AND  55  ANALYSIS  Message Size (bytes)  Elapsed Time (ms)  32 64 128 256 512 1418  1293 1342 1341 1340 1319 1342  Table 5.4: Elapsed Time for Group bsend - Empty Group This way all the system resources occupied by the communication can be reclaimed immediately. Furthermore there is no need for regular garbage collection. A n alternative is for the communication to remain in existence until either it is superceded by another group communication or the sender process dies. These alternatives presents another case of time versus resource trade-off. The first alternative shows a degradation in system performance but makes better use of system resource. The latter will result in better I P C performance but system resources are unnecessarily occupied. It cannot be definitely stated which is the better solution. The choice of which one to use is dependent on the needs and usage of the system. Table 5.5 shows the figures for a group send using the request primitive. After the message is send out, the request primitive blocks until one reply comes back. Here elapsed time refers to the time when the sender issues a request primitive to the time  CHAPTER  5. RESULTS AND  ANALYSIS  56  Message Size (bytes)  Elapsed Time (ms)  32 64 128 256 512  2146 2113 2090 2247 2249  Table 5.5: Group request with One Remote Member that it gets back a reply in return. A reply message going from a member to a sender is done using the 1-1 I P C mechanism because it is essentially a 1-1 I P C send. The 1-1 I P C mechanism uses a protocol that guarantees reliable delivery. This introduces a very long delay in the message transaction caused by the sender waiting for a reply. The delay is due to the way the 1-1 reply was implemented. When a member initiates a reply, a probe is send out to the sender to check whether the sender is ready to receive the reply. If the probe fails (that is the sender is not ready), the replier will sleep for a period of time and try again. In the meantime the sender could become ready for a reply but it has to wait until the replier sends the next probe before it could say Clear to Send. Acton reported that the remote I P C elapsed time for a bsend is 32.3 msecs for a message of 32 bytes and 36.2 for a message of 512 bytes [ACTON85], Assuming that a reply message would take roughly the same time, the average amount of time that  CHAPTER  5. RESULTS  AND  ANALYSIS  57  the sender waits for a reply message is 1.3 sees for a message of 32 bytes and 1.4 sees for a message of 512 bytes. These figures aren't very surprising considering that the replier sleeps for three seconds before it tries probing again. The delay introduced into the message transaction is certainly very high. There are a number of possible modifications to the model which could reduce this delay. The first modification is to buffer the reply message at the sender's host. This way the sender will immediately obtain the reply when it is ready. The second modification is to separate out the reply mechanism between a 1-1 and 1-many reply. The latter reply may not necessarily need the reliability that the 1-1 reply gives. The underlying network is such that packet lost rate and packet error rates are low. This alternative looses some of the advantages of using the 1-1 reply mechanism. The first disadvantage is that the replies are not guaranteed to arrive safely. The second disadvantage is that the reply messages cannot be arbitrary in size. If the reply message is arbitrarily long then the 1-many reply mechanism has to have some sort of protocol to segment and reassemble the message. It may even need to ensure that the component segments arrive at the destination site. These are all currently handled by the 1-1 reply mechanism. To do it in the 1-many reply mechanism is a duplication of effort.  CHAPTER  5.3  5. RESULTS  AND  ANALYSIS  58  Analysis of Implementation The figures reported here suggest that the group communication of T-Shoshin  needs further tuning and much closer examination. Results of the 1-1 I P C on the new implementation show that there is an inordinate amount of delay contributed by the implementation. The increase factor ranges 30-36 times over the original implementation. These increase, however, do not necessarily imply that the elapsed time for group I P C can be reduced by the same factor. It implies that the elapsed time may be reduced by an equivalent amount if implemented on the original machines. W i t h this in mind I've examined the machanism of group communication independent of the figures reported here to determine its efficiency. As a first step I've examined the process of sending out a group message to see what processing are needed: 1. the costs of sending messages to various processes to coordinate the message transaction. 2. cost of moving data from one address space to another or from one buffer to another. 3. actual transmission on the network. 4. processing overhead to operate the controller and acquire the ethernet. 5. processing overhead to run the protocol and build any necessary headers. 6. processing overhead to determine the local membership for.a group.  CHAPTER 5. RESULTS AND ANALYSIS  59  Acton has outline some of these same costs for 1-1 I P C and concluded that the largest component associated with sending a packet is that of local messages. Part of the poor performance of T-shoshin 1-1 remote I P C with respect to the V Kernel results from having the Communications Manager be a process separate from the kernel. This results in extra context switches and message passes that contribute significantly to the overhead of transmitting a message to a remote machine. These overheads are aggravated in the Group Communications Manager for a number of reasons. Firstly, for each packet that is sent and received on the network, there are two local I P C exchanges, one for the interrupt process that notifies the Ethernet Manager and one for the Ethernet Manager to notify the appropriate communications manager. It also involves a context switch from the kernel context to the communications manager context. With the current network configuration this amounts to six context switches and 12 local I P C exchanges on the sender's host. On top of that there is one context switching and two I P C exchanges on each of the receivers' host. Secondly, for each message that arrives on the remote machines, the message is first copied from the transmission buffer to the G C M ' s address space. Subsequently it is copied from the G C M ' s address space to each member's address space. Thirdly, at each remote site, the G C M has to send a request message to the local group database server to obtain information on local membership. This involves further local I P C exchange and full context switching between address spaces.  CHAPTER  5.  RESULTS  AND  ANALYSIS  6 0  A significant part of the overhead associated with a group communication in TShoshin is caused by having the Group ager, and the Group  Database  Server  Communications  Manager,  the Ethernet  Man-  as separate server processes. The coordination  amongst these three processes to complete a group message transaction entails a considerable amount of context switching, blocking and unblocking of processes, and local I P C exchanges. The first step toward fine tuning the system is to dispense with the Ethernet  Manager  so that one local I P C exchange can be eliminated each time a packet  arrives and each time a packet needs to be sent. This requires major restructuring in the design of the communications manager as well as the interface to the ethernet and interface to the kernel interrupt processes. The most straightforward solution is for the kernel interrupt process to read part of the incoming packet and determine which communications manager should get the packet. This approach has the undesirable effect of requiring that the kernel have some knowledge about some aspect of the communication process. It partly defeats the purpose of hiding all the details of communication inside the communications manager. A second approach would be to merge the Communications Communications  Manager  Manager  and  Group  as a single process and separate them through code modu-  larity. This approach is practical in terms of the functions these two servers carry out. They both do practically the same thing and they both share a number of common  CHAPTER 5. RESULTS AND ANALYSIS  61  resources. If this approach is taken, most of the work will center around rewriting the code so that the non-sharable data structures between managers are kept separate. The major drawback of this approach is that the definition of this new resulting communications manager becomes blurred. A third approach would be to port all of the processes into the kernel as kernel team processes. There will still be local I P C exchanges but they are between team processes and hence there will be no context switching. There will however still be the blocking and unblocking of team processes. It would be interesting to find out how such a porting effort can increase the performance both of the 1-1 remote I P C and the group IPCs. Furthermore it would be worthwhile to see if such increase in performance makes T-Shoshin's I P C comparable to V . This can definitely be considered as a potential area for further examination.  CHAPTER 5. RESULTS AND ANALYSIS  mainQ { gettime(start_time); for  ( i = 0;  i < 1 0 0 0 ; i++)  /* I s s u e IPC command */ gettime(end.time)  > ( a ) IPC p r o d u c e r  mainQ { while  (1)  send_id /* r e p l y  = brecany(  ... )  t o IPC i f r e q u i r e d  reply.id.hid  =  send_id.hid;  reply_id.lid  =  send_id.lid;  reply(reply_id,  */  ... )  } } (b) IPC consumer  Figure 5.1: IPC Timing Method  Chapter 6 Concluding Remarks 6.1  Thesis  Summary  A group communication model was developed which allows one process to simultaneously communicate with a number of other processes. This model fits very well into the client-server environment that T-Shoshin ideally supports. It is based on a broadcast datagram service model and provides minimum reliability in delivering the group message. Hence it leaves room for upper layers to develop their own reliability measures. The model is also flexible in that it provides two types of group send. One type allows nonblocking send in which the sender knows the number of members that got a delivery of the message. The sender can then obtain the replies at its own pace. The second type simply blocks the sender until the first reply comes back. The implementation of the model was kept simple and consistent. Once a group communication is initiated it is possible to track its behavior even under abnormal 63  CHAPTER  6.  CONCLUDING  REMARKS  64  conditions. Much attention and emphasis was given to provide it with a rich failure semantics. This way the sender would be able to tell what went wrong if the communication fails. The mechanism for delivering messages both to local and remote members were kept the same such that subsequent changes made to the mechanism would apply to both local and remote delivery. After porting T-Shoshin to the new SUNs it was found that there was a 30-35 fold increase in the elapsed time for sending messages.  This increase is expected to be  reflected in the figures for group communication as well. After the implementation performance measurement results were collected and analyzed. Elapsed time for G r o u p b s e n d range from 800-930 ms for groups of different sizes. Elapsed time for G r o u p R e q u e s t range from 2150-2250 ms. The figures reported were high partly due to the fact that they weren't collected at times when the network traffic is minimal. Based on the figures it was seen that a lot of fine tuning and improvements could be done to enhance the performance of the group IPC. In certain cases it involves changing parts of the model to dispense with certain features. Discussions were made as to what these implications will be. After some analysis it was found that a lot of the delay is due to having three separate server processes coordinate their functions through local IPC exchanges. The overhead is caused by both context switchings and actual IPC exchanges amongst the servers. Several suggestions were given as to how the performance could be improved. None seemed to be an ideal  CHAPTER 6.  CONCLUDING REMARKS  65  solution since each has a particular drawback. The "best" solution would probably be dictated by the needs and usage of the system. However there is ample room for making further studies as to which is the best solution to adopt for specific situations.  6.2  Future  Work  There were a number of issues mentioned earlier in the thesis which could be areas for further examination. One such issue involves examining the failure semantics of the group communication model as to its completeness relative to the various types of application that require group communication. Another issue is to determine whether certain features of the group model such as sending back host acknowledgements is worthwhile to be included in the model. A third issue is to consider various ways of improving the performance of both communications manager. This may involve studying whether it is worthwhile to move all the server processes (involved in processing a communication) inside the kernel.  Bibliography [ACTON85] D . Acton, Remote Interprocess Communication and its Performance in Team Shoshin, T R 85-16 Deparment of Computer Science, The University of British Columbia, Vancouver, B . C . , November 1985. [ACTON86] D . Acton, H . Wang and S. T . Vuong, " Experience with Interprocess Communication in the Distributed Operating System Team Shoshin", IEEE International Conference on Communications '86, pp. 1444-1448, June 1986. [BERG86]  E . J . Berglund, " A n Introduction to the V-System", IEEE Software, pp. 35-52, August 1986.  [CHER79a]  D . R . Cheriton, Multi-process Structuring and the Thoth Operating System, Ph.D Thesis, University of Waterloo, 1979.  [CHER84a]  D . R . Cheriton, W . Zwaenepoel, "One-to-Many Interprocess Communication in the V-System", ACMSigcomm '84 Symposium, 1984.  [CHER84b] D . R . Cheriton, "The V-Kernel: A Software Base of Distributed Systems", IEEE Software, pp. 19-42, April 1984. [CHER85]  D . R . Cheriton, W . Zwaenepoel, "Distributed Process Groups in the V Kernel", ACM Transactions on Computer Systems, pp. 77-107, May 1985.  [TOKU83a] Hideyuki Tokuda, Sanjay Radia, Eric Manning, "Shoshin OS: a Messagebased Operating System for a Distributed Software Testbed", Proceedings of the Sixteenth Annual Conference on System Sciences, 1983, pp. 329338, 1983. [TOKU83b] Hideyuki Tokuda, Eric Manning, " A n Interprocess Communications Model for a Distributed Software Testbed", Proceedings of SIGCOMM  66  BIBLIOGRAPHY  67  '88 Symposium on Communications Architectures and Protocols, pp. 205-212, March 1983. [TOKU84]  Hideyuki Tokuda, Shoshin: A Distributed Software Testbed, P h . D . Thesis, University of Waterloo, 1984.  [WANG86]  H . Y . Wang, Implementation of Team Shoshin: An exercise in porting and multiprocess structuring of the kernel, T R 86-8 Department of Computer Science, The University of British Columbia, Vancouver, B . C . , March 1986.  


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items