UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Group communication in distributed system T-Shoshin See, Helen Lim 1988

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


831-UBC_1988_A6_7 S43.pdf [ 3.31MB ]
JSON: 831-1.0051942.json
JSON-LD: 831-1.0051942-ld.json
RDF/XML (Pretty): 831-1.0051942-rdf.xml
RDF/JSON: 831-1.0051942-rdf.json
Turtle: 831-1.0051942-turtle.txt
N-Triples: 831-1.0051942-rdf-ntriples.txt
Original Record: 831-1.0051942-source.json
Full Text

Full Text

G R O U P C O M M U N I C A T I O N IN D I S T R I B U T E D S Y S T E M T-SHOSHIN By H E L E N L L M SEE B . S c , University of British Columbia, 1984 A THESIS S U B M I T T E D IN P A R T I A L F U L F I L L M E N T OF T H E R E Q U I R E M E N T S F O R T H E D E G R E E OF M A S T E R OF S C I E N C E in T H E F A C U L T Y OF G R A D U A T E STUDIES ( D E P A R T M E N T OF C O M P U T E R SCIENCE) We accept this thesis as conforming to the required standard T H E U N I V E R S I T Y OF BRITISH C O L U M B I A Apr i l 1988 © Helen L . See, 1988 In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Department of COMPUTE SC l£ S)C£ The University of British Columbia 1956 Main Mall Vancouver, Canada V6T 1Y3 Date A P R I L '2 , W DE-6G/81) A b s t r a c t T-Shoshiri is a distributed operating system for developing and testing distributed soft-ware. It provides a flexible and extensive one-to-one (1-1) interprocess communication (IPC) facility. However a number of applications, such as notification and query, require the use of group communication in which a process sends a message simultaneously to a group of processes. This thesis discusses how the 1-1 IPC facility on T-Shoshin was extended to provide a group interprocess communication service. The notion of groups will first be described, focusing on how they are referenced, formed and maintained. Finally a semantic model for group communication is presented, that is simple and general enough to accommodate most forms of applications that require group com-munication. Then an implementation model is discussed to show how the semantic model was implemented, followed by a performance evaluation of the implementation. i i C o n t e n t s Abstract ii List of Tables v List of Figures v i Acknowledgement v i i 1 Introduction 1 1.1 Background and Motivations 1 1.2 Thesis Outline 6 2 Process Groups 8 2.1 Process Group Definition 8 2.2 Group Management Operations 9 3 Semantic M o d e l of Group Communicat ion 13 3.1 Overview of Model 13 3.2 Group Communication Operations 16 3.3 Semantics for Failure and Abnormal Conditions 21 3.3.1 Semantics for Failure 21 3.3.2 Abnormal Conditions 23 4 Implementation 29 4.1 Ethernet Interface 29 4.2 Group Management 33 4.2.1 Group Identifiers 33 4.2.2 Group Membership 34 4.3 Group IPC Operations 36 ii i 4.3.1 Group Communications Manager, Ethernet Manager and Com-munications Manager 36 4.3.2 Group Send 40 4.3.3 Group Receive 41 4.3.4 Group Reply 43 4.3.5 Termination of a Group Communication 44 4.3.6 Certain Implementation Issues 46 5 Results and Analysis 47 5.1 Measurement Method and Environment 48 5.2 Performance Measurements 49 5.3 Analysis of Implementation 58 6 Concluding Remarks 63 6.1 Thesis Summary 63 6.2 Future Work 65 Bibl iography 66 iv L i s t o f T a b l e s 3.1 Set of Abort Status and Their Interpretation 22 5.1 Elapsed Time for 1-1 <bsend,brecany> pair 50 5.2 Elapsed Time for Group bsend - A l l Remote Members 51 5.3 Elapsed Time for Group bsend - A l l Local Members . . 52 5.4 Elapsed Time for Group bsend - Empty Group 55 5.5 Group request with One Remote Member 56 v L i s t o f F i g u r e s 1.1 Network Configuration of T-Shoshin 2 2.1 T-Shoshin Group Ids 9 3.1 Group Inter-Process Communication Model 14 3.2 How Abnormal Conditions are Handled under Various Stages of a Group Communication 28 4.1 Access to the Ethernet Resource by various Communications Managers 38 5.1 IPC Timing Method 62 v i Acknowledgement I would like to thank my supervisor, Dr. Son Vuong, for his patience, guidance and advice on this thesis and Dr. Sam Chanson for reading the final draft. Special thanks to a number of people whom I have consulted on more than a number of occasions. They are Barry Brachman, Audivox M a , Rick Sample, and Frank Pronk. Thanks also to many of my fellow graduate students with whom I have had a very pleasant learning experience together. v i i C h a p t e r 1 I n t r o d u c t i o n 1 . 1 B a c k g r o u n d a n d M o t i v a t i o n s The T-Shoshin distributed operating system was designed for developing, experi-menting, and testing distributed software. It was originally developed at the University of British Columbia and is currently running on a collection of Sun Workstations 1 in-terconnected by a 10 Mbps Ethernet 2 [ACTON85,WANG86] [Fig. 1.1]. T-Shoshin is an enhanced version of Shoshin[TOKU83a,TOKU84] which was developed at the University of Waterloo. T-Shoshin supports process abstraction whereby a process is a logical component of a distributed program possessing its own code segment, data segment, and stack space and executes independently of other processes. T-Shoshin also supports one-to-one (1-1) direct communication among these processes through a set of flexible interpro-1SUN Workstation is a trademark of Sun Microsystems Inc. 2Ethernet is a trademark of Xerox Corporation 1 CHAPTER 1. INTRODUCTION S U N VAX 780 S U N 2 10 Mbps Ethernet VAX 750 S U N Figure 1.1: Network Configuration of T-Shoshin cess communication (IPC) primitives[ACTON86j. By using the 1-1 IPC operations, any pair of processes can synchronize or communicate with each another via message passing. The only requirement is that they know each other's process ids in advance. The process abstraction and 1-1 IPC were all part of Tokuda's original design for Shoshin in building an environment for distributed computation[TOKU84]. Tokuda's definition of a distributed program was that of "a group of relatively small processes that often exhibit a dynamic pattern of process cooperation, including dynamic cre-ation and destruction of processes". Moreover, "cooperation among processes occurs without the use of shared variables" and "all processes are executed in mutually disjoint address spaces and communicate using message passing without any shared variables". In other words he intended to make the IPC facility to be the main if not the only CHAPTER 1. INTRODUCTION 3 means by which processes would pass information. The IPC facility of T-shoshin is one of its basic tools for achieving well-coordinated distributed computation. However, when a group of processes synchronize and communicate through IPC, a large amount of overhead is incurred through context switching and byte copying from one address space to another. T-Shoshin eliminated part of this problem by adding Cheriton's concept of team [CHER79a]. A team is defined as a group of processes that share the same code segment, global data structures and address space but each process having its own stack space for execution. In T-Shoshin, a group of processes can be collectively created as a team so that they are all in the same context and address space. This way context switching among team members is totally eliminated and byte copy overhead is reduced because byte copying within the same context is cheaper. It turns out that team members can also synchronize and communicate with each other via shared global variables. Thus in T-Shoshin, the team abstraction and the 1-1 IPC provide two simple and al-ternate means by which processes can communicate and work together to accomplish certain tasks or computation. Information is passed around through shared global data structures or through message passing. Synchronization can be achieved like-wise. Nonetheless, there are certain computations which require yet another type of interprocess communication. This type amounts to multiple-send, that is being able to send a message simultaneously to several processes with a single send operation. CHAPTER 1. INTRODUCTION 4 There are a number of higher level applications which would require such a basic facil-ity for achieving distributed computation. Such applications include querying a group of servers, notifying a group of processes of computational updates, or synchronizing among a number of processes. As an example, a distributed system may have a group of printer servers providing printing service. To determine which printers are available for printing a job, a client process may want to check with all printer servers to deter-mine which one is most suitable for taking on the job. As another example, consider the scenario where there are x number of processes that have to synchronize with each other and any one of them can be the trigger for starting the synchronization. Wi th the existing facility this has to be done indirectly by either having all the processes be team members and wait on some global variable, or having them know each other's process id a priori and do multiple 1-1 sends. If a group communication facility is in place it can be done with a single multiple-send. The sending process would issue a single multiple-send operation to a previously defined group of processes upon which the system would automatically deliver the message to each process. There are a number of reasons why a multiple-send or group send should not be simply realized by a set of 1-1 IPC operations. First, the sender must know the ids of all the members of the group. This is impractical for a lot of application users. For example, to print a job a user needs to have at hand the identities of all the printer servers available for printing its job. If done with group communication, a set of static CHAPTER 1. INTRODUCTION 5 well-known group ids could be allocated to allow reference to certain system servers as a logical entity via a group id. Second, not all the receivers have an equal chance of receiving the message first. It is always biased by the process order in which the sender sends out the message. Third, the message is not sent out concurrently to all the members. It can only be done serially. Finally, when the group contains members across several machines, more network traffic is incurred when a single multiple-send is replaced by a series of 1-1 sends. The lower level implementation uses broadcasting to deliver the message to all the machines on the network. In cases where there is a widespread of members across the network, broadcasting the message once onto the network rather than singly to each host can significantly reduce the amount of network traffic. There can be further reduction in overhead if the lower implementation uses multicast hardware. Based on all the above reasons, equipping T-Shoshin with a group interprocess communication facility will strongly enhance its usage as an operating system that supports distributed computation. The implementation of a group communication facility on T-Shoshin will open up a number of areas where further work can be done. First, there can be closer examination of the client-server model which T-Shoshin ideally supports. With group operations and group IPCs in place a server can easily make its services available by joining a well-known server group. It can just as easily withdraw its services by leaving the group. On the other hand clients need not know CHAPTER 1. INTRODUCTION 6 in advance the identities of the server (s). Neither do they need to keep up-to-date as to the status of the server (s). A l l they need to do is to send their requests to the well-known server group id and the messages will automatically be delivered to all the server instances that are ready to provide the service. Second, certain system functions such as load balancing and process migration become easier to implement. Each machine can broadcast or multicast its status to all the machines on the network or to a specific set of machines. The choice of grouping is done entirely through the group maintenance operations. Third, application programs that required a group communication capability can now be much readily realized. These could range from distributed programs that need to pass computational updates to each other and to a database program that will update its database only when all the components have simultaneously committed to a transaction operation. There is a further advantage from the programmer's point of view. Using the group send wil l require less coding and maintenance effort since a sequence of 1-1 sends is replaced with a single multiple-send. 1 . 2 T h e s i s O u t l i n e Excluding the introduction, the thesis consists of five chapters as follows. Chapter 2 provides a description of process groups, group operations, and group management. Chapter 3 presents the semantic model of group communication and also defines the CHAPTER 1. INTRODUCTION 7 set of IPG primitives used for group communication. Chapter 4 discusses how group management and group communication was implemented. In Chapter 5 performance measurements and analysis of the implementation are presented. Finally Chapter 6 provides some concluding remarks and offers suggestions for future work. C h a p t e r 2 P r o c e s s G r o u p s Since the goal of the thesis is to provide a group communication facility on T-Shoshin, it is important to first have a clear notion of process groups. In this chapter, the definition and representation of process groups will be provided, followed by a description of group management operations. 2 . 1 P r o c e s s G r o u p D e f i n i t i o n Groups in T-Shoshin are very similar to V process oroups[CHER85]. A group is a set of zero or more processes that are collectively referenced by a groupJd. A groupJd is similar in syntax to T-Shoshin process_ids but different in semantics. Each group J d has two parts, the first part of which refers to a class of related groups and the second part to a specific group within that class. For instance, one might have a set of statically allocated group_ids all belonging to the class system [Fig. 2.1]. Processes within a group may span a machine or any number of machines. A l l the members 8 CHAPTER 2. PROCESS GROUPS 9 of a group are deemed to be equal and every member can potentially receive all the messages sent to the group. Parts of a Group Id c l a s s g r oup • i group or p rocess id bit Examp les of group ids: s y s t e m p r in te r - se rve r s s y s t em : f i l e - se rve r s s y s t em : d i s k - se r ve r s U B C : grad-students U B C : facu l t y -members U B C : staff Figure 2.1: T-Shoshin Group Ids 2 . 2 G r o u p M a n a g e m e n t O p e r a t i o n s A set of group management operations was set up such that group manipulation is simple and has minimal restrictions on grouping rules. It has some degree of control over who joins the group, and hence who receives messages sent to the group. It however, does not have control over who send messages to the group. Given the objective as such, the rules are: • any process can create a group and supply it with a control password • any process can join, leave or destroy a group provided it knows the control password CHAPTER 2. PROCESS GROUPS 10 • there is no fixed limit on the size of a group • members of a group will receive all messages sent to the group • any process can send messages to a group. There is no control over which processes can send messages to a group. The following operations are used for processes to create, join, leave, or destroy groups. A new group is created by group J d = create_grp (control password). The system looks for a unique and unused groupJd and returns it as the new groupjd. A control password must be supplied as a means of control over which processes can join the group or destroy the group. The control password is a string of characters up to 64 characters long. If a password is deemed unnecessary it can be bypassed by supplying a null string as password. A newly created group has zero members in it. After a group has been created any process can try to join by issuing the status = join^grp (groupjd, password) primitive. The password parameter is used for matching the control password. Sta-tus returned by join_grp can be O K , GROUP_DOES_NOT_EXIST, W R O N G - P A S S W O R D , or some other error code. When a process joins a group it will start having messages sent to the group delivered to it. After a process joins the group, it can leave the group by issuing another primitive, status = leave_grp (group-id, password). CHAPTER 2. PROCESS GROUPS 11 If the process leaving the group is already a member of groupJd and the pass-word parameter matches the control password, then it is removed as a member of the group. Status returned by leave^grp can be O K , GROUP_DOESJMOT_EXIST, W R O N G JP ASS W O R D , N O T _ M E M B E R , or some other error code. Once a process leaves a particular group it stops receiving messages sent to that group delivered to it. Note that a process J d wasn't supplied as a parameter to both j o i n and leave prim-itives. This prevents processes from making other processes join a group or removing other processes from a group. A group is not automatically destroyed when it becomes empty. It has to be explicitly removed by the destroy primitive. status = destroy_grp (groupJd, password). As before the supplied password must match the control password. When the destroy primitive is invoked all the members of the group are removed and the group J d is deallocated. The return status will be O K is everything went fine. Otherwise it can say GROUPJDOESJMOT_EXIST or some other error code. This primitive has not been implemented yet. There are a number of criticisms that can be legitimately raised with regards to the way groups are formed and managed. There is no protection against bogus senders. Groups may become too large. There is also no elaboration on recovery of group information in the event of a crash or mishap. It is dangerous in to allow any process CHAPTER 2. PROCESS GROUPS 12 to destroy a group as long as it is able to acquire the control password. The focus of this thesis is to concentrate effort in building a group communication facility in T-Shoshin. The issues of group membership can be addressed in the future when it is shown that group communication is viable and effective in T-Shoshin. At the moment we only need minimal rules to get groupings to work. C h a p t e r 3 S e m a n t i c M o d e l o f G r o u p C o m m u n i c a t i o n 3 . 1 O v e r v i e w o f M o d e l Group communication in T-Shoshin is done through message-passing. When a pro-cess initiates a group send it is simultaneously in communication with all the members in the group as illustrated in Figure 3.1. A sender starts up a new group communica-tion by addressing a message to a group. It then blocks until a copy of the message is delivered to the "message queue" of all reachable members of the group. Reach-able usually means those existing members whose host is still alive and resides in the network partition as the sender at the time the message went out for delivery. The total count of the reachable members is returned to the sender as it is unblocked. The sender must set the number of replies it wants back. As replies come in, they wil l be queued up to the number of replies wanted. The rest are discarded to avoid taking up unnecessary buffer space. The sender can obtain these replies one by one through a 13 CHAPTER 3. SEMANTIC MODEL OF GROUP COMMUNICATION 14 new primitive provided. After the group messages get queued up in the message queue, group members receive and reply to group messages as if they were 1-1 messages. Group communication was done through message-passing because it seemed very compatible to extend the 1-1 IPC. The set of 1-1 IPC primitives is almost com-plete to support most types of applications. The only primitive that is missing is the non-blocking send. For certain argued reasons, the non-blocking send wasn't implemented[ACTON85]. The extension of 1-1 IPC to include the semantics for group communication does not seem awkwardly contrived or unnatural. In fact it seems to fit appropriately. Instead of sending a message to one process, the same IPC operation can be used to send to a group. Likewise, instead of waiting a message from a specific process, the receiver can wait for a message directed to a specific group that it is a Figure 3.1: Group Inter-Process Communication Model CHAPTER 3. SEMANTIC MODEL OF GROUP COMMUNICATION 15 member of. When a group communication starts up, an "attempt" is made by the system to deliver the message to every member in the group. Drawing an analogy to datagram protocol, it is a best effort attempt with no guarantee of success. In other words there is minimal reliability. For lack of better terminology, it is termed as a "broadcast datagram". In a client-server paradigm, such degree of reliability may be sufficient. Consider sending a job out to be printed. The client is really interested in communicat-ing only with those printer servers that are available, that is, those that are alive and those that can be reached. In many applications not knowing the identity and not be-ing able to have contact with the remaining printer members is acceptable since those printer servers can't do the job anyways. Moreover the degree and kind of reliability is usually application dependent and best left to upper communication and/or application layers to determine what suitable reliability measures to incorporate[CHER85]. In contrast to V send, which is 1-reliable, the senders get unblocked when a copy of the message is delivered to the "message queue" of every reachable member in the group. At the time of unblocking, the count of all reachable members is returned. This way the sender can have the flexibility of performing other computations and block to get a reply when it is available. The count returned by the group send is not exact but is very useful in providing an estimate for the minimum number of replies that can potentially come back. The count is an indication of all those members which got CHAPTER 3. SEMANTIC MODEL OF GROUP COMMUNICATION 16 a delivery of the message and whose acknowledgement of that delivery reached the sender. There may be other members who got the message but their acknowledgement of the delivery never reached the sender (see discussion in next section for the definition of delivered). With this count, the sender can set an upper bound to the number of replies it wants back. This saves the sender from waiting needlessly for replies which may not be available. 3 . 2 G r o u p C o m m u n i c a t i o n O p e r a t i o n s In designing the IPC primitives for group communication, care was taken to ensure that they are simple, clear and have straightforward and intuitive semantics. The approach taken was to redefine the 1-1 IPC primitives so that the same semantics can be used for group communication rather than introducing an entirely new set of primitives. The primitives for group communication consist of the ones used for direct 1-1 IPC plus two additional ones. The semantics of the 1-1 IPC have remained exactly the same. Only some syntactic changes were made to make it consistent for use with group communication. In the discussion that follows, a distinction is made between a message that is "delivered" and one that is "received". A message that is delivered means that it has been "dropped into the receiver's mailbox". In our implementation, this means queued up in the receiver's message queue. A message that is received by a member means that the member has "executed a receive primitive to obtain the CHAPTER 3. SEMANTIC MODEL OF GROUP COMMUNICATION 17 message". The following is a list of the IPC primitives when used for group communication. cnt = bsend(togid,&msg,m,mtag) (pid,nr) = brec(fromgid,&buf,n) (pid,nr) = nrec(fromgid,&buf,n,mtag) (pid,nr) = brecany(&buf,n,mtag) (pid,nr) = nrecany(&buf,n,mtag) ns = reply (pid,&msg,m) (pid,nr) = request (togid,&msg,m,&buf,n,mtag) np = totreply(gid,p) (pid,nr) = getreply(gid,&buf,n,timeout) "togid" stands for the destination group address while "fromgid" stands for the source group address, "gid" is group J d and usually refers to the current group com-munication taking place, "pid" is process j d . "&msg" and "&buf" specify, respectively, the addresses of the message to be sent and the buffer area to place a received mes-sage in. The parameter "m" indicates the size, in bytes, of the message to send and "n" specifies the size of the buffer area for the received message, "mtag" is used for matching selective receives and sends, "p" is used for setting the number of replies the sender wants back and "timeout" is used for setting the timeout value in waiting for a reply to be received. CHAPTER 3. SEMANTIC MODEL OF GROUP COMMUNICATION 18 A l l the IPC primitives are defined as functions and they return some value relevant to the function, "nr" refers to the actual number of bytes received. If "nr" is zero or less than zero then it becomes a status indication. The same interpretation goes for "ns", the actual number of bytes sent, "cnt" refers to the minimum number of members that got the message when its value is nonnegative. Otherwise it becomes a status indication, "np" is the maximum number of replies the user can get back. If it's zero then the group communication terminates. If it's negative then it is a status indication. There is a restriction set on the maximum length of group messages. This restriction is imposed for reasons of efficiency. The maximum data size that can be sent is 1418 bytes = 15H(standard packet) - lA(Ethernet header) - 2Q(IP header) - 8(UDP header) - 54(Shoshin header). Having a long message sent to a group results in the message using up a lot of buffer resource at each host. Bsend is used for sending out the group message to a group by referencing the groups's id. The request primitive can also be used to send a group message but it blocks until a reply comes back and then the communication terminates. A member process can receive a message by executing one of the four receive primitives. Brecany and nrecany are blocking and non-blocking receive any and used to receive messages from any processes sending to any group the receiver is a member of. Brec and nrec are more explicit. Brec is a blocking receive for a specific group while nrec CHAPTER 3. SEMANTIC MODEL OF GROUP COMMUNICATION 19 is a non-blocking receive for a specific group. A l l of the receive functions return the process J d of the sender process. Reply generates a reply back to the sender process. The reply for a group send/request is different from a 1-1 reply in that the member gets blocked until the sender executes a getreply (covered below) and is ready to accept this particular reply. This implies two things. First the member will be blocked until the sender is ready to receive its reply. Second the member can get blocked forever if the sender never issues a getreply. In the latter case, the member will unblock if and when the group communication terminates. The reply operation wil l fail with a status indicating that the group communication has terminated. Two new primitives were added for managing replies. The first is totreply which sets the total number of replies expected back. By default the total number of replies that come in and are queued is initially set to zero. It remains zero until set (or reset) by totreply. To end a group communication immediately totreply can be used with a value of 0. This allows the user to send a message to all the members without expecting any reply back, although there is no guarantee that all members will receive the message. The other new primitive is getreply which gets back a reply for the sender one by one as needed. It allows a timeout value to be set to prevent the sender from blocking forever. A group communication is terminated in one of four ways. First, when the sender explicitly sets the totreply value to zero. Second, when all the expected replies for the CHAPTER 3. SEMANTIC MODEL OF GROUP COMMUNICATION 20 group communication have been satisfied. Third, when the sender dies. This is equiv-alent to the situation where the host of the sender crashes. Lastly, a communication is implicitly terminated when it is superceded by another one. Superceded means that the sender initiated a new group communication to exactly the same group while the previous one was still going on. When a communication terminates, the members can no longer receive the messages for that communication. Neither can they reply to it. If they do the IPC will fail returning with it a status indicating that the communication has terminated. A process can initiate more than one group communication at a time but these outstanding communications has to be addressed to different groups. They are kept separate by the group jds . If a sender initiates a new group communication to the same group when the previous one hasn't terminated, the previous one wil l be overridden. The new message will be delivered to all the members but kept distinct from the previous one. However for efficiency reasons the old message is not removed from the member's mailbox. The member will get a group communication termination status instead of the message when it tries to receive the old message. Replies made to the old messages are detected and they fail as well. CHAPTER 3. SEMANTIC MODEL OF GROUP COMMUNICATION 21 3 . 3 S e m a n t i c s f o r F a i l u r e a n d A b n o r m a l C o n d i t i o n s The previous section described what would transpire in a group send if everything went smoothly in the communication process. However there are always occasions where something would go wrong. A significant part of the semantics goes into defining what would happen under these abnormal conditions and the kind of status indication the primitives return. They are just as important as the semantics defined for normal conditions as the application users have to know and anticipate them. This is more so in the current implementation because the group communication only provides minimum reliability and it is up to the upper layers to build the required amount of reliability that is desired. 3 . 3 . 1 S e m a n t i c s f o r F a i l u r e The failure semantics are used by upper layers for identifying as closely as pos-sible the nature of the problem so that they may devise an appropriate mechanism for handling the failure. A communication abort concept was used by Tokuda and Manning[TOKU83b] to simplify the detection of unsuccessful or illegal communication and to avoid communication deadlocks at the same time. Whenever one of these con-ditions are detected, the communication is aborted and the IPC returns a negative value indicating the cause of the abort. The original set of status was extended to include new ones required to handle CHAPTER 3. SEMANTIC MODEL OF GROUP COMMUNICATION 22 certain status specific only for group communication. As well, some of the original status changes slightly in meaning when interpreted for group communication. The set of abort status that can be returned are shown in Table 3.1. N O . C O M M : group communication has terminated A D D R _ E R R O R : illegal message address H A R D W A R E - F A I L : hardware or network failure T I M E D _ O U T : IPC timed out N O V E M B E R S : empty group C M _ E R R O R : unknown error H O S T _ D E A D : host at other end is dead P R O O D E A D : process at other end is dead N O _ B U F F : no buffer C M - D E A D L O C K : communication deadlock between sender and receiver C M - M I S M A T C H : mismatch of IPC between sender and receiver Table 3.1: Set of Abort Status and Their Interpretation The first status is N O . C O M M and it is used as follows. When a group commu-nication terminates (either normally or abnormally) not all the involved process may be informed about it. Subsequently a process may want to receive a message, send or receive a reply associated with this terminated communication. Under these circum-stances, the N O . C O M M status is returned. When an address or bus error as a result of supplying a wrong address occur, A D D R - E R R O R is returned. H A R D W A R E - F A I L signals some sort of failure that can be determined to stem from the hardware or net-work failure. T I M E D _ O U T is used to indicate that an IPC has timed out. This is used CHAPTER 3. SEMANTIC MODEL OF GROUP COMMUNICATION 23 by get_reply to avoid being blocked forever. When a group send/request is made to a group that is empty, the NOJVTEMBERS status is returned. This status is made distinct from a status of 0 to emphasize the fact that the group communication is being aborted. C M - E R R O R indicates unknown errors which the communications manager cannot figure out. It is usually a result of an inappropriate IPC being issued. The other five statuses enumerated are used only in the context of 1-1 interprocess communication. Both the H O S T J D E A D and P R O C J D E A D condition are determined through probe timeouts. The timeout value is a system parameter that can be tuned for optimum performance. The N O _ B U F F status indicates that either the sender or receiver's host ran out of buffers to contain system parameters and/or messages. C M _ D E A D L O C K attempts to inform the invoker that a deadlock was detected between the sender and receiver. Both process' IPC are aborted. C M _ M I S M A T C H on the other hand detects a reply was issued to another process which does not have an outstanding communication. It is interesting to note that further studies could be done on the failure semantics to determine whether they are "rich" enough for most if not all types of application. 3 . 3 . 2 A b n o r m a l C o n d i t i o n s In the current implementation some of abnormal conditions considered are: • sender or member dies • a host in the network crashes 1 CHAPTER 3. SEMANTIC MODEL OF GROUP COMMUNICATION 24 • the network is partitioned [Fig. 3.2] shows what happens to the sender and the members under various abnor-mal conditions at various stages of the group send. The first stage is "send initiated". This refers to the stage where the sender has just initiated a group send and it is being broadcasted onto the network. The second stage is "message delivered". This refers approximately to the time where the remote hosts pick up the broadcast message and queues it up to each member's message queue. The third stage is "message received". At this stage some of the members have already picked up the message by executing a receive primitive. The next stage is "member replies". Some of the members start replying to the sender. The fifth and final stage is "communication being terminated". To track what is going on in a group communication is not simple because there are several open lines of communication between the sender and its members. However there are a number of general rules which prevail. Firstly, the group communication is in the form of a broadcast datagram. When the group message is broadcasted through the network there is no telling who will get it and who won't. It all depends on what the conditions of the entire system is at the time of the group send. In effect not all the members will get a delivery of the message and out of those that got the message not all of them will be counted in. Secondly, when a member executes a receive primitive to accept a group message, the acceptance will be successful only if the group communication is still ongoing. The same thing goes for a group reply. CHAPTER 3. SEMANTIC MODEL OF GROUP COMMUNICATION 25 Thirdly, when a message is being terminated the system broadcasts a T E R M I N A T E message to all the members on the network. This T E R M I N A T E message may not be picked up by certain members. However eventually these members wil l be properly informed in one of several ways. Subsequently they could try to receive the message or reply to it and find that it has already terminated. Another way that a member can tell that a group communication has already terminated is through probing. The member wil l continually probe the sender to determine whether it is alive and whether the communication is still going on. If the group communication has already terminated then the probe wil l fail after a maximum number of tries. (The probe is done in the background by the system on behalf of the user.) Referring to the figure, in the first stage of the group communication a group send is initiated. When the sender dies or its host crashes, it is unpredictable how far the group communication will proceed. Some or all of the members may get a delivery of the message and receive it as well. But this is as far as it goes. None of the member wil l be able to reply to it. It will terminate eventually because the member wil l probe and find out that the sender or the sender's host is dead. A number of abnormal conditions may cause a member not to be counted in at the time of the group send. These conditions include the member dying off, the member's host crashing or the member's host being temporarily severed from the network or the member's host residing in a partition that doesn't contain the sender's host. CHAPTER 3. SEMANTIC MODEL OF GROUP COMMUNICATION 26 The next stage of the group communication is identified as the member getting a copy of the message in its message queue. A t this point the member can potentially receive the message as long as the group communication is deemed to be still alive when the member execute the receive primitive. If conditions occur that cause the member to perceive that the communication terminated then the member will never get a chance to receive the message. Similarly if the member dies off or its host crashed, the sender will never be able to obtain back a reply. The next stage is naturally the member executing a receive primitive to accept the message. A t this stage the member is allowed to reply and the hence most of the considerations regarding what may occur centers around the reply. If the sender becomes permanently unreachable, the reply simply fails. On the other hand if all the member becomes permanently unreachable getjreply fails. If the abnormal condition is temporary such as short severance from the network the reply may or may not be successful. The next stage is when the member replies. Under most abnormal conditions it is bound to fail. It is successful only if the condition is temporary like a host being temporarily severed from the network. Lastly, when the group communication is terminated a broadcast T E R M I N A T E message is sent out. If everybody gets the termination message then the communi-cation terminates consistently. Otherwise it could be inconsistent in the sense that CHAPTER 3. SEMANTIC MODEL OF GROUP COMMUNICATION 27 some of the members may not know about it. This problem is solved by probes from those members. One way or the other the communication will eventually terminate consistently. CHAPTER 3. SEMANTIC MODEL OF GROUP COMMUNICATION 28 •STAGES OF GROUP COMMUNICATION • ABNORMAL CONDITIONS send init iated 1 message delivered message received member replies communication being terminated sender dies group comm eventually terminates, unpredictable how far it aoes member may receive the message, group comm eventually terminates member can't reply to message, group comm eventually terminates reply fails in the middle, group comm eventually terminates OK sender's host crashes permanently or for a very 'long' time group comm eventually terminates, unpredictable how far it goes member may receive the message, group comm eventually terminates member can't reply to message, group comm eventually terminates reply fails in the middle, group comm eventually terminates group comm maybe in an inconsistent state but eventually terminates member dies member not counted in sender never gets back a reply sender never gets back a reply reply fails in the middle OK member's host crashes permanently or for a very 'long' time member not counted in sender never gets back a reply sender never gets back a reply reply fails in the middle OK sender's host temporarily severed from the network message may be recroad-casted or group comm terminated group comm on remote host may terminate member's reply may fai l reply may or maynot reach the sender group comm maybe in an inconsistent state but eventually terminates member's host tempora-rily severed from the network some members maynot be included in the count group comm on remote host may terminate member's reply may fai l reply may or maynot reach the sender group comm maybe in an inconsistent state but eventually terminates network partitioned only those members on same parti-tion as sender gets counted group comm not on sender's partition ma) terminate can't obtain reply from members who are on different partit ion reply from members not on sender's partition are missed group comm maybe in an inconsistent state but eventually terminates Figure 3.2: How Abnormal Conditions are Handled under Various Stages of a Group Communication C h a p t e r 4 I m p l e m e n t a t i o n The implementation of group communication on T-Shoshin was on a collection of Sun 2/50s. These workstations had a different Ethernet controller from the older Sun 2 machines. The first section describes the Int e l 82586 Ethernet controller device driver. Subsequent sections discuss how the group implementation was done. 4 . 1 E t h e r n e t I n t e r f a c e T-Shoshin was reinterfaced with another Ethernet controller to run on a collection of Sun 2/50s which were available for implementing group communication. The previ-ous version of T-Shoshin ran on Sun 2/l20s which have a 3 C O M Ethernet Controller while the newer models, Sun 2/50s have an Int e l 82586 1 Ethernet controller. The controllers were substantially different from each other. Between the two controllers, the 3 C O M had simpler features and was easier to op-1Int el is a registered trademark of Intel Corporation 29 CHAPTER 4. IMPLEMENTATION 30 erate. It has two fixed-length receive buffers and one fixed-length transmit buffer. Byte transfer to and from the Ethernet were directly taken out of these buffers by the 3 C O M controller. Coupled with these buffers are a number of registers used for controlling the Ethernet by setting interrupt modes and for reading the status of a frame trans-mission/reception. For each frame that comes in or goes out, the C P U is immediately notified through an interrupt signal. On the reception of a frame (or frames) the C P U must place high priority in processing the frame (or frames) before they get overwritten by a subsequent frame arrival or before an incoming frame is lost because the 3 C O M was suspended in a " N O - R E C E I V E " mode. The buffers and registers are all located in fixed system memory addresses. To avoid copying the contents of these buffers and registers from system address space to the Communications Manager's address space every time a transmission or reception took place, memory blocks within the Commu-nications Manager's address space were allocated and memory-mapped into the system address space at initialization time. This way the Communications Manager can access the buffers and registers through its own address space while the 3 C O M controller can likewise access it through the system address space without any need for copying. The Int e l 82586 is a more intelligent Ethernet controller having far more features than the 3 C O M . Although these features made the controller more flexible and pow-erful, they also made it more complicated and more involved to operate. The In t e l 82586 can support several types of network configuration through parameterization. CHAPTER 4. IMPLEMENTATION 31 The default configuration at power-up is the I E E E 802.3 10base5 standard. Even with the default configuration in place, the Int e l 82586 still has to be programmed to make it work properly with the C P U . Programming the Int e l 82586 is similar to programming a specialized hardware using some form of assembly language. The device driver written for the In t e l 82586 must initialize the controller by establishing a predefined interface between the host C P U and the controller. This interface is termed "shared memory structure" and it contains different kinds of command blocks and buffer descriptions that the host C P U would use to operate the controller and that the controller would use to inform the C P U of its operation status. Programming the Int e l 82586 consists of setting up chained command blocks in-side the "shared memory structure" which contain command codes and parameters. These command blocks are then passed to the Int e l 82586 for execution. Whenever the In t e l finishes execution it would generate an interrupt and pass back status and/or received frames to the C P U . A l l these status and frames are likewise contained within the "shared memory structure". One of the commands is to start the controller for transmission of frames onto the Ethernet. The device driver has to tell the controller the source address of the transmission frames. In particular it must pass system ad-dresses to the controller. The scheme used in the original version of T-Shoshin for allocating buffers inside the Communications Manager and memory-mapping to the system address space is also used here. There are two things to note though. First CHAPTER 4. IMPLEMENTATION 32 the system address space used for the buffers is not fixed and can therefore be chosen to be anywhere within the system address space that is not utilized for anything else. Second, the device driver must keep a mapping of buffer address that map from the Communications Manager to the system address. One of the more useful feature that the In t e l 82586 has is to allow for a variable number of receive buffers to be set up by the host C P U . In addition each of these "buffers" used for storing a single frame is made up of a chain of linked buffers. At initialization time the device driver allocates a set of arbitrarily-sized buffers and chain them together onto a free list. Whenever a frame arrives the In t e l 82586 would grab an appropriate number of buffers off the free list to store the frame. Afterward the Int e l 82586 would interrupt the C P U . The controller can also be set up to interrupt after every frame arrival or after x number of frame arrivals. This feature has a number of advantages. First the host C P U can allocate more receive buffers as necessary to minimize loosing frames during heavy traffic. Also by allocating more receive buffers there is less chance of frames being overwritten when the C P U doesn't respond fast enough. Second, space inefficiency can be reduced by allocating smaller buffers. Third, the number of receive buffers allocated can be catered to specific system configuration and load. The Int e l controller can function a lot on its own. It can manage transmission and reception, relink buffers, gather statistics, and arbitrate collision all without the CHAPTER 4. IMPLEMENTATION 33 intervention of the host C P U . In other words, a large range of operations goes on within the Int e l 82586 beyond the control of the C P U . Sometimes it becomes difficult to ascertain the nature of certain problems when something is amiss. As an example, at one point in the implementation certain frames were lost in the Ethernet and others weren't. The controller was reporting that everything is fine. It turned out that frames less than the minimum configuration length were being dropped by the controller automatically without any indication. Only through several trial runs using all sorts of packets and a very careful reading of the manual was the problem determined. The complexity of operating the controller as well as potential problems such as the one reported above encourages the use of only basic Int e l 82586 features. 4 . 2 G r o u p M a n a g e m e n t 4 . 2 . 1 G r o u p I d e n t i f i e r s A group J d is a 32 bit unique identifier that has a syntax similar to process jds . It is differentiated from processJd in that the group bit is set to 0 rather than 1. Like the V group Jds, the group bit allows the kernel and communications manager to distinguish between process and group Jds. It also ensures that the process and group Jds ' space are disjoint. A group j d has two parts. The first part designates the class, that is a category of groups which are related in some fashion. For example system servers such as printers servers, file servers and disk servers can all belong to the class "system". CHAPTER 4. IMPLEMENTATION 34 The second part designates a specific group within that class. There is a single server process in the system (group-id server) which looks after allocating system-wide unique group Jds. Requests for creating new groups would be directed to the group-id server which would come up with a new group j d and associate it with the control password for the group. Any host in the system which is not familiar with a group J d and needs information about it would inquire from the group-id server. This central group-id server dispenses the need for a distributed agreement protocol to come up with a unique and unused group j d . When a group is destroyed, its group j d is deallocated by the group-id server. The group-id server maintains a table of <group j d , control password> pair. A group J d is considered allocated if it is in the table and not allocated if it is not listed in the table. 4 . 2 . 2 G r o u p M e m b e r s h i p The membership information of a group is not centrally stored but distributed across all the hosts which has members belonging to the group. Each host maintains information only about local processes and these information are stored in a table of <group J d , process Jd> pair. A process p is a member of the group g if there is a <p,g> pair in the group database of the local host. When a process wishes to join a group, the database managed by the local group server is first examined to see if there are existing members in the group already. If there is then the new member is added CHAPTER 4. IMPLEMENTATION 35 in. If not then the local group server obtains information from the group-id server. When a member p leaves a group g, then the <p,g> pair is deleted from the local group database. A message is delivered only to those processes which are members of the group at the time the group communication was initiated. Members which join the group after a group message was sent may or may not receive the message. This section describes how the destroy group could be implemented. Since all the information about a group is scattered across the system, it is very difficult to make sure that all the membership information regarding a group is removed when a group gets destroyed. When a request for group destroy is made, it is directed to the group-id server. The group-id server broadcasts that message to every local group server to delete its local information. The group-id server then waits for an acknowledgement from all the local group servers. If all the acknowledgements come back after the first broadcast then the operation is successful. Otherwise the broadcast is repeated. After a maximum number of repeats it is assumed that either a network partition has occurred or certain hosts have crashed. For a host crash it is assumed that the host can only be crash-fail. When a crash host restarts, it looses all its previous information. However fragments of a group may still exist if one of two conditions arise while a group destruction is taking place. First the network partitions. Second some host temporarily cannot send or receive messages. In either of these conditions occur there CHAPTER 4. IMPLEMENTATION 36 may be some hosts that fail to receive the group destroy message and hence membership information kept by those hosts are not deleted. In other words the group id remains allocated. The implication of having group fragments remain in existence is seen to be tolerable. Processes which wishes to join the group won't be able to do so unless they belong to one of those hosts that still keep fragments of the group. Messages which are send to this defunct group will be received by the remaining members but not by the other ones. A problem comes up when the group j d server wishes to allocate the id of this group as a new id. A means to solve this problem is to broadcast another destroy message just before the allocation to further ensure that no duplication of group id exists. For the same group J d to be allocated to different groups it implies that the deallocation and reallocation must have occurred when a host was temporarily severed from the network or when the network is partitioned. There is a chance of this occurring but the chance is minimal. 4 . 3 G r o u p I P C O p e r a t i o n s 4 . 3 . 1 G r o u p C o m m u n i c a t i o n s M a n a g e r , E t h e r n e t M a n a g e r a n d C o m m u n i c a t i o n s M a n a g e r The group IPC operations are implemented through a server process called the Group Communications Manager. The Group Communications Manager (GCM) is a distributed server that embodies all the tasks necessary to carry out the group IPC CHAPTER 4. IMPLEMENTATION 37 operations. Each host's G C M is responsible for all the group communications that any of its local processes are involved in. It keeps track of each outstanding group commu-nication that takes place from the time that a group send has been initiated and the message broadcasted, to the time that replies get back to the sender. It obtains infor-mation from the local group server regarding membership information that it needs. It also shares the network interface to the ethernet and some communication state structures with the Communications Manager. A l l the local G C M s uses a mutually agreed on protocol to coordinate and manage the group IPC. In the original version of T-Shoshin it is the Communications Manager (CM) that manages all the remote 1-1 IPC as well as the ethernet resource. A l l the packets that needs to be transmitted and received to and from the network are handled solely by the C M . As such all the interrupts from the ethernet are also sent from a kernel interrupt process to the C M . In extending T-Shoshin to include a group IPC facility the G C M also needs to have access to the ethernet resource and be informed by the kernel interrupt process that packets have finished transmitting or that packets have arrived. To accommodate the needs of the G C M another manager was created to mediate the access of two communi-cations manager for the same resource. This manager is called the Ethernet Manager (EM) [Fig. 4.1]. The E M gets all the interrupts from the kernel interrupt process through message passing. It then separates out transmit from receive interrupts. If CHAPTER 4. IMPLEMENTATION 38 the interrupt indicates that a packet's transmission has been finished the E M accesses the ethernet buffers and frees them. It then informs the appropriate communications manager by sending it a message. If the interrupt indicates that a packet has just been received the E M determines which communications manager should be handling the packet and informs that manager of the arrival through a message send. Subsequently when the message is picked by the C M or G C M it is processed accordingly by read-ing the ethernet buffers directly. Amidst all of these, whenever a packet needs to be transmitted by either communications manager the packet is formed and queued up for transmission onto the ethernet. Ethernet Resource (a) O l d Conf igurat ion (b) N e w C o n f i g u r a t i o n Figure 4.1: Access to the Ethernet Resource by various Communications Managers In order for the three managers to be able to access the same physical resource as separate processes, they all have to be created as part of a team. It makes no CHAPTER 4. IMPLEMENTATION 39 difference which manager is designated as the team root. In this implementation the C M is the team root and at boot time both the G C M and the E M is created by the C M . The ethernet resource structures such as buffers and queues are declared global to all the three team processes. The problem of inconsistency arising from several processes/modules accessing the same data structures is not present in T-Shoshin because mutual exclusion among these processes is provided implicitly by the team implementation. Members of a team do not preempt each other during execution. A team member continues to execute until it gives up its control voluntarily. The only problem is that a team member may execute forever. Caution has to be taken that this does not occur. Right after its inception the G C M creates a timer process for its own timeouts. Then it blocks forever to wait for messages. The messages that it receives are the results of externally generated events. There are three externally generated events which the G C M must deal with. 1. A message from a clock process indicating that a fixed amount of time has expired. 2. A message issued by the kernel on behalf of a user process which indicates that a user process wishes to perform a group IPC. 3. A message from the Ethernet Manager indicating that a packet has been trans-mitted or has arrived. CHAPTER 4. IMPLEMENTATION 40 4 . 3 . 2 G r o u p S e n d When a group send is initiated by a user, it gets to the Group Communications Manager through a kernel message. The G C M then sets up an outstanding commu-nication structure (OUTCM) for it. This structure is the key to identifying the group communication every time it's needed. It contains the vital pieces of information for managing an ongoing group communication. Such information include source process and destination group, state of the communication, number of replies wanted back, and so on. After assembling the message into a packet, the packet is queued for broadcast onto the network. When the packet gets onto the network the G C M waits for ac-knowledgements from all the other hosts indicating how many members in the group are on the remote hosts. When acknowledgements from remote hosts come in, the sender host checks them against the network configuration that it has. As soon as all acknowledgements come in the sender is unblocked. Otherwise it waits until the timer expires and the message is rebroadcasted. After a maximum number of rebroadcasts those hosts that didn't acknowledged are considered dead or temporarily down and their acknowledgements are ignored. The sender is unblocked with the accumulated count as the return parameter. Presumably after the sender unblocks it is waiting for replies from members. Every time a reply comes back the total number of replies expected back is reduced by one. CHAPTER 4. IMPLEMENTATION 41 While it is waiting the G C M does not do any probing whatsoever to see if the members are alive or whether they have received the messages or whether they are in the process of replying. The G C M only holds the necessary information such that when a reply comes in it can be properly recognized. The G C M on the other hand can and wil l be probed by remote G C M s as to the continued existence of this group communication. Furthermore the G C M checks on the sender process occasionally to see if it is still alive. This prevents the G C M from maintaining the O U T C M forever when in fact the sender is dead already. 4 . 3 . 3 G r o u p R e c e i v e Since broadcasting was used to transmit messages, all the hosts on the network pick up all broadcast packets that come by. When the broadcast packet contains a group message the G C M on the remote host sets up an O U T C M that is almost identical to the one on the sender's host. These remote O U T C M s acts as the "extension" of the original O U T C M on each remote host. The presence of a remote OUTCM oh a remote host signifies that the group communication is still ongoing as far as the remote host is concerned and its absence signifies the reverse. If the group message is an override of an existing one then the old O U T C M is simply updated (in particular the session number). After setting up a remote O U T C M the G C M queries the local group server for a CHAPTER 4. IMPLEMENTATION 42 list of members belonging to the group. With the list it tries to queue up a copy of the message to each member on the list. Only those that have successfully been delivered a copy of the message are counted and the total count is acknowledged to the sender's host. A new message overidding an old one would also result in the new message being queued up without the old one being removed (assuming it hasn't been received). If by any chance the member executed a receive primitive to get the old message it will get an indication instead that the message is outdated and therefore cannot be received. The remote O U T C M would also keep a reverse list of all the members involved in that group communication as a means of tracing the members' activity. When a member gets a delivery of the message it's state is set to D E L I V E R E D . When messages queued up on the message queue of individual members gets picked up by the member when it executes one of the receive primitives the state of the member on the reverse list on the remote O U T C M is updated to R E C E I V E D . For each remote O U T C M that it is maintaining the G C M sends a probe to the sender's host regularly to verify that the group communication is still ongoing. This is to avoid the situation where the group communication has already terminated but some of the remote hosts aren't properly notified. CHAPTER 4. IMPLEMENTATION 43 4.3.4 G r o u p R e p l y When a member process executes a reply the IPC is set up such that the Com-munications Manager gets it instead of the Group Communications Manager. The idea behind it is that a reply from a member to a sender process is basically a 1-1 IPC exchange and hence should be handled by the 1-1 CM. Having the GCM handle it re-sults in duplicated efforts. Close cooperation though is maintained between the two communications manager such that the reply will be directed properly to the sender and that the replies expected back is also reduced accordingly by the G C M . When the C M gets the reply message, it does not have the necessary information regarding the group communication. It seeks the information from the G C M . What the G C M does is check that it has a remote O U T C M corresponding to the group communication. Furthermore the member must be on the list and its state should be R E C E I V E D . If all of these conditions are satisfied the reply proceeds. A copy of the remote O U T C M is made from the G C M and moved to the CM's domain. From here the reply message is treated as a 1-1 exchange between the member and the sender. The 1-1 protocol for message exchange is applied. This happens to be a reliable protocol that mediates a reliable 1-1 exchange between two entities. On the other hand if not all of the conditions are satisfied the reply is treated as having failed and the replier is unblocked with the proper indication status. In delivering the reply back to the sender, the first thing that the C M does is send CHAPTER 4. IMPLEMENTATION 44 a control packet requesting to send a reply message. At this point the sender may or may not be blocked waiting for a reply to come back. If it is blocked, then the reply message is immediately transferred to the sender. If the sender is not blocked an occasional probe will be sent out to check whether the sender is alive, whether the communication still exists, and whether the sender has already executed a get_reply. There is a possibility that the replying process would block forever waiting for its reply to be received by the sender. This problem is similar to using the 1-1 blocking send primitive. Both suffer from the same set of because both are blocking. 4.3.5 Termination of a Group Communication A group communication terminates on one of the following conditions: 1. The total replies wanted back has been satisfied. 2. The total replies wanted back was explicitly set to zero. 3. The ongoing communication was succeeded by a new one. 4. The sender process died. A group communication is terminated when the user has set a number for the replies it wants back and all of them have arrived. A second way is when the sender forces the group communication to be terminated immediately by setting the replies count to zero. For both of the above the G C M will remove all traces of the group CHAPTER 4. IMPLEMENTATION 45 communication by removing the O U T C M associated with the group communication. It will also broadcast a T E R M I N A T E packet to all the hosts on the network to inform them that they should removed their remote O U T C M s as well. Those that aren't properly notified will carry on until it sends out a probe to the sender's host. Note that the replies wanted back can be set a number of times as long as the subsequent settings is less than or equal to the previous setting. Otherwise it remains unchanged. A third way to terminate the group communication is by issuing a group send to the same group. The new send overrides the old one. A l l the replies to the old one will fail as they are no longer applicable. To avoid matching replies to the wrong group send a session number is associated with each group communication. Each time a process starts a new group communication it is assigned a new session number. Each reply that comes back must carry with it the correct session number to match the group send. A final way that a group communication terminates is when the sender dies and all the communication it is involved in becomes nonfunctional. The remote O U T C M s that reside on the remote hosts will eventually find out that the sender has died when it probes the sender's host. It will get an acknowledgement packet indicating that the sender has died. The O U T C M on the sender's host will also probe it's own sender to see if the sender is still alive. If it finds out that the sender has died unexpectedly it will initiate a group communication termination procedure. CHAPTER 4. IMPLEMENTATION 46 4.3.6 Certain Implementation Issues A n upper limit has been set over the size of group messages such that it can fit in one network packet. This was done to minimize huge messages from being broadcasted and acknowledged by every host in the network. This is sufficient assuming that most applications would need to sent relatively small messages to groups of processes. Broadcasting or multicasting were the two alternative ways to get a message to all hosts on the network. V uses multicasting because it is taking advantage of the hardware multicast facility. The implementation here used broadcasting because not all ethernets have a multicasting facility. In particular the SUN Workstations on which the original T-Shoshin was ported to doesn't have any hardware multicasting facility. Chapter 5 Results and Analysis The group communication model in T-Shoshin was formulated with simplicity and consistency as two of its major goals. The model is simple in that it is straightforward to track what occurs during the process of a group communication. This is also true in the case of abnormal conditions arising in the middle of an ongoing communication. It is consistent in that the delivery of messages to both local and remote members are done in exactly the same fashion. Changes made to the mechanism of delivery would be applicable both to local and remote sites. After the implementation a performance evaluation was undertaken. Measurements were obtained on the performance of the IPC primitives when used for group communi-cation and the results reported here. These results show how good our model is under different parametric conditions. Through a careful examination of the results it is pos-sible to determine where possible compromises could be made to the model to improve the efficiency of the IPC primitives. The results could also point out weaknesses in the 47 CHAPTER 5. RESULTS AND ANALYSIS 48 model which suggest possible improvements to the model itself. 5 . 1 M e a s u r e m e n t M e t h o d a n d E n v i r o n m e n t The measurements were taken using five Sun Workstations connected through a 10 Mbps Ethernet. No attempt was made to take the measurements at "anti-social" times because such figures would give an indication of how fast the IPCs would perform at a situation which does not occur in reality. Instead the results show the performance of the primitives under a reasonable amount of network traffic. The measurements for all the group IPCs were taken following the outline of the two code fragments given in Figure 5.1. The first part corresponds to the sender process which continually produces group IPC messages. The bottom part corresponds to the member process which continually consumes the messages. Reply messages may be produced by the member process if required by the setup. The elapsed time calculated for an individual group send operation represents an average figure over the 1,000 sends. The overhead involved in making the two gett ime system calls, the for loop control, and the assignment statements have not been deducted from the reported figures. The average overhead for the get t ime call is 1.07 milliseconds which is insignificant when it is amortized over 1,000 IPC sends. The time for the loop control and the assignment statements are also insignificant compared to the time it takes to execute an IPC send. CHAPTER 5. RESULTS AND ANALYSIS 4 9 5.2 Performance Measurements Measurements were taken for the 1-1 IPC primitives to see how the performance of the 1-1 remote communication is after the reinterface to an Intel Ethernet controller. Table 5.1 shows the measurements for various message sizes using <bsend,brecany> pair. Elapsed time refers to the time between the sending of a message by one process and the receiving of the message by another process. There is approximately a 30-36 fold increase in the transmission time for 1-1 IPC send compared to the original T-Shoshin. Special attention is given to the case where the message size is 0. In the new T-Shoshin it takes 1.172 seconds to send a message size of 0. Acton made an analysis on the various execution costs of sending a packet. The possible cause in this case could range anywhere from processing overhead to operate and run the controller to costs of sending messages to coordinate various aspects of the transmission. Due to time limitation, no further analysis was done to determine the exact cause of such huge transmission time. However the next section contains a discussion on the analysis of the implementation. Some of the causes that give rise to higher elapsed time for 1-1 communication would also give rise to a higher elapsed time for group communication. With these causes in mind, it is expected that the transmission time for group communication would be fairly high as well. Measurements were carried out on the two group IPC primitives, bsend and re-CHAPTER 5. RESULTS AND ANALYSIS 50 Message Size Elapsed Time (bytes) (ms) 0 1172 16 1173 32 1173 512 1179 1024 1182 2048 1730 8192 3905 Table 5.1: Elapsed Time for 1-1 <bsend,brecany> pair quest. Most of the measurements were done on the bsend because it embodies the features of a broadcast datagram without the extra delays which are irrelevant to the actual sending of a group message. In contrast the elapsed time measurement of a request primitive is comprised of three components: the broadcasting of the sender's message, the waiting period that the sender must undergo for a reply to come back, and the time it takes for a member to send back a reply. The first and last components are the ones of interest. The first component is almost exactly similar to what the bsend does. The last component is a 1-1 reply which uses the 1-1 communication mechanism. The measurements for bsend are given in Tables 5.2 to 5.4. Table 5.2 reports the elapsed time figures when the number of remote members in a group is varied. In this table elapsed time refers to the time that the sender sends out a group message to the CHAPTERS. RESULTS AND ANALYSIS 51 time that the sender is unblocked with the delivery count, x remote members means Number of 32 bytes message 1418 bytes message Remote Members elapsed time (ms) elapsed time (ms) 1 813.2 813.7 2 814 813 3 813 813 Table 5.2: Elapsed Time for Group bsend - A l l Remote Members that there is a member in each of x number of hosts. There are two sets of figures, one for a message size of 32 bytes and a second set for a message size of 1418 bytes (the maximum allowable group message size). The figures show almost no variance but it is lower than that for 1-1 b s e n d . The 1-1 b s e n d is a reliable protocol and involves more packet exchanges to transmit a message. It involves the sending of four packets between the sender and receiver host in the following order: RTS (Request to Send), CTS (Clear to Send), Data, Ack (Acknowledgement). In group communication, the sender's host broadcasts the message and all the hosts pick it up simultaneously. Subsequently they all send back an acknowledgement at roughly the same time. It remains for the sender's host to pick up and process these acknowledgements. Table 5.3 reports the elapsed time figures for different number of local members in a group. The elapsed time figure is similar to the one for Table 5.2. It is the time that the sender send out a group message to the time that the sender is unblocked with a CHAPTER 5. RESULTS AND ANALYSIS 52 delivery count. Since the implementation for local and remote members are exactly the same, the elapsed time is not expected to be any lower than that for sending group messages to remote members. The group message still has to go out on the Ethernet. It is then picked up by the sending host and processed just as if the sending host was receiving broadcast messages from other hosts. For this measurement the total number Number of 32 bytes message 1418 bytes message Local Members elapsed time (ms) elapsed time (ms) 1 819 827 3 828 844 5 834 869 7 868 872 9 854 881 11 888 922 13 872 940 15 888 935 Table 5.3: Elapsed Time for Group bsend - A l l Local Members of remote members are kept at zero. Two sets of figures are also presented here, one for a message size of 32 bytes and another for a message size of 1418 bytes. The figures for both sets show an increase in the elapsed time as the number of local members increases. Furthermore the set of figures for a message size of 1418 bytes is higher than those for a message size of 32 bytes. The figures reflect how group communication is implemented. When a group com-CHAPTER 5. RESULTS AND ANALYSIS 53 munication is initiated the message is broadcasted onto the network and the Group Communications Manager waits until an acknowledgement is received from every host in the network. As each host (the local and all the remotes) picks up the message it queues the message onto the "mailbox" of each member. This explains why the elapsed time increases as the number of local members increase. The more members there are the more processing time it takes to do the queuing before an acknowledgement can be sent back. It also explains why the processing time stays roughly the same when each remote host has only one member [Table 5.3]. The G C M on the sender's host has to wait until the slowest host sends back its acknowledgement. When there is only one remote member, the remote host containing that member is the last to send back an acknowledgement. When there are two or more remote members, the remote hosts containing these members are the last to send back their acknowledgements. A l l of them send it back at roughly the same time. The elapsed time is longer when the message is longer because it takes some time to copy the messages from the G C M ' s buffer to the receive buffer of each member process. This same delay is not seen in Table 5.3 because the members are on remote hosts and do not affect the sender's host. Several average values were calculated for Table 5.3. First, the average increase in time caused by having an additional local member in a group is 4.9 msecs if the message size is 32 bytes and 7.7 msecs if the message size is 1418 bytes. Second, transferring a message of 1418 bytes consistently takes a longer time than transferring a message of CHAPTER 5. RESULTS AND ANALYSIS 54 32 bytes. The average difference is 29.9 msecs. The results of Tables 5.2 and 5.3 together seem to imply that there is some delay involved in waiting for an acknowledgement to come back from each host even if there is not much overhead incurred in processing the acknowledgements. This leads to the question of whether it is worthwhile for the sender to wait for an acknowledgement from each host. The main reason for these acknowledgements is that they contain the number of remote members on a remote host. When all these acknowledgements return to the sender's host, the total number of remote members are added up and the totaled value is passed to the sender. This value is a very useful feature to include in the model. As the sender unblocks it wil l have an indication of how many members could potentially send back a reply. Based on this knowledge it could judiciously select an appropriate number of reply messages. To determine the delay incurred by the acknowledgements, a further experiment could be carried out which eliminates the need for each host to send back an acknowledgement and have the sender unblock as soon as the message is broadcasted onto the network. Table 5.4 reports the elapsed time figures when the group is empty, that is there are no local and no remote members. These figures are surprisingly high. This is attributable to the way the group send was implemented. When all the host acknowl-edgements arrive and it was found that the membership is zero, the G C M proceeds to terminate this communication. It broadcasts a T E R M I N A T E message to all the hosts. CHAPTER 5. RESULTS AND ANALYSIS 55 Message Size Elapsed Time (bytes) (ms) 32 1293 64 1342 128 1341 256 1340 512 1319 1418 1342 Table 5.4: Elapsed Time for Group bsend - Empty Group This way all the system resources occupied by the communication can be reclaimed immediately. Furthermore there is no need for regular garbage collection. A n alter-native is for the communication to remain in existence until either it is superceded by another group communication or the sender process dies. These alternatives presents another case of time versus resource trade-off. The first alternative shows a degrada-tion in system performance but makes better use of system resource. The latter will result in better IPC performance but system resources are unnecessarily occupied. It cannot be definitely stated which is the better solution. The choice of which one to use is dependent on the needs and usage of the system. Table 5.5 shows the figures for a group send using the request primitive. After the message is send out, the request primitive blocks until one reply comes back. Here elapsed time refers to the time when the sender issues a request primitive to the time CHAPTER 5. RESULTS AND ANALYSIS 56 Message Size Elapsed Time (bytes) (ms) 32 2146 64 2113 128 2090 256 2247 512 2249 Table 5.5: Group request with One Remote Member that it gets back a reply in return. A reply message going from a member to a sender is done using the 1-1 IPC mechanism because it is essentially a 1-1 IPC send. The 1-1 IPC mechanism uses a protocol that guarantees reliable delivery. This introduces a very long delay in the message transaction caused by the sender waiting for a reply. The delay is due to the way the 1-1 reply was implemented. When a member initiates a reply, a probe is send out to the sender to check whether the sender is ready to receive the reply. If the probe fails (that is the sender is not ready), the replier will sleep for a period of time and try again. In the meantime the sender could become ready for a reply but it has to wait until the replier sends the next probe before it could say Clear to Send. Acton reported that the remote IPC elapsed time for a bsend is 32.3 msecs for a message of 32 bytes and 36.2 for a message of 512 bytes [ACTON85], Assuming that a reply message would take roughly the same time, the average amount of time that CHAPTER 5. RESULTS AND ANALYSIS 57 the sender waits for a reply message is 1.3 sees for a message of 32 bytes and 1.4 sees for a message of 512 bytes. These figures aren't very surprising considering that the replier sleeps for three seconds before it tries probing again. The delay introduced into the message transaction is certainly very high. There are a number of possible modifications to the model which could reduce this delay. The first modification is to buffer the reply message at the sender's host. This way the sender wil l immediately obtain the reply when it is ready. The second modification is to separate out the reply mechanism between a 1-1 and 1-many reply. The latter reply may not necessarily need the reliability that the 1-1 reply gives. The underlying network is such that packet lost rate and packet error rates are low. This alternative looses some of the advantages of using the 1-1 reply mechanism. The first disadvantage is that the replies are not guaranteed to arrive safely. The second disadvantage is that the reply messages cannot be arbitrary in size. If the reply message is arbitrarily long then the 1-many reply mechanism has to have some sort of protocol to segment and reassemble the message. It may even need to ensure that the component segments arrive at the destination site. These are all currently handled by the 1-1 reply mechanism. To do it in the 1-many reply mechanism is a duplication of effort. CHAPTER 5. RESULTS AND ANALYSIS 58 5.3 Analysis of Implementation The figures reported here suggest that the group communication of T-Shoshin needs further tuning and much closer examination. Results of the 1-1 IPC on the new implementation show that there is an inordinate amount of delay contributed by the implementation. The increase factor ranges 30-36 times over the original implemen-tation. These increase, however, do not necessarily imply that the elapsed time for group IPC can be reduced by the same factor. It implies that the elapsed time may be reduced by an equivalent amount if implemented on the original machines. With this in mind I've examined the machanism of group communication independent of the figures reported here to determine its efficiency. As a first step I've examined the process of sending out a group message to see what processing are needed: 1. the costs of sending messages to various processes to coordinate the message transaction. 2. cost of moving data from one address space to another or from one buffer to another. 3. actual transmission on the network. 4. processing overhead to operate the controller and acquire the ethernet. 5. processing overhead to run the protocol and build any necessary headers. 6. processing overhead to determine the local membership for.a group. CHAPTER 5. RESULTS AND ANALYSIS 59 Acton has outline some of these same costs for 1-1 IPC and concluded that the largest component associated with sending a packet is that of local messages. Part of the poor performance of T-shoshin 1-1 remote IPC with respect to the V Kernel results from having the Communications Manager be a process separate from the kernel. This results in extra context switches and message passes that contribute significantly to the overhead of transmitting a message to a remote machine. These overheads are aggravated in the Group Communications Manager for a num-ber of reasons. Firstly, for each packet that is sent and received on the network, there are two local IPC exchanges, one for the interrupt process that notifies the Ethernet Manager and one for the Ethernet Manager to notify the appropriate communications manager. It also involves a context switch from the kernel context to the communi-cations manager context. With the current network configuration this amounts to six context switches and 12 local IPC exchanges on the sender's host. On top of that there is one context switching and two IPC exchanges on each of the receivers' host. Secondly, for each message that arrives on the remote machines, the message is first copied from the transmission buffer to the G C M ' s address space. Subsequently it is copied from the G C M ' s address space to each member's address space. Thirdly, at each remote site, the G C M has to send a request message to the local group database server to obtain information on local membership. This involves further local IPC exchange and full context switching between address spaces. CHAPTER 5. RESULTS AND ANALYSIS 6 0 A significant part of the overhead associated with a group communication in T-Shoshin is caused by having the Group Communications Manager, the Ethernet Man-ager, and the Group Database Server as separate server processes. The coordination amongst these three processes to complete a group message transaction entails a con-siderable amount of context switching, blocking and unblocking of processes, and local IPC exchanges. The first step toward fine tuning the system is to dispense with the Ethernet Manager so that one local IPC exchange can be eliminated each time a packet arrives and each time a packet needs to be sent. This requires major restructuring in the design of the communications manager as well as the interface to the ethernet and interface to the kernel interrupt processes. The most straightforward solution is for the kernel interrupt process to read part of the incoming packet and determine which communications manager should get the packet. This approach has the undesirable effect of requiring that the kernel have some knowledge about some aspect of the communication process. It partly defeats the purpose of hiding all the details of communication inside the communications manager. A second approach would be to merge the Communications Manager and Group Communications Manager as a single process and separate them through code modu-larity. This approach is practical in terms of the functions these two servers carry out. They both do practically the same thing and they both share a number of common CHAPTER 5. RESULTS AND ANALYSIS 61 resources. If this approach is taken, most of the work will center around rewriting the code so that the non-sharable data structures between managers are kept sepa-rate. The major drawback of this approach is that the definition of this new resulting communications manager becomes blurred. A third approach would be to port all of the processes into the kernel as kernel team processes. There will still be local IPC exchanges but they are between team processes and hence there wil l be no context switching. There wil l however still be the blocking and unblocking of team processes. It would be interesting to find out how such a porting effort can increase the performance both of the 1-1 remote IPC and the group IPCs. Furthermore it would be worthwhile to see if such increase in performance makes T-Shoshin's IPC comparable to V . This can definitely be considered as a potential area for further examination. CHAPTER 5. RESULTS AND ANALYSIS mainQ { g e t t i m e ( s t a r t _ t i m e ) ; f o r ( i = 0 ; i < 1 0 0 0 ; i++) /* Issue IPC command */ gettime(end.time) > (a) IPC producer mainQ { while ( 1 ) send_id = brecany( ... ) /* r e p l y t o IPC i f r e q u i r e d */ r e p l y . i d . h i d = s e n d _ i d . h i d ; r e p l y _ i d . l i d = s e n d _ i d . l i d ; r e p l y ( r e p l y _ i d , ... ) } } (b) IPC consumer Figure 5.1: IPC Timing Method Chapter 6 Concluding Remarks 6 . 1 T h e s i s Summary A group communication model was developed which allows one process to simul-taneously communicate with a number of other processes. This model fits very well into the client-server environment that T-Shoshin ideally supports. It is based on a broadcast datagram service model and provides minimum reliability in delivering the group message. Hence it leaves room for upper layers to develop their own reliability measures. The model is also flexible in that it provides two types of group send. One type allows nonblocking send in which the sender knows the number of members that got a delivery of the message. The sender can then obtain the replies at its own pace. The second type simply blocks the sender until the first reply comes back. The implementation of the model was kept simple and consistent. Once a group communication is initiated it is possible to track its behavior even under abnormal 63 CHAPTER 6. CONCLUDING REMARKS 6 4 conditions. Much attention and emphasis was given to provide it with a rich failure semantics. This way the sender would be able to tell what went wrong if the communi-cation fails. The mechanism for delivering messages both to local and remote members were kept the same such that subsequent changes made to the mechanism would apply to both local and remote delivery. After porting T-Shoshin to the new SUNs it was found that there was a 30-35 fold increase in the elapsed time for sending messages. This increase is expected to be reflected in the figures for group communication as well. After the implementation performance measurement results were collected and an-alyzed. Elapsed time for G r o u p bsend range from 800-930 ms for groups of different sizes. Elapsed time for G r o u p Request range from 2150-2250 ms. The figures re-ported were high partly due to the fact that they weren't collected at times when the network traffic is minimal. Based on the figures it was seen that a lot of fine tuning and improvements could be done to enhance the performance of the group IPC. In certain cases it involves changing parts of the model to dispense with certain features. Discussions were made as to what these implications will be. After some analysis it was found that a lot of the delay is due to having three separate server processes co-ordinate their functions through local IPC exchanges. The overhead is caused by both context switchings and actual IPC exchanges amongst the servers. Several suggestions were given as to how the performance could be improved. None seemed to be an ideal CHAPTER 6. CONCLUDING REMARKS 65 solution since each has a particular drawback. The "best" solution would probably be dictated by the needs and usage of the system. However there is ample room for making further studies as to which is the best solution to adopt for specific situations. 6 . 2 F u t u r e W o r k There were a number of issues mentioned earlier in the thesis which could be areas for further examination. One such issue involves examining the failure semantics of the group communication model as to its completeness relative to the various types of application that require group communication. Another issue is to determine whether certain features of the group model such as sending back host acknowledgements is worthwhile to be included in the model. A third issue is to consider various ways of improving the performance of both communications manager. This may involve study-ing whether it is worthwhile to move all the server processes (involved in processing a communication) inside the kernel. Bibliography [ACTON85] D . Acton, Remote Interprocess Communication and its Performance in Team Shoshin, T R 85-16 Deparment of Computer Science, The University of British Columbia, Vancouver, B . C . , November 1985. [ACTON86] D . Acton, H . Wang and S. T. Vuong, " Experience with Interprocess Communication in the Distributed Operating System Team Shoshin", IEEE International Conference on Communications '86, pp. 1444-1448, June 1986. [BERG86] E . J . Berglund, " A n Introduction to the V-System", IEEE Software, pp. 35-52, August 1986. [CHER79a] D.R. Cheriton, Multi-process Structuring and the Thoth Operating Sys-tem, Ph.D Thesis, University of Waterloo, 1979. [CHER84a] D.R. Cheriton, W. Zwaenepoel, "One-to-Many Interprocess Communica-tion in the V-System", ACMSigcomm '84 Symposium, 1984. [CHER84b] D . R. Cheriton, "The V-Kernel: A Software Base of Distributed Sys-tems", IEEE Software, pp. 19-42, Apr i l 1984. [CHER85] D . R. Cheriton, W. Zwaenepoel, "Distributed Process Groups in the V Kernel", ACM Transactions on Computer Systems, pp. 77-107, May 1985. [TOKU83a] Hideyuki Tokuda, Sanjay Radia, Eric Manning, "Shoshin OS: a Message-based Operating System for a Distributed Software Testbed", Proceedings of the Sixteenth Annual Conference on System Sciences, 1983, pp. 329-338, 1983. [TOKU83b] Hideyuki Tokuda, Eric Manning, "An Interprocess Communications Model for a Distributed Software Testbed", Proceedings of SIGCOMM 66 BIBLIOGRAPHY 67 '88 Symposium on Communications Architectures and Protocols, pp. 205-212, March 1983. [TOKU84] Hideyuki Tokuda, Shoshin: A Distributed Software Testbed, Ph.D. The-sis, University of Waterloo, 1984. [WANG86] H . Y . Wang, Implementation of Team Shoshin: An exercise in porting and multiprocess structuring of the kernel, T R 86-8 Department of Computer Science, The University of British Columbia, Vancouver, B . C . , March 1986. 


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items