UBC Theses and Dissertations


The role of exception mechanisms in software systems design Atkins, Margaret Stella 1985

Full Text

THE ROLE OF EXCEPTION MECHANISMS IN SOFTWARE SYSTEMS DESIGN

by

M. STELLA ATKINS
B.Sc., Nottingham University (England), 1966
M.Phil., Warwick University (England), 1976

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES (Department of Computer Science)

We accept this thesis as conforming to the required standard.

THE UNIVERSITY OF BRITISH COLUMBIA
October 1985
© M. Stella Atkins, 1985

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Computer Science
The University of British Columbia
1956 Main Mall
Vancouver, Canada
V6T 1Y3

Abstract

Exception handling is a crucial aspect of practical programming, particularly in systems allowing logical concurrency such as multi-process distributed systems. First, a survey of existing exception handling mechanisms in operating systems is performed, which shows a diversity of implementations, depending on the process model and the method of inter-process communication. The thesis then develops a model for designing software which exploits the different mechanisms for handling normal and exceptional events. The model is applicable in many multi-process programming environments, and not only preserves modularity, but also enhances efficiency and reliability, while often increasing concurrency.
To derive such a model, exceptions in multi-process software are classified primarily according to the program level at which they are detected and handled. Server-to-client exceptions are of particular interest because of their ubiquity; these are exceptions detected by a server and handled by a client. The model treats systems programs as event driven, and proposes dividing the events into normal or exceptional, according to the cost and mechanisms for handling them. Techniques are described for designing software according to three criteria: minimising the average run-time, minimising the exception processing time, and incrementally increasing the program's functionality. Many examples are given which illustrate the use of the general model.

Program paradigms in several languages and in several systems are introduced to model features which are system dependent, through illustrative examples for asynchronous i/o multiplexing, and for exception notification from a server to its client or clients.

Finally, some programs which have been implemented according to the rules of the model are described and compared with their more conventional counterparts. These programs illustrate the practicality and usefulness of the model for diverse systems and concurrent environments.

Table of Contents

Abstract
List of Tables
List of Figures
Acknowledgements
1 Introduction
  1.1 Introduction
  1.2 What is an Exception?
  1.3 Goals and Scope of the Thesis
  1.4 Motivation
  1.5 Synopsis of the Thesis
2 Exceptions in Multi-process Operating Systems
  2.1 Introduction
  2.2 General Exception Mechanisms in Operating Systems
    2.2.1 UNIX
      2.2.1.1 UNIX Processes
      2.2.1.2 UNIX Signals as Exception Mechanisms
      2.2.1.3 Intra-process Exception Mechanisms
    2.2.2 Thoth
      2.2.2.1 Thoth Processes
      2.2.2.2 Process Destruction as an Exception Mechanism
      2.2.2.3 Intra-process Exceptions
    2.2.3 PILOT
      2.2.3.1 Processes and Monitors in PILOT
      2.2.3.2 Intra-process Exceptions in PILOT
      2.2.3.3 Inter-process Exceptions in PILOT
      2.2.3.4 Exceptions within Exception Handlers
    2.2.4 The Cambridge CAP System
      2.2.4.1 CAP Processes and Protected Procedures
      2.2.4.2 Intra-process Exceptions in CAP
    2.2.5 RIG
      2.2.5.1 Exceptions in RIG
    2.2.6 Medusa
      2.2.6.1 Processes in Medusa
      2.2.6.2 Exceptions in Medusa
      2.2.6.3 Internal Exceptions in Medusa
    2.2.7 Remote Procedure Call
  2.3 Mechanisms for Interactive Debugging
  2.4 Exception Handling in Action
    2.4.1 Inter-process Cooperation
    2.4.2 Asynchronous Events -- Terminal Interrupt
  2.5 Summary
3 Principles for Improving Programs through Exception Mechanisms
  3.1 Classification of Exceptions
    3.1.1 Inter-process Exceptions
  3.2 Model for Systems Design
    3.2.1 Objective A: Minimise the Average Run-time
    3.2.2 Objective B: Minimise the Exception Processing Time
    3.2.3 Objective C: Increase the Functionality of the System
  3.3 Examples Illustrating Minimising the Average Run-time
    3.3.1 Partition the Event Set into 2 Groups to Reduce Detection Costs
      3.3.1.1 Spell Example
      3.3.1.2 Putbyte Example
    3.3.2 Reduce Handling Costs by Restructuring Programs
      3.3.2.1 Os Example
      3.3.2.2 Verex Name-server Example
      3.3.2.3 Newcastle Connection Example
      3.3.2.4 The TROFF/TBL/EQN Example
      3.3.2.5 Point-to-point Network Protocol Example
    3.3.3 Reduce Context-saving Costs in Normal Events
      3.3.3.1 Read-back Example
      3.3.3.2 LNTP Example
  3.4 Examples Illustrating Increasing the Functionality of the System
    3.4.1 Using Inter-process Exceptions from Server-client
      3.4.1.1 ITI <BRK> Example
    3.4.2 Ignorable Exceptions
      3.4.2.1 ITI Reconnection Example
      3.4.2.2 Window-size-change Example
      3.4.2.3 Hash Table Example
      3.4.2.4 Slow Graph Plotter Example
      3.4.2.5 Mesa File-server Example
4 Program Paradigms for Implementing System Dependent Features
  4.1 Program Paradigms for Inter-process Exception Notification
    4.1.1 The KEYBOARD/MOUSE Example
    4.1.2 Exception Notification in a Synchronous Message-passing System
    4.1.3 Exception Notification in Procedure-oriented Concurrent Languages
    4.1.4 Exception Notification in Message-oriented Concurrent Languages
    4.1.5 Exception Notification in Operation-oriented Concurrent Languages
    4.1.6 Exception Notification in Distributed Systems
  4.2 Program Paradigms for Ignorable 1:many Exceptions
    4.2.1 The Storage Allocator Example
    4.2.2 Storage Allocator Using Unreliable Broadcast Notification
    4.2.3 Storage Allocator in Mesa
    4.2.4 Storage Allocator in Medusa
    4.2.5 Storage Allocator in V-kernel
  4.3 Exception Handlers as Processes: Mechanisms for Mutual Exclusion
5 Programs Illustrating the Model
  5.1 Exceptions and Protocol Implementations -- a New Approach
    5.1.1 The ITI Protocol
    5.1.2 The Interrupt Protocol
    5.1.3 The ITI Implementation in a Thoth-like Operating System
      5.1.3.1 Notification of Inter-process Exception Using Process Destruction
      5.1.3.2 Notification of Inter-process Exception Using a Toggle Switch
      5.1.3.3 ITI as a Filter and an Exception Process
    5.1.4 Problems with Synchronisation
    5.1.5 Recovery after Network Failure
    5.1.6 Summary
  5.2 A Nested Transaction Mechanism for Atomic Data Updates
    5.2.1 Introduction
    5.2.2 Overview of the ARGUS Model
      5.2.2.1 Atomic Objects
      5.2.2.2 Nested Actions
      5.2.2.3 Guardians
    5.2.3 Implementation Considerations
      5.2.3.1 Introduction
      5.2.3.2 Requirements for Correct Implementation
    5.2.4 Features of the Implementation
      5.2.4.1 Transaction Invocation
      5.2.4.2 Atomic Data Access
      5.2.4.3 Subaction Commit
      5.2.4.4 Top-level Commit
      5.2.4.5 Transaction Aborts
      5.2.4.6 Transaction Read/Write Conflicts
      5.2.4.7 The Group Leader's Role
      5.2.4.8 Group Leader's TOP-COMMIT Algorithms
      5.2.4.9 Group Leader's ABORT Algorithms
    5.2.5 Efficiency of Mechanisms
6 Conclusions
  6.1 Summary of Results
    6.1.1 Characterize the Nature of Exceptions
    6.1.2 Model for Systems Design
    6.1.3 Program Paradigms for Systems Design
    6.1.4 Program Examples
  6.2 Further Work
References
Appendix A. The putbyte program
Appendix B. The os program
Appendix C. The TROFF/TBL/EQN system

List of Tables

3.1 Execution Costs for a General Event Manager
5.1 Number of Send-Receive-Reply Messages on <BRK>
5.2 Comparison of Locus and V-kernel Nested Atomic Actions
A.1 Execution Costs for the Putbyte Routine
B.1 Execution Costs for the Os Utility

List of Figures

2.1 Thoth Server-client Notification using a Victim Process
3.1 Level 1 Classification of Exceptions
3.2 A Classification of Inter-process Exceptions
3.3 Level 1 Model for Systems Design
3.4 Level 2 and 3 Model to Minimise the Average Run-time
3.5 Model to Minimise the Exception Processing Time
3.6 Model to Increase the Program's Functionality
3.7 Expected Execution Costs for Different Values of np and n
3.8 Restructuring the Verex Name-server to Reduce Ipc in the Normal Case
3.9 Restructuring Newcastle Connection to Reduce Ipc in the Normal Case
4.1 Thoth Server Notification Using Two Worker Processes
4.2 Thoth Switch Notification
4.3 Levin's Storage Allocator
4.4 A User of Levin's Storage Allocator
4.5 Storage Allocator Using a Monitor and Broadcast Exceptions
4.6 Pool-server in V-kernel
4.7 A User and Exception Handler of the V-kernel Storage Allocator
5.1 ITI Implementation as a Server
5.2 Network Packets to be Exchanged on an Interrupt
5.3 Process Switches to Handle an Interrupt Using Process Destruction
5.4 ITI as an I/O Filter
5.5 Process Switches on <BRK> with a Separate Exception Process
5.6 Formation of a Version Stack with Nested Transactions
5.7 Group Leader Communication Structure
B.1 The Initial Version of Os
C.1 Pipe Worker Processes
C.2 Message Traffic for Handling an Equation

Acknowledgements

Thanks to Son Vuong, my supervisor, for his support during the last few years, and to my thesis committee of Sam Chanson, Doug Dyment, Paul Gilmore and Mabo Ito, for their active encouragement and advice on the presentation of this thesis, and to Roy Levin for his constructive comments and criticisms. Thanks to David Cheriton who started my investigation into exceptions, and who gave continued inspiration and advice in the first few years while he was at the University of British Columbia.
Thanks to the many graduate students and staff in the department who have helped me through discussions and criticisms, especially to Ravi Ravindran, Steve Deering and Vincent Manis, and to my colleagues Rachel Gelbart and Lisa Higham for their continued support. Many thanks to Roger Needham and the staff and students at the Cambridge Computer Laboratory for their friendly support and use of their facilities, and thanks also to the staff at Simon Fraser University for letting me finish. Finally, thanks to Derek my husband, and to my children, for their patience.

CHAPTER 1
Introduction

1.1. Introduction

A crucial aspect of practical programming, particularly in systems allowing logical concurrency such as multi-process distributed systems, is exception handling. Over the last few years some operating system and programming language structures have been developed to aid in the handling of exceptional events. The developments are interrelated because of the recursive nature of computer systems: the operating system must be written in a programming language, and the program must be run by an operating system.

The various exception handling mechanisms that have been proposed for programming languages all aim to separate the common from the uncommon case, and thus to help programmers separate their concerns by preserving modularity. In general this is achieved by handling an exception at a higher level of abstraction from where it is detected. This is reflected in most embedded programming language mechanisms, where handlers for exceptions are searched for along a path directed up the procedure-call hierarchy. Of course, the burden of having to write the code for the uncommon case cannot be eliminated. The claimed methodological advantage of embedded exception handling mechanisms is the separation of such code from the main-line program, both in the time of its writing and in its location in the program text.
Similarly, in operating systems software, the facilities for handling exceptional events have been developed for modularity in programming. Many modern operating systems are structured as a hierarchy of processes, where it is possible for a process to control subordinate processes while itself being subject to the control of another. For modularity it is desirable to provide the same exception handling facilities at every level in the hierarchy. However, in operating systems, performance and reliability are additionally crucial to success, and the exception mechanisms must also be designed to enhance the overall system performance and reliability. For example, both the user and the operating system require mechanisms for handling errors -- otherwise an erroneous program which never halted could only be stopped by switching off the power supply! In modern concurrent operating systems, where a program may consist of several processes which may be running on separate processors, the need for such mechanisms is even more acute. As Nelson remarks in his thesis on Remote Procedure Calls [Nelson 81]:

"Machinery for aborting processes and forcing asynchronous exceptions may seem somewhat violent to the gentle programmer of sequential abstractions. But realistic remote procedure environments and, indeed, realistic applications will in fact require this machinery to provide robust services and exploit the full performance of distributed systems."

The user requires a mechanism to allow him to terminate program execution arbitrarily, and such a mechanism is typically invoked by hitting the terminal <BRK> key. The operating system must respond to such an event by terminating the offending program, reclaiming the resources that were allocated to it, and generally attempting to return to the state which existed before the program was run. Such activity may be considered as an example of exception handling in operating systems.
This thesis studies not only the problems which designers of concurrent operating systems have to overcome in handling exceptional events such as the <BRK> described above, but also the problems of systems design exploiting exception mechanisms in general, in a multi-process environment. The high level language exception mechanisms are usually engineered so that their use is economic only when they are rarely exercised; the very circumstance in which an error of logic is likely to remain undetected [Black 82]. Furthermore, none of the high level language mechanisms extend to multiple processes (except for an exception ABORT); they are concerned with exceptions detected and handled within a process. The structures which are available for handling exceptional events in a multiple-process environment are specific to the particular environment; most are not portable to other systems, and they do not pretend to be of general use.

This research extends the development of linguistic exception mechanisms which deal with intra-process exceptions, to mechanisms in a multi-process environment; in particular to event driven programs used in operating systems. An event driven program is one which responds to events which may be generated externally to the computer system (such as by the user hitting the terminal <BRK> key), or by other processes within the computer system (such as an i/o completion event or a timer firing). Peterson and Silberschatz state that operating systems are event driven programs [Peterson 83], in that if there are no jobs to execute, no i/o devices to service, and no users to respond to, an operating system will sit quietly, waiting for something to happen. We use the term event driven more widely, to include system servers (or monitors) which are structured hierarchically, with a client, or clients, making requests of a server below, which in turn may use the resources of another server (or may make a nested monitor call).
We also view some data-driven programs as event driven programs -- in these cases, the events correspond to the value of the data received. The program units which respond to these events are called event managers. In some cases, the events occur randomly, with unknown probabilities; in others, the probability of an event is deterministic. In event driven programs where the relative probabilities of the events are known, we can say that the event managers execute some small percentage of the code to handle the normal or most probable events, and the rest of the code is executed to handle the unusual events. Indeed, it is computer folklore that 90% of most program code is executed only 10% of the time. It is therefore important to structure such event managers so that the normal case is clearly differentiated from the unusual cases, so that the programmer can separate his concerns and write efficient code for the normal cases which are executed most of the time [1].

This thesis develops principles for writing efficient event managers; principles which exploit different mechanisms for handling normal and exceptional events. The principles are used to derive a model for designing software, applicable in multi-process programming environments, which not only preserves modularity, but also enhances efficiency and reliability, and often increases concurrency.

1.2. What is an Exception?

Exceptions have often been considered as errors, although most designers of exception handling mechanisms claim that their mechanisms, while suitable for dealing with errors, are also valuable for other events. Unfortunately this guideline is of little use, because the term error has as many interpretations as exception. Black devotes a chapter of his thesis Exception Handling: the case against [Black 82] to the topic What are Exceptions?
He tries out phrases like an exception occurs when a routine cannot complete the task it is designed to perform (from the CLU reference manual [Liskov 78]), inconvenient results of an operation (from [Goodenough 75]), errors that must be detected at run-time (from the ADA definition [US 83]), and means of indicating implementation insufficiency [Black 82], and finally observes that none of these definitions describe an abstract object called exception which can be neatly formalised.

In his thesis Program Structures for Exceptional Condition Handling, Levin [Levin 77] rejects any connotation of error by exception, but he also rejects the common guideline that an exception is a rarely occurring event, as the frequency of many events depends on context. For example, looking up a name in a compiler's symbol table may yield two results: name present or name absent. Which of these is more frequent depends on context. Why should one result be considered exceptional, when the other is not? Levin's solution to the dilemma is to treat both results as exceptions.

The dictionary defines the word exception in a manner which captures the essence of the notion of unusual case: i.e. "exception - a person, thing, or case to which the general rule is not applicable" -- and this is the spirit in which we use it, strenuously avoiding the connotations it has as a synonym for error, observing that errors are a strict subset of exceptions. (At least, we hope that errors are unusual events!) We therefore define an exception as a rare event which is treated differently to a normal event. This thesis claims that if an event can be defined as being unusual in most contexts, then it can be considered an exceptional event. This definition agrees closely with Levin's for the problems presented in this thesis.

[1] But note that in systems where the probability of an event is not known a priori, such structuring may not be possible.
However, it is accepted that in some problems where the context changes, the exception events could become the norm. For example, a system resource-manager may receive one request for almost all its resources, and for the next week it may be running in the exceptional mode of being short of resources. If this situation is judged to be exceptional, then it can still be claimed that there are enough resources for most requests in the normal mode of operation for that resource-manager. If, however, the probabilities of the events are unknown or very variable, then the best solution may be to treat all events in the same way, and many of the design proposals in this thesis are not applicable in such situations.

It transpires that there are several cases encountered in computing practice, particularly in operating systems, which are not generally considered as exceptions, but which also fall into this category under the dictionary definition: for example, program initialisation code, which is executed only once and hence usually does not follow the general rule. Similarly, opening a file before it can be read or written is an essential part of the program execution, but this event occurs with a low frequency compared with the other events on the file, such as reading and writing to the opened file.

An exception mechanism is composed of three parts: detection (which may be by hardware or software), notification and handling. Notification is the act of signalling the occurrence of an exception to interested parties; in a multi-process system one or more processes may be involved. Together, the detection and notification may be considered as drawing attention to the exception event. Exception handling is defined as the execution of some actions in response to the exception notification. All these components of exception mechanisms are discussed in detail in the thesis.

When, then, should we use exceptions?
For example, when the End-of-File has been read, is it treated as an exception or as a normal, but relatively infrequent, event? There is no clear demarcation between these viewpoints, but they may require different programming styles, and may be implemented by different mechanisms which may incur different overheads at compile-time and run-time. Our research addresses these issues in detail, and shows that generally an exception-handling mechanism should be used when a programmer wishes to improve the clarity of the normal-case computation, provided it does not incur unreasonable costs or impair reliability. By using an exception mechanism we attempt to reduce cost by making the normal case simpler and faster [2].

1.3. Goals and Scope of the Thesis

The thesis of this dissertation is that designing systems software to exploit the use of exception mechanisms leads to efficient modular programs which are also easy to write and to understand. Our goals are to show how to design efficient and modular systems programs exploiting exception mechanisms.

This thesis does not include discussion of hardware reliability; this topic has been adequately reviewed in another survey paper [Randell 78]. We do not discuss in detail the exception mechanisms embedded in high level languages, as Black has covered this in his thesis, although these mechanisms are referred to throughout the text. Issues of proof of program correctness are also considered beyond the scope of this thesis, although correctness issues are referenced whenever applicable.

To achieve the goals some general principles for exception mechanisms are needed -- principles which have universal application across languages and systems. The systems we consider are event driven, where the relative probabilities of the events are known a priori. In such systems it is possible to define what is an exceptional event.
But context is not the only factor in designating what is an exceptional event -- the language/system in which computations are expressed also influences what is an exception. Therefore the languages/systems in which exceptional events occur must also be considered. Only then can generalities be made about what constitutes an exceptional event, and how best to handle it.

Thus the first goal is to characterize the nature of exceptions. We do this by considering a wide range of systems problems from different areas of systems design; in particular, event managers such as system servers, communication protocols and data-driven system programs [3]. A general classification of exceptions is then derived.

The next goal is to provide a model for designing efficient and modular software in event driven programs in a multi-process environment, by exploiting exception mechanisms. This goal is achieved by considering three different criteria for program design: minimising the average run-time, minimising the exception-handling time, or increasing the program's functionality in a modular way. A model is derived which provides system design guidelines for these three criteria.

Because of the diversity of languages/systems, the final goal is to provide program paradigms which employ the design principles propounded in the general model, in particular for the system dependent features of the model. The program paradigms may then be used as models for implementation in different environments. To fulfill this goal, several programs are described in detail, and two of the design models which have been implemented, together with their exception-handling mechanisms, are described to show their applicability.

[2] Again with due regard to the fact that, should the operating context change, the previous exception events may become the normal case, with corresponding changes (sometimes a dramatic increase) in run-time.
The operating system Verex [Lockhart 79], a derivative of Thoth [Cheriton 79a], was used to perform some of the research, as was TRIPOS [Richards 79b], implemented on the Cambridge Ring local area network [Wilkes 80]. Later, the UNIX 4.2BSD operating system [Joy 83] was used, implemented on VAX computers and SUN workstations connected over an Ethernet local area network [Metcalfe 76]. Thus the designs were tested on different real systems, to show how requirements of adequacy and practicality were met.

[3] Note that we do not discuss in detail the inner kernel features which are driven by hardware interrupts; our research is concerned with a higher level of abstraction than that provided by a raw kernel.

1.4. Motivation

The major motivation is that problems are encountered in managing the numbers and types of special cases and unusual circumstances when writing software for concurrent systems, and there are few, if any, answers to these problems. As programs grow in size and complexity, special cases and unusual circumstances are compounded, and controlling and checking for exceptions takes more and more effort. With the advent of parallel programming, distributed systems and computer networks, not only the number of exceptions increases, but also the types of possible exceptions. Exceptions can arise within sequentially executed programs, between tasks sharing memory but operating in parallel, and between tasks executing in parallel which do not share any memory.

Research on this subject is needed because there are very few, if any, exception-handling techniques implemented for multi-process programs that preserve structural clarity, and that have simply defined semantics so that the techniques are easy to use and understand.
In support of this claim, the authors of one multi-process system, PILOT [Lampson 80], written in the high level language Mesa [Mitchell 79], comment that:

"Aside from the general problems of exception-handling in a concurrent environment, we have experienced some difficulties due to the specific interactions of Mesa signals with processes and monitors. In particular, the reasonable and consistent handling of signals (including UNWINDs) in entry procedures represents a considerable increase in the mental overhead involved in designing a new monitor or understanding an existing one."

Motivation from my own experiences came when writing software for Verex, a multiple-process operating system. It was desirable to trap errors in programs so that they could then be conveniently debugged instead of immediately destroyed. An error-handling mechanism which optionally diverted program termination by handing control to the interactive debugger was implemented by the author. It transpired that the mechanism could be used to handle more than erroneous situations. In considering just how and where the mechanism would be used, many problems of exception-handling had to be addressed. For example, should the program unit raising the exception be allowed to resume from the place where the exception was raised, or should the exception handler terminate execution of the current program unit? Is it desirable, or even feasible, to use the same mechanism to handle exceptions in sequential code, exceptions between parallel tasks sharing memory, and exceptions between parallel tasks in a distributed system which have no shared memory? How does interactive debugging fit into the scheme? It is the aim of this thesis to answer some of these important issues.
The main contribution of my thesis is in the characterisation of exceptions in the context of multi-process systems software, and in the development of a model and program paradigms which improve system programs by exploitation of exception-handling mechanisms -- either by making the programs more efficient at run-time, and/or by making the program code more modular and reliable. My work also considers the use of exception mechanisms in high level languages in developing systems software and application programs, and attempts to unify the different approaches taken by message-based and procedure-based systems.

1.5. Synopsis of the Thesis

In Chapter 2, exception handling mechanisms in several recent multi-process operating systems are presented. Most of these operating systems use different constructs for exceptional events within a process and between processes. These in turn depend on the character of the process which the operating system supports and on the inter-process communication mechanism. Thus the discussion of exception mechanisms in operating systems includes comparisons of various process and inter-process communication concepts, and also includes a discussion of the various mechanisms available for interactive debugging, as these are often claimed to be useful for exception handling. Two typical problems in operating systems design are then presented, and current solutions given for which exception mechanisms have been used.

In Chapter 3, a taxonomy of exceptions is derived from these examples, including a new class of ignorable exceptions. A model for systems design is then propounded, based on the classification. Many examples are given to illustrate the model.

In Chapter 4, program paradigms in different languages and systems are described for example problems. These paradigms are based on the general model, but they employ system dependent features such as the server-client inter-process exception notification mechanism.
Chapter 5 describes implementations of two systems programs which were designed according to the model, showing the suitability and practicality of the approach.

Chapter 6 concludes with a summary of the thesis, and considers further areas for systems design methodology which use exception mechanisms.

CHAPTER 2

Exceptions in Multi-process Operating Systems

2.1. Introduction

The history of programmed exception-handling has been reviewed by Goodenough [Goodenough 75] and Levin [Levin 77], both of whom made new proposals for exception-handling in high level language design. Since 1977, there has been more emphasis on formal exception-handling in both high level language and in multi-process operating system design. It is worthwhile to make a review of multi-process operating systems, because the nature of their exception mechanisms depends on the process model and on the inter-process communication mechanism supported by the operating system.

A multi-process operating system is defined as a system which provides facilities for multiple concurrent user processes which behave as the active entities of the system. It is possible for one process to create another process dynamically. A process may acquire, release and share resources, and may interact, cooperate or conflict with other processes. In multi-process operating systems the designers have to deal with the problems of synchronisation and deadlock. Synchronisation is needed, both for mutual exclusion (because of sharing) and for sequencing (to impose event-ordering for cooperation among processes).

The survey article Concepts and Notations for Concurrent Programming by Andrews and Schneider [Andrews 83] is taken as a guide to the review. However, Andrews and Schneider confine their discussion to concurrent high level languages whereas we are also concerned with concurrent operating systems.
They divide the synchronization primitives into two groups: those which rely on shared memory, and those based on message passing. The operating systems discussed fall into several different classes, including systems from both the above groups.

(1) The UNIX time-sharing operating system [Ritchie 74] was developed at Bell Telephone Laboratories. It is a multi-user system which allows each user to create multiple processes to run in parallel, and contains a number of novel features such as pipes which have made it very popular. It is written in the high level language C [Ritchie 78]. UNIX-style operating systems have a large kernel supporting user programs through use of system calls. A null system call typically takes 0.6 msecs on a VAX 11/750; a read or write system call takes at least 2.5 msecs. Processes are large and there is a high overhead on process creation. We also consider distributed versions of UNIX such as LOCUS [Popek 81] and Newcastle Connection [Brownbridge 82], where several UNIX hosts are connected over a local area network.

(2) Thoth was designed as a portable multi-process real-time operating system at the University of Waterloo [Cheriton 79a]. The aims of Thoth [Cheriton 79b] were to achieve structured programming through use of many processes, both for concurrent and sequential tasks. Various facilities are provided to make this structuring attractive: small, inexpensive processes, efficient inter-process communication by messages or shared memory, and dynamic process creation and destruction. Thus, the Thoth style is to split certain sequential programs into many cooperating processes. Process switching typically takes 0.5 msec on a SUN workstation. Derivatives of Thoth are Verex [Lockhart 79], [Cheriton 81], and PORT [Malcolm 83].
There are distributed versions: V-system [Cheriton 83a, 83b], where an identical kernel resides on hosts connected over a local area network, and Harmony [Gentleman 83], where multiple 68000 microprocessors are connected by a Multibus.

(3) Object-oriented operating systems such as PILOT, designed at Xerox PARC for the personal computer environment [Redell 80] and written in the language Mesa, rely on shared memory, and use monitors for synchronisation of processes. The overheads on monitor calls are only a little more than for procedure calls, but there is no direct mechanism for general inter-process communication between arbitrary processes. PILOT is a single-language, single-user system, with only limited features for protection and resource allocation. PILOT and Mesa are mutually dependent; PILOT is written in Mesa and Mesa depends on PILOT for much of its run-time support. Thus Mesa monitors were chosen as the basic inter-process communication paradigm for the processes of PILOT, as described by Lampson and Redell in [Lampson 80]. For distributed object-oriented operating systems, a remote procedure call mechanism is appropriate [Nelson 81], [Spector 83].

(4) CAP [Herbert 78] is a multi-user single processor machine built at the University of Cambridge for research into capability-based memory protection [Wilkes 79]. The machine is controlled by a microprogram which provides the user with a fairly conventional instruction set, together with the support of a capability-based addressing and protection scheme. In this protection scheme, the fundamental notions are capability and object. A capability identifies (i.e. names) some object, and possession of a capability allows some access to the object it names. If the object contains executable code it is called a protection domain. The programs of CAP are protected from each other; they are mutually suspicious subsystems (in contrast to those of PILOT).
Most of the CAP operating system is written in ALGOL68C [Bourne 75]. Users may not perform much multi-tasking in the CAP system, but these protected systems present peculiar problems with respect to process creation and destruction, which warrants their inclusion in this discussion.

(5) RIG [Lantz 80] is a distributed message-based multi-process operating system developed at Rochester University. It differs from Thoth-like systems in that it has no shared memory between processes. RIG uses special Emergency messages for exception handling.

(6) Medusa [Ousterhout 80], [Sindhu 84] is a unique operating system designed for a particular multiprocessor hardware configuration, Cm*, at Carnegie-Mellon University. Like RIG, it introduces several novel approaches to exception management and is thus included here.

(7) Remote Procedure Call (RPC) is also discussed briefly, as Nelson in his thesis Remote Procedure Call claims it can be used as a programming language primitive for constructing distributed systems, although it has no special exception mechanisms.

General exception mechanisms are discussed for all these types of multi-process operating systems. Now interactive debugging is often cited as a use for exception mechanisms, so the various mechanisms available for achieving interactive debugging are described next. Exceptions occurring in two typical example problem situations in operating systems are then described.

In many operating systems supporting multiple processes, interaction between processes is often of the server-client relationship. The server process typically manages a shared resource, which is requested by client processes, either locally or remotely. To the client, the request for service appears like a synchronous procedure call; the call returns when the request is satisfied, and the results may be passed as parameters. For example, in most message-based operating systems, the host's files are managed by a file-server process.
If the client processes are viewed as cooperating together, as is the case for a single-user workstation environment, special communication may be required to enhance the performance. The first example discusses the management of files by cooperating processes.

When the server provides access to input-output devices (i/o devices), the desired communication may be more complex than that provided by a synchronous procedure call. For example, the terminal server may wish to communicate to the client that the user has pressed the interrupt key (<BRK>). But the client may be outputting to the terminal at that moment, and may not be reading from the device. Such asynchronous events do not fit into the procedure call paradigm, and alternative solutions to handle such events must be found. The second example describes how different operating systems allow the user to handle such asynchronous events.

2.2. General Exception Mechanisms in Operating Systems

2.2.1. UNIX

2.2.1.1. UNIX Processes

UNIX has a structure similar to the classical 2-level operating systems where the kernel contains nearly all of the operating system code. A user process executes either in kernel mode or in user mode. A UNIX user may have many processes which do not share data with one another.

In kernel mode, a process uses the kernel stack and data space, which is common to all processes operating in kernel mode. Thus for execution of i/o, a user process will switch modes via a system call from user to kernel, allowing access to shared system data. Synchronisation is required between these so-called kernel processes to avoid conflict. This is achieved by various mechanisms such as raising the priority of a kernel process executing critical code to avoid interrupts till it has tidied up, or by setting explicit locks on shared resources.
This simple way of achieving mutual exclusion using a non-preemptive scheduler works for a single processor system but would be inappropriate in a multi-processor system. Kernel processes communicate through the shared data of the kernel. In UNIX Version 7 user processes may communicate by creating a pipe, which provides a buffered stream of characters. If one process attempts to read from an empty pipe it is suspended. Thus pipes are not suitable for reporting asynchronous exceptions such as user interrupts, because of their blocking characteristics. In UNIX 4.2BSD [Joy 83], there are facilities for inter-process communication through use of sockets[1], although there is a high overhead on this communication. For example, a Send-Receive-Reply from one process to another on the same machine through a socket takes 10 msecs on a VAX 11/750.

2.2.1.2. UNIX Signals as Exception Mechanisms

A user process may receive notification of an exceptional event by a signal[2]. A signal is a binary flag, which works like a software interrupt. A process that has been sent a signal will start to execute code corresponding to the signal when it is next activated by the operating system. The default action of all signals is to stop the receiving process.

[1] In UNIX Version 7 sockets are not available, though pipes are available in both UNIX Version 7 and in UNIX 4.2BSD.
[2] Signals are the only direct inter-process communication mechanism available in UNIX Version 7.

The signal mechanism is very simple: each process has a fixed number (16) of ports at which a signal may be received. The user can specify different treatment of a signal by associating a port with a procedure (i.e. a signal handler). A signal is usually generated by some abnormal event, initiated either by a user at a terminal, by a program error, or from another process. One signal is SIGINT, which is usually sent to the process running a command when the user presses <BRK> on the terminal.
The user can choose to ignore the signal, or execute some clear-up code in the signal handler before terminating. One signal, SIGKILL, cannot be ignored or handled. Thus a process receiving SIGKILL is forced to terminate immediately, with no chance to tidy up at all. To prevent a user from arbitrarily killing processes belonging to a different user, the user-id of the process issuing the SIGKILL signal must be the same as that of the target process.

Although the signal mechanism is simple to implement and is very powerful in its generality, it has the disadvantage that, like a hardware interrupt, a signal can occur whenever a process is active, leading to non-deterministic execution. The common way of dealing with this asynchronous event is for the signal handler to set a global flag and then to resume execution at the point of interruption. The process inspects the flag when convenient. This involves two stages in the signal handling, which can lead to errors while handling multiple occurrences of the signal. Furthermore, signals have the same priority, so multiple signals are processed non-deterministically. Finally, a process receiving a signal cannot determine which process sent it, so their use is necessarily restricted to well-defined cases.

2.2.1.3. Intra-process Exception Mechanisms

When a process has received a signal, the signal handler may wish to restore the process to some previously known state. A UNIX process may use the C language library functions setjmp and longjmp, which provide a very basic method for passing over dynamic program levels, equivalent to a non-local GOTO. setjmp(env) saves its stack environment in env for later use by longjmp. longjmp(env, val) restores the environment saved by the last call of setjmp. It then returns in such a way that execution continues as if the call of setjmp had just returned the value val to the function that invoked setjmp.
The feature is really only intended as an escape from a low-level error or interrupt. This mechanism provides no opportunity for procedures which were active but have now been terminated (passed-over procedures) to clear up any resources they may have claimed. However, it is cheap and easy to implement. The mechanism has been enhanced by Lee [Lee 83], who defines macros so that the C language appears to be extended with exception-handling operations, similar to those incorporated into the ADA language [US 80].

2.2.2. Thoth

2.2.2.1. Thoth Processes

Each process belongs to a team, which is a set of processes sharing a common address space and a common free list of memory resources. Each process on a team also has its own stack and code segments, like coroutines. Processes which share no address space are said to be on different teams. Individual processes on a team are globally addressable, and may communicate via messages. Inter-process communication is achieved through fully-synchronized message passing; concurrency is achieved with multiple processes, one for each event, rather than with a non-blocking Send, which would use buffered messages.

2.2.2.2. Process Destruction as an Exception Mechanism

In Thoth, the idea is to have a separate process for each possible asynchronous event such as an i/o interrupt, and to ensure that it executes fast enough to be ready for subsequent interrupts. Each event is synchronous with respect to the process that responds to it. Global asynchrony of the total program is achieved by execution of its multi-process structure. The claim is that the synchrony of each individual process makes the program easier to understand.

All asynchronous communication is handled by process destruction, as follows. First, each process that encounters a fault or exception, or that is the subject of an attention, is destroyed.
Second, if the program is to continue running after one of these conditions, the part to remain is designated as a separate process from the process to be destroyed. Processes which are blocked awaiting messages from, or sending messages to, a process which has been destroyed are awakened immediately; the death of a process is detected synchronously when another process attempts to communicate with it.

2.2.2.3. Intra-process Exceptions

No special mechanism exists for handling within-process exceptions in Thoth, and the Zed language in which it was written [Cheriton 79c] provides no special features for exception-handling. Processes incurring exceptions are destroyed, so their state does not need to be remembered.

2.2.3. PILOT

2.2.3.1. Processes and Monitors in PILOT

PILOT supports two types of processes -- tightly-coupled processes which interact through the shared memory of a monitor (these are similar to the processes on a team in Thoth), and loosely-coupled processes which may reside on different machines, communicating via messages passed over a network. New processes are created by a special procedure activation which executes concurrently with its caller. Such a new process has its own local data, but may also share data with the parent process.

Synchronisation and inter-process communication of these processes is achieved through monitors with condition variables [Hoare 74]. A monitor acts as a basic mechanism for providing mutual exclusion, as only one process may be executing code in a monitor entry procedure at a time. Thus access to a shared resource is usually managed through the entry procedures of a monitor. So a monitor is like a server process in a message-passing operating system. Synchronisation and multiplexing of client processes accessing a shared resource are provided by explicit use of condition variables rather than by queues of messages.

2.2.3.2. Intra-process Exceptions in PILOT

The Mesa language provides extensive exception-handling facilities in sequential code. Exceptions are propagated through the calls hierarchy until a handler is found. The root procedure of a Mesa process has no caller; it must be prepared to handle any exceptions which can be generated in the process. Uncaught exceptions cause control to be sent to the debugger; if the programmer is present he or she can determine the identity of the errant procedure. Unfortunately this takes considerable time and effort. The interaction between the Mesa exception-handling and the PILOT implementation of Mesa processes has been found to be irksome [Lampson 80]. It can be expressed as a limitation of the signalling mechanism that an exception signal cannot be propagated from one process to the process which created it.

2.2.3.3. Inter-process Exceptions in PILOT

No special form of inter-process exception is provided in Mesa. It is difficult to communicate exceptional events using monitors. For example, if a pair of processes are communicating through a monitor and one dies, there is no means for the remaining process to be notified. Instead, a timeout interval may be associated with each condition variable, and a process which has been waiting for that duration will resume regardless of whether that condition has been notified. Interestingly, PILOT was originally designed to raise an exception if a timeout occurred. It was changed because programmers preferred the less complicated special timeout mechanism for simple retry situations, rather than employing the intra-process Mesa exception mechanism to handle such retries.

2.2.3.4. Exceptions within Exception Handlers

A further complication in the interaction between Mesa monitors and the exception handling mechanism arises when an exception is generated by an entry procedure of a monitor. There are two ways of dealing with this.
(1) A SIGNAL statement will call the appropriate exception handler from within the monitor, without releasing the monitor lock (as the monitor invariant might not be satisfied at the time of the SIGNAL statement). This means that the exception handler must avoid invoking that same monitor again, else deadlock will result. Also, if the handler does not return control to the monitor, the monitor entry procedure must provide an UNWIND handler to restore the monitor invariant. Mesa automatically supplies the code to release the monitor lock if the UNWIND handler is present, but if the entry procedure does not provide such a handler, the lock is not automatically released, leading to the potential for further deadlocks.

(2) Alternatively, the entry procedure can restore the invariant and then execute a RETURN WITH ERROR statement, which returns from the entry procedure, thus releasing the monitor lock, and then generates the exception.

However, neither mechanism is checked at compile time, so their misuse is possible.

2.2.4. The Cambridge CAP System

2.2.4.1. CAP Processes and Protected Procedures

CAP supports a hierarchy of processes based on the master coordinator, which is a single protection domain roughly equivalent to the UNIX kernel. Any process can dynamically set up a sub-process by executing the enter-subprocess instruction. A process may call a procedure in another protection domain during execution, if it has a special ENTER capability for it. These procedures, accessible behind ENTER capabilities, are called protected procedures (PPs) and provide both protection and modularity. Each protection domain has its own stack and virtual address space. Changing protection domains is roughly equivalent to a Supervisor call in UNIX, or making an entry to a monitor in PILOT. Protected objects, such as a file directory, are managed by PPs, which may be entered by any process possessing an appropriate capability.
Although it is always entered at the same place, a PP usually examines its private data structures left by the previous call, and makes further decisions on the basis of what it finds. In the CAP operating system, each PP is written as a complete ALGOL68C program, not as a procedure. A special run-time library has been written to allow PPs to have a relationship very similar to a coroutine structure. The domains are fully protected from one another except for capabilities which are explicitly passed as arguments. A CAP process is composed of the various PPs independently compiled, which can then be bound together to form a system image. A certain amount of dynamic linking of new PPs into a process is provided, but creation and deletion of new processes is a non-trivial task (cf. Thoth and PILOT). During execution, the CAP process crosses protection boundaries, behaving somewhat like a UNIX process making calls to privileged system routines and changing its mode from user to kernel. However, the UNIX system is only 2-level, whereas there can be many levels of PP calls in CAP.

The CAP operating system PPs are required to provide several kinds of services, the majority being concerned with gatekeeping [Needham 77] (i.e. control of access to a service or resource). One example is in calls to the master coordinator. In the CAP operating system, critical sections which may not be interrupted are executed only in the master coordinator process. Access to the ENTER-COORDINATOR instruction is pre-checked by a gatekeeper PP called ECPROC, running within the caller's process. Alternatively, some services are provided in dedicated system processes, rather than as PPs in the user's process -- such processes correspond roughly to the classic server process model. Access to the service is controlled by a PP in the user's process. Thirdly, some system-wide services such as message buffers are managed by a PP which has exclusive capabilities for the data structures.
Processes wishing to access buffers must request the PP to do so on their behalf. Inter-process communication is through messages held in buffers which are dynamically allocated from a fixed pool. They are accessed through objects called message channels. A PP exists for setting up message channels; it checks software capabilities and establishes a communication path between processes. A successful call to this procedure results in one or more capabilities for more primitive message passing procedures such as SENDMESSAGE, which is one of the entries to ECPROC, with arguments specifying the (software) capabilities for the message channels. ECPROC performs any transfer of data or capabilities which may be required, and then makes a call to the master coordinator if any scheduling action is required.

2.2.4.2. Intra-process Exceptions in CAP

CAP has a primitive fault-handling mechanism which causes a simulated ENTRY to a particular protected procedure called FAULTPROC in the user's process. FAULTPROC examines the state of the process and takes special action on virtual memory faults and linkage faults, eventually returning to retry the failing instruction. For other faults, it alters the stacked program counter of the faulting PP to a fixed value, and returns to it, having stored useful information about the fault in a fixed place. It also sets a flag to cause a further fault when the original PP itself does a RETURN, similar to the mechanism that is used to propagate asynchronous attentions. The code at the fixed address can examine its environment, including the information stored away by FAULTPROC, and decide what to do. Faults which have not been dealt with by a PP are propagated back to the calling domain, with an appropriate change in semantics on crossing the protection boundary. For example, the fault "limit violation" incurred by a procedure is distinct from "called domain suffered a limit violation and took no corrective action".

2.2.5. RIG

2.2.5.1. Exceptions in RIG

In RIG, internal exceptions (i.e. within-process exceptions) are handled by a procedure call oriented mechanism, based on two library procedures in the implementation language BCPL [Richards 79a]. Errorset and Error allow an error notification to propagate back up the calls hierarchy to a designated point. The program state at the point where the exception is raised is lost. Thus RIG does not provide for program resumption. Errorset is a function that accepts a severity level and a procedure as arguments. Error accepts a severity and error code as arguments. If Error is called, the call stack is unwound to the point of the most recent Errorset with a severity level equal to or greater than the severity of the error. The error code is then returned to the caller of Errorset, which can then attempt to recover from the error. Calls to Errorset may be nested. This mechanism is quite powerful, although, as Lantz himself remarks, provisions should be made for associating handlers with particular exceptions, similar to Levin's or those of CLU.

The RIG inter-process exception mechanism is more unusual, as it employs a new message type -- an emergency message. An emergency message is delivered with highest priority, ahead of any other messages queued for the receiving process, and will cause a blocked process to be awakened. When an emergency message is received, the emergency handler associated with the process is invoked. RIG adopts a server-type process structure for its operating system utilities, in contrast to Medusa which uses a shared object concept. In RIG, a process has to explicitly register interest in another process before it will receive emergency messages from it, whereas in Medusa, users of shared objects are automatically notified of exceptions through the backpointers stored with the shared objects. Emergency messages are typically generated by event-handlers, although they may be generated by any process.
An event handler is a process that is capable of detecting, or will always be informed about, the occurrence of a particular kind of event. In general, a process, PA, must register with an appropriate event handler that it wishes to be notified when a particular event, EPB, occurs in process PB. When EPB occurs, the event handler notifies PA via an emergency message.

One particular event with which all processes are concerned is the death of other processes and machines with which they are communicating. The RIG Process Manager acts as an event handler for suicide, crash or suspension. The Process Manager relies on the kernel to notify it whenever a process dies. The Process Manager then sends the appropriate emergency message to all interested parties.

The relatively simple approach of RIG to emergency handling has several advantages: it is cheap to implement; there is only one handler per process, so the code is easier to read; and there is less non-deterministic program execution than in UNIX. The disadvantages of this approach are that there is only one handler for all emergencies, and the internal exception handling mechanisms provide no means for implementing exception handling in a high level language such as ADA [US 80], so a separate mechanism must be used by each language compiler. Further, there is still some non-deterministic execution on receipt of an emergency message, as the process is awakened from either Send or Receive states.

2.2.6. Medusa

2.2.6.1. Processes in Medusa

In Medusa, cooperating processes, called activities, are grouped into task forces which form the fundamental unit of control. All the operating system utilities are task forces. Processes in both Medusa and RIG communicate by passing messages; in Medusa, processes on the same task force may also use shared memory.
A task force is thus similar to a Thoth team; the difference is that Medusa activities are scheduled to run in parallel, whereas in Thoth, the principle that only one process on a team executes at a time is crucial in the design philosophy.

2.2.6.2. Exceptions in Medusa

In Medusa, whenever an exception is detected, it is sent to a single exception-reporter utility. This central utility acts as a clearing-house for the reporting of exceptions to handlers. Thus although the detection of exceptions can occur at any level, the reporting is encapsulated in a single utility. This provides a uniform reporting mechanism for all exceptions, both inter-process exceptions and within-process exceptions, regardless of their origin. The predefined internal exceptions are divided into about a dozen reporting classes. One class is floating point overflow, another is execution of unimplemented instruction. There are eight other classes for external (inter-process) exceptions. Each activity may nominate a different handler to deal with each of these predefined reporting classes. The exception reporter utility also provides functions for adding user-defined exceptions to this list. We include a discussion of Medusa's internal exceptions here; external exceptions are described later, in Section 4.2.3.

2.2.6.3. Internal Exceptions in Medusa

Four different types of handlers may be used for internal exceptions.

(1) By default, an internal exception is handled by the parent of the activity's task force -- the parent has limited access to the state of the activity and can resume it if desired. This method is commonly used as Medusa's recovery mechanism. For example, many small programs will be invoked from the user's command interpreter and will not deal with exceptions at all; the command interpreter will kill the exception-generating task force and will output a message on the user's terminal.
(2) The handler may be specified as an out-of-line handler, which is equivalent to the address of an interrupt routine. This type of handler is expected to be used as an entry into the reporting mechanisms of high-level languages, which may then propagate exceptions through abstractions in programs.

(3) An in-line handler is invoked only when the activity checks explicitly for an exception occurrence -- no special report of the exception is made.

(4) Medusa also provides the ability for an activity to name a buddy activity in the same task force to handle any specified exception class. When the exception occurs, the buddy receives access to all the private information of the exception-generating activity, which is suspended. Thus the buddy can be used to help in interactive debugging, and in many other situations where remote handling is essential for recovery.

One handler may also activate another handler if it is unable to deal with an exception (e.g. an out-of-line handler may activate the task force's parent, by defining extra exceptions to the exception reporter utility).

The notion of a buddy activity to handle exceptions is a departure from the procedure-oriented mechanisms previously discussed. Its chief advantage is that the handler can reside on a separate processor, thus protecting the handler code from the errant processor. There is also a saving of the space occupied by the handler on the errant processor, and it is useful for recovery operations after certain computer limitation exceptions such as stack full.

2.2.7. Remote Procedure Call

RPC has been proposed as a high level language mechanism for the synchronous transfer of control between programs in disjoint spaces whose primary communication medium is a narrow channel. Nelson details two designs for RPC in his thesis [Nelson 81].
The first assumes full programming language support and involves changes to the language's compiler and binder, while the second involves no language changes, but uses a separate translator -- a source-to-source RPC compiler -- to implement the same functionality. The RPC paradigm is suitable for tasks with master/slave relationships. However, RPC offers only that subset of general inter-process communication provided by message-based operating systems involving hierarchic control structures; the maintenance of a dialogue between peers having symmetrical control relationships is not easy. This is because in an RPC environment, control is passed from the requesting process when the server process is called, and is returned when the server has completed processing the request. The concept of asynchronously interrupting an executing server process is counter to the RPC paradigm. RPC alone is not sufficient for conveniently programming in a distributed computing environment, because the server cannot make out-of-order replies to clients. (By out-of-order we mean that at any instant a server may have accepted more than one client request, and the server makes a reply to the least recent client request first.) RPC is the only inter-process communication mechanism in the concurrent high-level language ADA [US 80] (to which references are made throughout this thesis).

2.3. Mechanisms for Interactive Debugging

Interactive debugging operations such as step-by-step traces are often cited as a use for inter-process exception mechanisms. However, programmed exception handling cannot usually be used for interactive debugging, because access to the local variables of a routine by another part of a program or by another process is usually prohibited by scope rules. Thus special mechanisms must be employed for interactive debugging. Several of these mechanisms are now described.
Many commercial time-sharing systems provide a debugger, such as DEC's VAX-11 DEBUG utility [Beander 83], which uses special mechanisms. VAX DEBUG is strictly an object program debugger which has access to the symbol table of the user's compilation units. DEBUG uses VAX/VMS exception handling mechanisms to provide the needed control over user programs, such as stopping execution at breakpoints. The main feature of this mechanism (described fully by Beander) is that breakpoints are converted to intra-process exceptions which are caught by DEBUG and handled by the debugger instead of being propagated up the calls hierarchy.

UNIX's software interrupt mechanism has been extended to provide a powerful but inefficient mechanism whereby a parent process can monitor the progress of one or more child processes. These tracing facilities can be used for interactive debugging and include the ability for the parent to inspect and even modify values within the data area of the child process. The child is traceable only if it gives its permission by explicitly executing a system call. Then, every time the child process encounters a software interrupt, the parent process is given the opportunity to intervene. The child may be blocked indefinitely if the parent ignores it. A better scheme would involve a more efficient transfer of control between parent and child, and would also only operate if both parent and child mutually agreed tracing should commence.

In Verex we have developed a unique process-oriented exception mechanism, which relies on the existence of an exception-server process. Whenever an active process incurs an execution error (i.e. an error detected by the hardware, such as illegal instruction or illegal address), it is blocked and a kernel routine is invoked which forwards the message buffer to the exception-server as if it came directly from the offending process.
The exception-server can inspect the status of processes that send messages to it, and can detect those that have incurred errors (the operating system sets a flag). The exception-server has the power to control the offending process -- either by destroying it, or forwarding it to any other interested parties, such as the debugger. The first handler for all errors is the exception-server; to avoid bottlenecks, the exception-server creates a separate team to handle each exception request. At present, the team created has full power to pry into user and system data areas. The advantage of this approach in a distributed system is that it is often easier to handle an exception condition at a point that is logically distinct from the site where the exception was raised. The Verex error handling mechanism could obviously be extended to handle exceptions as well as errors, by programming a RAISE statement to send an appropriate message to the exception-server. At present we use this mechanism only to terminate or debug the erroneous process.

In Medusa, the mechanism for debugging and tracing, called MACE, is separate from the exception reporting mechanisms already described in Section 2.2.6. MACE executes as a single task force with just one activity, and it is located on a dedicated LSI-11 processor. MACE has its own simple terminal i/o and exception handling facilities, so it need not rely on any of the other utilities; it is therefore almost crash-proof. The other utilities notify MACE of breakpoints using a pipe process; MACE can restart such broken activities when the operator desires.

The author implemented an interactive remote debugger [Atkins 83a] for programs written in BCPL running on machines connected over a Cambridge Ring [Wilkes 80]. Most of these machines have the Ring as their only peripheral, and use it to communicate with terminals and discs.
Hence a resident debugger is of limited use, as many machine crashes will cause contact with the terminal to be lost, rendering the debugger inaccessible just when it is most needed. However, as a computer's Ring interface allows a remote machine to read and write words of its memory, and halt, start and reset it, it is possible to move the debug program to another machine. The debugger can inspect and modify data structures directly; simple communication can be achieved by polling fixed memory locations in the target machine. This relies on very little working software in the target machine -- just enough to signal that an abort has happened and issue a message. The remote debugger acts interactively to read and update memory in the remote machine, which may be unaware that it is being examined. It also handles traps and aborts in the remote machine. It is thus a multi-event program, awaiting either a character from the user at the keyboard, or a signal at the remote machine. These events are asynchronous. This debugger is very successful, allowing new machines to be installed on the Ring in only a few days. Standard coroutines are used to implement the debugger because of their low set-up cost and low run-time overhead on coroutine switching. However, special mechanisms are used for signalling traps and breakpoints.

In Mesa, any exceptions which have been propagated up the calls hierarchy to the root procedure are then passed to the debugger, as described in Section 2.2.3.2. Thus the only operating system which uses the same mechanism for debugging and exceptions is Mesa; and it has been observed that in Mesa, the major reason for raising an exception is to invoke the debugger, and thus the exception handling facilities embedded in the language are secondary to the use of exception detection.
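The polling-based communication used by the remote Ring debugger described above can be sketched as follows. This is a minimal illustrative simulation, not the debugger's actual code: the word addresses, names, and message encoding are invented for the sketch, and remote memory access stands in for the Ring interface's read/write operations.

```python
# Sketch of polling fixed memory locations in a target machine. The target
# needs only enough resident software to set an abort flag and a message
# word; the remote debugger polls those fixed locations over the "Ring".

ABORT_FLAG = 0x10   # hypothetical fixed location: nonzero after an abort
MSG_WORD = 0x11     # hypothetical fixed location: abort message code


class TargetMachine:
    """Simulates a machine whose Ring interface allows remote word access."""

    def __init__(self):
        self.memory = [0] * 256

    def read_word(self, addr):
        return self.memory[addr]

    def write_word(self, addr, value):
        self.memory[addr] = value

    def crash(self, msg_code):
        # The minimal resident software: record that an abort happened.
        self.memory[MSG_WORD] = msg_code
        self.memory[ABORT_FLAG] = 1


def debugger_poll(target):
    """Remote debugger: poll the fixed abort-flag location in the target."""
    if target.read_word(ABORT_FLAG):
        code = target.read_word(MSG_WORD)
        target.write_word(ABORT_FLAG, 0)  # acknowledge the trap
        return "trap: code %d" % code
    return None


machine = TargetMachine()
assert debugger_poll(machine) is None  # nothing to report yet
machine.crash(42)
assert debugger_poll(machine) == "trap: code 42"
assert debugger_poll(machine) is None  # trap acknowledged, flag cleared
```

The target machine takes no active part in the exchange, which is why it can be debugged even when most of its software has failed.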
Thus in general, the mechanisms for debugging are not suitable for programmed exception handling, and so the thesis is concerned mainly with the general exception mechanisms described in the previous subsections of Section 2.2.

2.4. Exception Handling in Action

2.4.1. Inter-process Cooperation

An interesting approach to how servers may notify clients of exceptional events is described by Reid et al. [Reid 83] for the Mesa file system. In the Mesa file system, client processes are viewed as cooperative, and to achieve this cooperation they support file-sharing. If one process wishes to use a file in a way that conflicts with the way that a second process is using the file, the process that is using the file may be asked to relinquish it. For example, if a process wants to write a file being read by another process, the process that is reading the file is asked to stop. Also, a process may ask to be notified when a file becomes available for a particular use. However, the processes that share files need neither communicate explicitly, nor know one another's identities. We assume that such notification does not happen frequently, and hence can be regarded as an exceptional event. Cooperating processes are used in the design of the Xerox Development Environment, an integrated multi-window programming environment supporting sophisticated tools such as windows that load themselves with the new version of an error log each time it changes.

To achieve the notification, clients provide call-back procedures, by passing them as procedure parameters to the file-system monitor (which acts like a server). For example, a client can ask the file system monitor to notify it whenever a file becomes available for some particular access, as the client might want to be awakened when the file is available so it can try again.
The procedure AddNotifyProc is called to register such a request with the file system, and the procedure RemoveNotifyProc is called to remove it. When the file system determines that the conditions have been satisfied, it calls the NotifyProc passed in as the parameter. The system has to guarantee that when a client is notified, it can indeed acquire the file for its desired access.

There is nothing new in using procedure parameters to provide call-back facilities from subprogram to caller [Black 82]. However, the novel way this has been used in providing inter-process communication for a monitor-based operating system is worthy of further examination. The main use of this mechanism is to allow cooperating clients to execute a PleaseReleaseProc whenever the file-server with which they have registered a NotifyProc needs to use the same file for another process. However, the authors state that there are difficulties with this mechanism. First, the client must be prepared to have its call-back procedures invoked at any time. This may cause subtle synchronisation issues in the inter-process communication between the client, the file system, and (indirectly) other clients. Second, there are difficulties inherent in writing multi-process programs; as a means of communication, the call-back procedures expose these difficulties. Note that clients need not master the subtleties of call-back procedures to use the file system. They can choose instead not to cooperate in their use of files, using a system-provided PleaseReleaseProc that always returns no. Often, tools are first written with little or no cooperation and they gradually evolve to allow more cooperation. Third, since many clients may be calling it simultaneously, the file system must lock some of its internal data structures while it calls the client-provided PleaseReleaseProc or NotifyProc.
Although this locking is essential to preserving the consistency of the data structures and to providing some guarantees on the file system's behaviour, it means that there are file system operations that cannot be invoked from a PleaseReleaseProc or NotifyProc without causing deadlock. Therefore, some of the file system procedures may not be called from within a PleaseReleaseProc; these include Release. If the PleaseReleaseProc calls one of these procedures, the process will deadlock on the file system's monitor for that file. If it must call one of these procedures, it must fork another process to perform the call and must not wait for that process to complete, since the process will not complete until the PleaseReleaseProc returns. The return value later from a PleaseReleaseProc may indicate that a process has been forked that will release the file.

We consider how to implement such a set of cooperating processes in the Thoth environment. As servers cannot Send to clients which have made Send requests to them (by convention, for security and deadlock-avoidance [Gentleman 81]), some other means has to be provided for notifying clients of the exceptional event PleaseRelease. The drastic approach would be for each client to nominate another process, the victim, which the server kills when it wishes to notify a PleaseRelease request. A fourth process, the client's vulture, awaits the death of the victim (usually by executing a Receive Specific from it), and is notified by the operating system when this occurs. The vulture can then notify the client by a Send request. This arrangement is shown pictorially in Figure 2.1. The disadvantage of this mechanism is that the vulture and victim processes may be required in certain applications merely to provide exception notification to the client. An advantage of this mechanism is that the server does not block when it kills the victim, so no deadlocks occur during the message exchange.
Another advantage is that the client receives the notification (via the vulture) when the client executes a Receive -- thus the notification is synchronous with respect to the client's execution. Furthermore, it always works, and its use is now well-understood in writing systems code.

[Figure 2.1. Thoth Server-client Notification using a Victim Process. 1. The server destroys the victim; 2. the operating system awakens the vulture; 3. the vulture informs the client; 4. the client may request more information; 5. the client reinstates a new victim; 6. the client informs the server about the new victim.]

If the victim process can serve another role, such as a timer or worker whose state is not needed if the exception occurs, the overheads are not too great. When the client receives a message from the vulture that the victim has died, the client must recreate the victim (for future notifications), Send to the server to establish the new victim, and Send to the server for more details (as the victim's death serves only as a binary signal of the event's occurrence), before finally executing the PleaseReleaseProc code.

The Thoth solution described above requires the following process switches.

(1) For the server to destroy the victim, the equivalent of 2 process switches (e.g. in the Verex implementation, the server does a Send to the Team-invoker who takes appropriate action before making a Reply to the server, although in the PORT operating system this is achieved by a kernel call taking the equivalent of <1 process switch).

(2) For the vulture to notify the client, 2 more process switches.

(3) For the client to recreate the victim (if necessary), the equivalent of 4 process switches (e.g. in the Verex implementation, the client does a Send to the Team-invoker who makes a request of the Team-creator before replying).
(4) Then the client has to notify the server of the new victim, and, after receiving the reply, has to request more information from the server about the nature of the exception. This takes 4 more process switches.

Therefore this mechanism requires the equivalent of 12 process switches (including Destroy and Create), which could take 6 msecs (assuming the Send-Receive-Reply sequence takes 1 msec).

In the UNIX environment, the client could be notified asynchronously that an exception had occurred, by a UNIX signal from the server. However, there are only 16 different signals, and their use is by convention pre-set to certain exceptions (as explained previously), so their use in this situation would not be encouraged. An alternative in UNIX 4.2BSD would be to use an asynchronous Send from the server to the client via a datagram socket. This in theory does not block the server, so deadlock should not occur. As for the previous example, the client process would receive the notification synchronously when it executed a read from that socket. The client process could then execute the PleaseReleaseProc code.

At first sight the UNIX implementation seems simplest, but there are major drawbacks. Firstly, the asynchronous Send does block when the i/o buffers are full, which can occur when communication is over a slow network, leading to potential deadlock. Secondly, because of the nature of UNIX processes, and the asynchronous nature of the inter-process communication, the time for a Send-Receive-Reply is slow (10 msecs on a VAX 11/750), and the system call to a socket is slow (2.5 msecs). Thus the UNIX 4.2BSD notification (using a socket) takes 12.5 msecs⁴.

An alternative in Thoth is to keep the client process always listening to the server. The server then replies directly to the client.
If the client has to be ready to act as a server itself to higher-level clients (often the case in a hierarchy of servers) and accept requests, the client must spawn a worker on the same team to await notification messages from the server. This may mean that the server makes an out-of-order Reply to the worker. This technique is commonly used and reduces the number of process switches compared with the vulture-victim mechanism. The process switching is now simply from the server to the worker, and from the worker to the client. This takes only 1.5 msecs on a SUN workstation, much faster than the UNIX approach.

If we provide the worker with the code of PleaseReleaseProc, we can call the worker an exception process and possibly even eliminate the need for the client to be told of the exception. In this case, the number of process switches would be still further reduced to two, whereby the server switches to the exception process and back again. This is the same as the UNIX situation using an asynchronous Send to a datagram socket. The separate exception handler process so described has the advantage that the exception handler code is separate from the normal-case code in the client process. Furthermore, the context of the client code does not have to be saved to execute the exception code, as the client process keeps its separate existence during the exception process's execution. Another apparent advantage is that increased concurrency is possible. This solution is very attractive, but it is fraught with problems in real applications, particularly in the synchronisation of the exception process and the client. This is discussed fully in Section 4.3 on Exception Handlers as Processes.

⁴ This is measured as the time for a datagram transfer of 32 bytes from server to client if the client is ready to receive.

In RIG, an emergency message from server to client can provide the notification, like a socket in UNIX 4.2BSD.
However, execution of the client is non-deterministic, as receipt of the emergency message causes a change of flow of control if the client is either sending or receiving. If the client is executing a Receive at the time, the emergency message arrives synchronously. However, if the client is executing a Send at the time, the emergency message arrives asynchronously. Thus the degree of non-determinism falls between that of the Thoth and UNIX 4.2BSD models, and the Mesa implementations. The same difficulties with potential deadlocks and difficulty of synchronisation will occur as with the Mesa and Thoth approaches.

2.4.2. Asynchronous Events -- Terminal Interrupt

One common problem is in defining what should happen to an executing program when an asynchronous event occurs, such as an attention or an i/o interrupt. When a terminal user presses <BRK>, what happens depends on what the program specifies should happen, and on what mechanisms are available. For example, an editor might specify that typing <BRK> will cause an exit to the program which called it only if all the user's files have been saved; otherwise it asks if that is really what is meant.

In UNIX one of the software signals (SIGINT) is used to interrupt all the client processes which are using the terminal when the user hits <BRK>. Many client programs do not catch interrupts, and are killed. However, a process can arrange to catch interrupts and take appropriate tidy-up actions before either continuing or not.

In the Thoth environment, process destruction is used to handle <BRK> by specifying each team to have an associated terminal server. Each such server provides a facility whereby a client process may nominate a victim process which is to be destroyed on input of <BRK>. The victim is destroyed immediately. Alternatively, the client can specify no victims, can make itself unbreakable, and can field the <BRK> as it wishes.
Generally all processes on a team are breakable, so an interrupt on the team's terminal destroys the entire team. The chief advantage of this mechanism is that it is simpler than UNIX-like signals, as the user doesn't have to worry about a process's internal state after the event -- it doesn't exist! Furthermore, a list of processes blocked on others has to be maintained by the operating system for implementation of the message-passing primitives, so little extra effort is required to implement this destruction mechanism. However, it is necessary to maintain the integrity of resources which are owned by processes which are destroyed, either by a list in kernel data, or by garbage collection, or by timeouts. The Thoth solution is to provide garbage collection of orphaned resources, or to make some processes (such as servers) unbreakable. Runaway unbreakable processes can always be destroyed by an appropriately authorised Destroy command, like the UNIX KILL command. Unlike UNIX, any other processes blocked on a destroyed process are awakened immediately. Thus the Thoth process destruction provides a general mechanism suitable for communication of asynchronous exceptional events.

This mechanism has been used in the Thoth text editor to provide a very simple means for handling both <BRK> and other exceptions [Cheriton 79b]. In this scheme, the editor is split into two processes on the same team. The process Main holds all the data structures, such as the text buffer, which must survive a <BRK>. The other, Editor, process implements the functionality of the program, viz. reading and performing user commands. The Main process is unbreakable while the Editor process is breakable, so only the Editor is destroyed on a <BRK>. Because the Editor process executes the commands, <BRK> causes the execution of the current command line to stop. The text buffer remains consistent because it is maintained by the Main process.
The Main process detects when the Editor process has been destroyed and creates another Editor process, which proceeds to read a command line from the input. Thus, the handling of <BRK> appears to the user as that of the editor immediately returning to command level. This structure is also exploited in handling exceptions. On detecting an exception, the Editor process destroys itself after leaving a pointer to an error message in a global variable indicating the exception. The user is notified of the exception by the next incarnation of the Editor process, which prints the error message. Full details are given in [Cheriton 79b].

In PILOT, communication with peripheral devices is handled by monitors and condition variables, much like communication between processes. Each device has an interrupt handler process which is awakened by a signal on a condition variable. The target (user) process receives notification of the event from the interrupt handler via another condition variable. User interrupts from a terminal are treated as exceptions (usually the ABORT exception) which the user can choose to handle or ignore in the usual way. Thus there is no way to stop a runaway process except to reboot the system -- in a single-user environment this is an acceptable approach. A similar attitude is taken by the designers of TRIPOS, a single-user operating system designed at Cambridge University [Richards 79b].

In CAP, an asynchronous event such as an attention may be signalled via the master coordinator. The target process is notified of the event by suffering an enforced jump. There is just one attention handler active at any time in a CAP process, which is entered immediately an attention occurs. The operating system allows the user to trap all attentions by specifying an attention handler, which may either perform tidy-up actions on the current domain and then execute a standard RETURN to the calling domain, or may attempt to execute a CLEAR on the attention.
The operating system sets a RETURN-TRACE flag so that whenever the currently executing domain executes a standard RETURN, the calling domain is also notified of the attention in exactly the same way. During the execution of the tidy-up operations, other protected procedures can be entered; the CAP operating system notes that these protected procedures are in a special state and disallows recursive calls to the attention routines through multiple attentions.

Attentions are of different degrees of severity, and different capabilities are required to clear them. The attention is propagated to the calling protection domain if the attention routine in the current protection domain cannot clear it. The calling protection domain then has a chance to handle the attention. The protection domain STARTOP, the initial protection domain in every user process, has a capability which allows it to clear every attention. By default, an attention will terminate a user's job and return to initial state.

The advantages of this mechanism are that it allows the user to trap all faults and that it copes with multiple attentions. The disadvantages are that it allows non-deterministic execution, and that it cannot be used to stop runaway processes which loop during execution of attention handlers. And as the attentions are arranged in a simple hierarchy of levels, the mechanism has proven to be somewhat inflexible for handling different types of asynchronous events encountered in computer communications and in distributed environments.

2.5. Summary

The survey of the various operating systems shows the great diversity of exception handling mechanisms. Purely procedure-based mechanisms, used extensively before about 1977, have disadvantages when applied to distributed systems because of the assumptions of shared memory.
For handling exceptions in distributed operating systems, RIG uses a mixture of procedure-based and message-based techniques; RPC uses a procedure-based technique, and Medusa uses a process-based technique.

The examples show that the mechanism for notification of exceptional events (such as the asynchronous event of <BRK>) is system dependent, and varies according to the process model and the inter-process communication facilities provided by the operating system. For notification of <BRK>, UNIX uses its inter-process exception mechanism to provide a software signal which forces a control flow change in the target process. Thus the <BRK> exception is asynchronous in this environment (where a synchronous exception is defined as one where the process is in a state ready to receive it). Thoth uses immediate destruction of the process awaiting that event, and other processes receive notification when they attempt to communicate with the destroyed process. This is therefore treated as a synchronous exception. The PILOT handler for each device uses monitors with condition variables to notify user processes of device interrupts; again this is a synchronous exception. In CAP, i/o device interrupts force an asynchronous transfer of control to an attention routine in the currently executing process, similar to a hardware interrupt. Of the mechanisms aforementioned, only Thoth's can be conveniently extended to a distributed environment, as the others rely on shared memory for their execution. In RIG, an emergency message is used to notify the target user process: this message unblocks the target either synchronously or asynchronously. In all cases, the default action of the operating system is to destroy the target user process (or processes) to which the terminal was connected.
Thus it cannot be stated in isolation that X is a good technique for exception handling, or that Y is a good technique for exception notification; exception mechanisms must be considered in the context of the process model and the inter-process communication facilities available.

CHAPTER 3

Principles for Improving Programs through Exception Mechanisms

3.1. Classification of Exceptions

The discussions and examples in the previous chapter illustrate several features of exceptions encountered by event managers, and these features are used to derive an informal classification of exceptions, as shown in Figure 3.1 below. Exceptions have been divided into three main groups, according to where the exception is handled.

[Figure 3.1. Level 1 Classification of Exceptions]

(1) Server-client exceptions. These need to be communicated from the server to the client (i.e. contrary to the usual inter-process communication from client to server), in order to be handled by the client. The exception must be communicated to the client at the layer above, possibly after the server has handled it and transformed it. When the client and server form a uni-process system, such exception notification is called call-back notification. When the client exists as a separate process, we call server-client exceptions inter-process exceptions.

(2) Peer exceptions. These exceptions are detected and handled at the same level, and are typically represented by the less frequent arm of an if-then-else branch. So a peer exception detected by a server during processing of a normal request event is handled entirely at the server's level and is therefore transparent to the client at the layer above. If the server consists of a single process, peer exceptions may be conveniently caught and handled using existing mechanisms for intra-process exceptions.
In a multi-process system where the server may use related worker processes at the same level to handle specific events, the server may use a cooperating exception process to handle the exception. An example is described in Section 5.1.5, where the ITI protocol's exception process handles reconnection after a network failure, transparently to its client process at the level above. Still further distribution of peer exception handling can be achieved, by arranging that a peer exception is handled by a process on a different host implementing the same level of abstraction as the server. For example, in a single level of a communications protocol, a checksum error on data received from host B may be detected by the link level protocol on host A. The link level protocol on host A handles this exception by requesting a retransmission from the host B link level protocol; host A's client at the network level above is unaware of this retransmission.

Therefore there are two different types of exceptions: those which occur at a level below the code which is to handle them (i.e. server-client, or inter-process exceptions), and those which occur at the same level as the handler code (i.e. between cooperating peers). An analogy with the structured layered approach to communications software, as displayed by the ISO Reference Model of Open Systems Interconnection (OSI) [Zimmerman 80], shows that server-client exceptions are part of the layer interface and that peer exceptions are part of the peer protocol. The communication peer protocols are tightly specified, whereas the interfaces are not, as they are system-dependent.
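The link-level retransmission example above can be sketched as follows. This is a minimal simulation under invented names, assuming a trivial additive checksum and keeping only the last frame for retransmission; it is not a real link protocol. Its point is that the peer exception (the checksum error) is detected and handled entirely between the two link-level peers, so the layer above sees only correct data.

```python
# Sketch of a peer exception handled at the link level: host A detects a
# checksum error and asks its peer on host B to retransmit, transparently
# to A's network-level client. Names and the protocol are illustrative.

def checksum(data: bytes) -> int:
    """Trivial additive checksum, standing in for a real one."""
    return sum(data) % 256


class LinkPeerB:
    """Host B's link level: keeps the last frame so it can retransmit."""

    def __init__(self):
        self.last_frame = None

    def send(self, data: bytes, corrupt: bool = False):
        self.last_frame = (data, checksum(data))
        if corrupt:
            # Simulate line noise damaging the frame in transit.
            return (data, (checksum(data) + 1) % 256)
        return self.last_frame

    def retransmit(self):
        return self.last_frame


class LinkPeerA:
    """Host A's link level: detects the peer exception and recovers."""

    def receive(self, peer_b: LinkPeerB, frame):
        data, ck = frame
        if checksum(data) != ck:              # peer exception detected...
            data, ck = peer_b.retransmit()    # ...handled at the same level
        return data  # the network level above sees only good data


b = LinkPeerB()
a = LinkPeerA()
assert a.receive(b, b.send(b"hello")) == b"hello"
assert a.receive(b, b.send(b"hello", corrupt=True)) == b"hello"
```

In both calls the client of host A's link level receives the same correct data; whether a retransmission occurred is invisible at that level.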
Thus a starting point for the classification of exceptions has been taken along the dimension of where they are handled -- whether an exception is an inter-process exception from server to client, where the exception is specified as part of the interface (the exact nature of which is system dependent), or whether it is from peer to peer, in which case the exception and its handling can be specified as part of the protocol.

(3) Request exceptions. For completeness one more type of exception is distinguished: that of an unusual request from a client to a server. This is a request event which occurs with a low frequency, say <10% on average; such events represent an infrequently used operation in an interface. For example, the operating system i/o servers (or monitors) are driven by events such as requests for i/o from the clients, and by interrupts or messages from the i/o devices signalling i/o completion (a call-back notification). Some of these events will occur much more frequently than others -- e.g. the user request to open a file for i/o will occur only once, whereas subsequent read/write requests to the same i/o server will usually occur many times. It is useful to consider the user's OPEN-FILE request to be an exceptional event from the server's point of view, so that the server can use appropriate code to detect it. Such a rare request is a request exception, and the server's reply is communicated synchronously to the client, which may or may not perform further exception handling.

3.1.1. Inter-process Exceptions

This thesis concentrates on inter-process exceptions, as the linguistic and operating system mechanisms for managing inter-process exceptions are very diverse (as described in Chapter 2). We now show that inter-process exceptions can be further classified along three dimensions, illustrated in Figure 3.2.
Server-client communication is in direct analogy with the intra-process exception mechanisms embedded in high level languages, where a handler may be attached to a procedure call at a higher level in the program unit. An exception occurring in a procedure is usually propagated up the procedure call hierarchy until a handler is found for that exception. Similarly, an inter-process exception is detected by the server and transferred up the process call hierarchy (i.e. from a server to its client, or from a remotely called procedure to its caller, or from an inner nested monitor to its enclosing monitor). For example, a client may request to open a file of unknown name. The file-server or monitor detects the exception FILE-NOT-FOUND while trying to open the file -- this is then communicated to the client, which handles it by taking appropriate action such as prompting its user for another filename.

[Figure 3.2 Level 2 Classification of Inter-process Exceptions: the dimensions include asynchronous/synchronous notification.]

Exploiting the analogy with high level languages further, we assume that an inter-process exception mechanism would allow exceptions to be propagated back up the abstraction hierarchy just as the linguistic exception mechanisms allow exceptions to be propagated up the procedure call hierarchy. Such inter-process call-back mechanisms are available in most operating systems, as described previously in Chapter 2. Now in [Liskov 79] the view is held that uncaught exceptions within a process should cause the signaller to terminate. This approach also coincides with that of Bron [Bron 76] and the ADA designers [US 80], who state that the program unit raising an exception signal must be terminated. Therefore we would expect that a structured inter-process exception mechanism would terminate the process which detected and raised the exception.
But the analogy with the intra-process linguistic exception mechanisms breaks down here, because one of the purposes of a server or monitor is to handle multiple concurrent requests, so the server cannot be arbitrarily terminated [1]. Instead, the server process detecting the exception should return to its usual blocked state awaiting another event. Similarly, if a monitor detects an exception, the monitor should ensure its locks are released and the monitor return executed, before the exception is raised at the higher level. Thus it is noted that the signaller of an uncaught inter-process exception needs to be resumed, not terminated, after raising the signal. These points of view can be merged by stating that, for single-process systems, the only handlers which resume operations at the point where the exception was detected (excluding flag-setting handlers) should be invoked by exception signals which would evaporate if the exception signal were not caught. In other words, allowing resumption after an exception signal is using it like an ignorable information signal. In contrast, for multi-process systems where the server continues after signalling an inter-process exception, the exception can be either ignored or not. This leads to the definition of three major dimensions on which these inter-process exceptions can be placed, already illustrated in Figure 3.2.

(1) One dimension is the degree of necessity of handling: whether the exception is ignorable (such as the PleaseRelease exception), or whether the exception must be handled for correct operation of the system (such as the <BRK> exception). An ignorable exception is defined in this thesis as one which, if uncaught, would just evaporate [2] (i.e. the signaller would resume without delay).

[1] Unless a separate server instance is used to service each request.
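The resume-not-terminate behaviour can be sketched in a few lines. This is a hypothetical illustration (the request names and the file dictionary are invented, not from the thesis): the server signals a FILE-NOT-FOUND inter-process exception to its client and then simply resumes its event loop, rather than terminating as the ADA/CLU-style intra-process mechanisms would require.

```python
def run_server(requests):
    """Event-driven server sketch: on an exceptional request it signals the
    client and resumes; the server process itself is never terminated."""
    files = {"notes.txt": "hello"}            # hypothetical server state
    replies = []
    for op, name in requests:                 # stands in for the perpetual event loop
        if op == "open" and name not in files:
            # signal the inter-process exception to the client...
            replies.append(("exception", "FILE-NOT-FOUND", name))
            continue                          # ...then resume: await the next event
        replies.append(("ok", op, name))
    return replies

out = run_server([("open", "notes.txt"),
                  ("open", "ghost.txt"),      # exceptional event
                  ("read", "notes.txt")])     # still served after the exception
```

Note that the exceptional request neither kills the server nor blocks later clients; the exception reply is just another message.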
Now of the high level language exception mechanisms embedded in the languages ADA, CLU and Mesa, only Mesa allows resumption of the signaller, yet in Mesa, uncaught exceptions are transformed into an invocation of the remote debugger. Thus in uni-process systems, ignorable exceptions are not conveniently handled by any existing high level language mechanisms. However, in a multi-process system, as noted previously, the signaller needs to be resumed after communicating an inter-process exception. Thus ignorable exceptions take on a new significance in a multi-process environment, because the desired semantic of the signaller being allowed to resume after signalling an inter-process exception is already supported by systems servers. Furthermore, uncaught inter-process exceptions are already ignored in many systems. For example, in UNIX, a SIGNAL to a non-existent process has no effect on the signaller, and has no effect on the system -- it is safely ignored.

[2] Note that this definition means that errors cannot be classed as ignorable exceptions, unless, during program development, we temporarily wish to ignore uncaught errors.

(2) Another dimension is the number of interested parties to be notified; an exception may need to be communicated by a server to several clients (such as PleaseRelease), or to one client (such as <BRK>). For convenience, we distinguish just two cases: inter-process exceptions which may be broadcast to potentially many clients, or sent to one specific client. In UNIX, a SIGINT signal can be sent to ALL the processes in a group, or to one specific process. The processes wishing to receive a signal join a group in order to receive a broadcast inter-process exception; these processes are cooperating together to perform some task.
Thus exceptions which are broadcast by a server to several clients are intended to develop further cooperation between the clients; as such, the broadcast exceptions may be handled differently from 1:1 inter-process exceptions, adding another dimension to the classification of exceptions.

(3) Finally we consider the notification of an inter-process exception to a client -- if the notification is asynchronous (i.e. if the client is not in a state to receive the exception by having an outstanding request to the server), then it is handled differently than a synchronous inter-process exception (in direct response to a request). This feature is part of the server/client interface and is system-dependent, as asynchronous communication usually requires special communication mechanisms. Thus another dimension of inter-process exceptions is the method of notification.

3.2. Model for Systems Design

With the above classification of exceptions in mind, we are now ready to develop a model for systems design which exploits the dichotomy between normal and exceptional events. The model is derived from the author's experiences in designing and implementing communications software and software utilities. The first observation made by the author is that many systems and communications programs which retain little state between events are suitable for an event-driven approach, where the programs take the form of an outer loop with an event wait at the top followed by a case analysis. Advantages of this approach for systems software design, over the alternative mechanism of encoding the state in the program counter, are given in [Macintosh 84]. The Macintosh approach to designing applications software is to avoid the use of modes [3], which are often confusing to the users.
The programming technique proposed to achieve this is an event-driven approach, where the heart of every application program is its main event loop, which repeatedly calls the Macintosh Toolbox Event Manager to get events (such as a press of the mouse button) and then responds to them as appropriate. The event-driven approach may be conveniently achieved by arranging that the client maintains as much state as possible, so that only the minimum state is maintained by the event manager between events. The state information is then encoded in the request event to the server. This approach is used in the Apollo DOMAIN system [Leach 83]. For example, a client request to read page N of a file is made idempotent by specifying the page number, N, in the request event to the file-server. Such a file-server need not then maintain state between events, and can be designed as an event-driven server. The client must maintain the state information of the current page number. Such multi-process event-driven systems are conveniently modelled by datagram-based message systems such as Thoth, RIG and UNIX 4.2BSD. However, they are not so conveniently modelled in the monitor-based systems such as PILOT, where the inter-process communication is through a monitor and not direct. The dual nature of message-based and monitor-based operating systems has been discussed by Lauer and Needham in [Lauer 79], so that in general an event-oriented approach can be simulated in all systems. The differences between them are illustrated in the examples of Chapter 4.

[3] A mode is part of an application that the user has to formally enter and leave, and that restricts the operations that can be performed while it's in effect.
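The idempotent read-page request can be made concrete with a small sketch (illustrative names and contents; the real Apollo DOMAIN interface differs): every request event carries the page number, so the server keeps no per-client state between events, and repeating a request is harmless.

```python
PAGES = {0: "first page", 1: "second page"}      # hypothetical file contents

def file_server(request):
    """Stateless event-driven server: the request event carries all context."""
    op, page = request
    if op != "read-page":
        return ("exception", "BAD-REQUEST")
    return PAGES.get(page, ("exception", "NO-SUCH-PAGE"))

class Client:
    """The client, not the server, holds the current-page state."""
    def __init__(self):
        self.current_page = 0
    def read_next(self):
        data = file_server(("read-page", self.current_page))
        self.current_page += 1
        return data

c = Client()
first = c.read_next()                            # page 0
again = file_server(("read-page", 0))            # idempotent: same reply again
```

Because the reply depends only on the request, a lost reply can be recovered by simply re-sending the request, which suits datagram-based message systems.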
Programs which are not amenable to the event-driven approach include numeric systems such as the evaluation of a function in a deterministic, purely functional programming language such as pure LISP; these systems are outside the scope of this thesis. Programs particularly suitable for the event-driven approach include simulation programs, operating system software and non-deterministic programs. Given a system which is tractable to the event-oriented approach, the author considered what top-down design principles are to be employed. Clearly efficiency is an important criterion, as inefficient programs are very susceptible to subsequent kludges to improve performance. The author was also concerned with mechanisms to achieve modular, incremental changes to programs, based on experiences with dynamic program specifications to add bells-and-whistles features -- difficult to achieve without adding unnecessary complexity to the system. From a bottom-up approach, there are many possible configurations of processes and inter-process communication paths in multi-process systems. Different choices may be made according to the major objectives of the system. Merging the top-down and bottom-up design ideas results in the model outlined below in Figure 3.3. The model treats program systems as event driven, and proposes the isolation of the events of the system which may be handled with equal cost. These equal-cost events are then identified as normal or exceptional, according to their probability of occurrence and the nature of the event. The program system is then designed according to an appropriate objective, chosen from three common cases -- minimising the average run-time, minimising the exception processing time (useful for real-time processing), or increasing program functionality. Other objectives such as maximising throughput or maximising resource utilisation may be modelled similarly; for conciseness we limit our discussion to these three objectives.
In the rest of this Chapter, the features of the model are elaborated in detail, with many examples to illustrate their use.

Model for Systems Design (Level 1):
(1) Treat the program system as event driven.
(2) Isolate the events of the system which may be handled with equal cost, and group together any such events.
(3) Identify events as Normal or Exceptional (based on their probability of occurrence and the nature of the event).
(4) Design the program according to the objective: minimise average run-time; minimise exception processing time; or increase the program's functionality.
Figure 3.3 Level 1 Model for Systems Design

(1) Treat the program system as event driven. Most systems programs respond to events; even some data-driven utility programs may be considered as event driven, where the value of the data represents the event to which the program responds. An example of such a program is a spell utility program which compares words in a source text file with those in a standard dictionary. Such a program reads characters one by one from the input file till it finds a word delimiter. A word delimiter may be any special symbol such as , ( ; ! . : etc. The completed word is then compared with the dictionary. An analysis of the spell utility shows that it may be considered as event-driven, where the value of the next character from the input file corresponds to a particular event. Thus for 8-bit character input, there are 2^8 = 256 events. The program has just one state, simply RUNNING. Like all event-driven programs, spell executes a perpetual loop, getting the next event (character from text file) and branching on that state-event pair to an appropriate event handler.

If two or more events are detected and handled in the same way, they can be grouped together to form a constant-cost state-event group. The probability of this new composite event occurring is the sum of the probabilities of the individual events occurring.
For example, in spell, each alphabetic character can be treated in the same way; therefore events corresponding to alphabetic characters can be grouped together and considered normal. The probability of a normal event occurring is the sum of the probabilities of the individual alphabetic characters occurring. In general, the division of events into Normal or Exceptional should be made such that all normal events occur with a probability greater than all exceptional events. However, in some systems it may not be easy or obvious how to separate events in this way. A guideline for whether to describe an event as normal or exceptional is to decide whether an exception-handling mechanism, requiring a different programming style and incurring different overheads (usually costly) at compile and at run-time, is appropriate for the event. Recall that an exception mechanism should be used when a programmer wishes to improve clarity of the normal-case computation, provided it does not incur unreasonable costs or impair reliability. In particular, errors may always be treated as exception events, even if an error event occurs with a probability greater than some other so-called normal event. However, for some systems, the probability of an event varies widely from time to time, and the occurrence of events may be highly correlated. For example, in a database transaction system, some queries may be rare on average, but when one such query is made, other related rare queries may occur soon afterwards. In such situations, the best rule is to take the approach that all events should be treated in the same way. (An example of this is the X.25 protocol implementation described in Chapter 5.) For yet other systems, the probabilities of the events are not known a priori. Again, it is best to treat all such events in the same way, and as stated in the introduction, the model cannot be used in such situations.
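Step (2) of the model -- grouping equal-cost events and summing their probabilities -- can be illustrated with a toy sketch. The per-character probabilities below are invented for illustration; only the grouping rule itself comes from the text.

```python
import string

def composite_probability(probabilities, members):
    """Probability of a constant-cost event group = sum of member probabilities."""
    return sum(probabilities[e] for e in members)

# Hypothetical per-character event probabilities for the spell example:
# 26 alphabetic events at 0.035 each; the 7 delimiters share the remaining 0.09.
probs = {c: 0.035 for c in string.ascii_lowercase}
for d in " ,(;!.:":
    probs[d] = 0.09 / 7

p_normal = composite_probability(probs, string.ascii_lowercase)
p_exception = composite_probability(probs, " ,(;!.:")
# p_normal (0.91) exceeds p_exception (0.09), so the grouped
# alphabetic event is classed as Normal and the delimiters as Exceptional.
```

With these figures the division satisfies the guideline that all normal events occur with greater probability than all exceptional events.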
Thus we restrict our discussion to event-driven systems where the probabilities of the events are known a priori. We have chosen to model systems meeting one of three common objectives.

A. Minimise the average run-time. A model for this important objective is described in the Section following, and many examples are described in detail in Section 3.3.

B. Minimise the exception processing time. This objective is considered because many real-time process control systems require prompt, urgent action in response to an exception event, such as the temperature exceeding a permitted threshold. Speedy exception processing can usually only be achieved by adding the overhead of data-collection to the normal events, so this objective conflicts with objective A above.

C. Increase the functionality of the system. A new approach to program design is to achieve incremental increase of the functionality of a server/client system by first designing a minimal system, and then providing for incremental addition of new features. The model for this objective, described in Section 3.2.3 below, proposes exploiting inter-process exceptions from a server to its clients to achieve the incremental functionality.

Another objective, that of ensuring that the exception handling code is correct, could be considered separately. This objective may be appropriate in systems where the efficiency of the mechanisms is not important compared with the necessity of correct exception handling, and where implementation of correct code may be to the detriment of performance. This situation may occur in a real-time process control system where failures are infrequent and may never be fully testable. Our model does not address this important issue in detail, because proofs of program correctness are outside the scope of this thesis. The model is now extended to cover these different objectives.

3.2.1. Objective A: Minimise the Average Run-time

The model is shown in Figure 3.4.
(1) At the first level, the major components of executing the normal events are determined, so the program design can be concentrated upon reducing the greatest cost component from the following [4]:

a) cost of detecting that the event has occurred;
b) approximate cost of handling the normal event;
c) cost of maintaining the minimum information needed just for correct handling of all exception events -- called the context-saved cost.

[4] The case where the exception handling time has a significant impact on the average run-time, such as when a very expensive inter-process exception notification mechanism is used, is considered under the next objective: that of reducing the exception-handling time.

Objective A: Minimise the Average Run-time

Level 2:
a. detection costs significant: map one exception onto another so NORMAL cases can be detected in just 1 test, e.g. spell, putbyte.
b. normal-case handling costs significant: structure to reflect logical event flow, NOT the logical structure, e.g. os utility.
c. context-saving costs significant: push more code into exception handling instead of in normal event handling, e.g. readback, nested atomic actions.

Level 3 (for multi-process systems):
1. add request exceptions to a server, and ensure all events are seen first by the server for handling normal events, e.g. Verex name-server, Newcastle Connection, TBL/TROFF/EQN.
2. try alternative process configurations to reduce ipc and context-switching for handling normal events, e.g. restructure ITI as a reader and writer filter + exception process.
3. use broadcast exceptions to reduce network traffic over a LAN, e.g. V-kernel atomic actions.
4. design problem-oriented protocols even at the expense of generality, e.g. LNTP protocol.
5. add separate exception processes on separate cpu's for concurrency, e.g. point-to-point protocol.

Figure 3.4 Level 2 and 3 Model to Minimise the Average Run-time

(2) To structure the program, the most significant cost component is considered from the above costs.

a) If there are many events, and/or the handling costs are small, the event detection costs may be significant. The model proposes mapping one exception onto another to enable normal events to be detected in just one test. For example, the spell and putbyte utilities described in Sections 3.3.1.1 and 3.3.1.2 respectively.

b) If handling costs are significant, the program system should be structured according to logical event flow, not the logical structure, e.g. the uni-process os utility described in Section 3.3.2.1. In multi-process systems there are several ways to achieve this. They are elaborated in the third level below.

c) If information is saved during normal case processing purely for correct handling of exceptional events, then an attempt must be made to minimise this context-saving code. The model proposes pushing as much code as possible into infrequently executed exception handling code [5]. For example, the provision of a read-back facility described in Section 3.3.3.1, and the design of the protocol LNTP for local area networks, described in Section 3.3.3.2. The author's implementation of nested atomic transactions, described in Chapter 5, also reflects this design principle of minimising context-saving code.

(3) In this level, various techniques are proposed for structuring multi-process programs to minimise the average run-time.

1. Add request exceptions to servers, and allow all events to be seen first by the server which handles normal events. For example, the Verex name-server described in Section 3.3.2.2, and the Newcastle Connection system, discussed in Section 3.3.2.3. The TBL/TROFF/EQN system also can be redesigned to exploit this technique, and an extended example to achieve this is described in Section 3.3.2.4.

2.
Try alternative process configurations to reduce the inter-process communication (and therefore context switches) in handling normal events. An example which follows this design principle is the author's implementation of the ITI server described in Chapter 5. It was redesigned from a server process with a reader and a writer worker, to a system consisting of a reader filter, a writer filter, and an exception handler process.

[5] Bearing in mind that it might be worthwhile to increase the expected use of the exception handling code to ensure it works correctly by mapping one exception onto another.

3. Use broadcast exception messages to reduce message traffic over a local area network (compared with many 1:1 messages), e.g. the author's implementation of atomic actions in the V-kernel, described in Chapter 5.

4. Design problem-oriented protocols to handle normal events most efficiently, at the possible cost of reducing the protocol's generality and functionality. This approach has been successfully used in the Apollo DOMAIN system [Leach 83]. Another example is the LNTP protocol, described in Section 3.3.3.2, in which a tradeoff has been made between functionality and performance in that the protocol's flow control has been tuned at the transport layer to the specific environment of a Local Area Network.

5. Distribute exception handling code so it is situated on physically distinct processors, thus increasing the concurrency so normal case processing can continue without interruption. An example of this is a point-to-point network protocol described in Section 3.3.2.5.

3.2.2. Objective B: Minimise the Exception Processing Time

The model is shown in Figure 3.5. As for Objective A, the major components of responding to an exception are determined, so the program design can be concentrated upon reducing the greatest cost component from the following:

a) cost of detecting that the event has occurred;
b) cost of the notification of the exception to interested parties;
c) approximate cost of handling the exception.

Objective B: Minimise the Exception Processing Time (Level 2)
a. detection costs significant: map all other cases onto one so EXCEPTION cases can be detected in just one test.
b. exception notification costs significant: these costs may be significant in multi-process systems; this system-dependent feature is modelled by the KEYBOARD-MOUSE example.
c. exception handling costs significant: push more code into normal-case handling to reduce exception handling.
Figure 3.5 Model to Minimise the Exception Processing Time

To structure the program, the most significant cost component is considered from the above costs.

a) Techniques for reducing the detection costs for exception events are similar to those for reducing the detection costs for normal events, considered in Section 3.2.1.

b) In some situations, the inter-process exception notification costs are significant. This is part of the system-dependent interface layer between client and server. Therefore, instead of a general model for minimising this cost, program paradigms for different systems and languages are given. An example problem, that of multiplexing input from a keyboard and a mouse, is described in Section 4.1, which illustrates various methodologies for reducing the 1:1 inter-process exception costs, and an example for using 1:many inter-process exceptions is the storage allocator described in Section 4.2. Further, if even the small extra cost of process switching is intolerable, the time for server-client exception notification may be minimised by using a uni-process client/server system with asynchronous exception notification from the server to an in-line procedure in the client via hardware or software interrupt signals.
c) For many systems, the exception handling time can be reduced only by increasing the contextual information saved while processing normal events. Techniques for reducing the exception handling time are the same as those used for reducing the normal case handling costs considered in Section 3.2.1.

3.2.3. Objective C: Increase the Functionality of the System

The model is shown in Figure 3.6 below. The minimal system for handling normal events is first implemented, bearing in mind that extra features, which may be termed exceptional, may be added later. The idea is to allow modular increase to the program's (or system's) functionality by treating new features as inter-process exceptions from a server to its client or clients. These extra features must be largely independent of the rest of the system. One convenient way to achieve this incremental increase of functionality is by adding a separate exception process at the client level to handle each new feature provided by the server. This can be achieved if there is only weak cohesion between the processes, and has the desirable effect of increasing the concurrency of the system. Examples are the optional reconnection of a broken network connection in the author's ITI protocol implementation, described in Section 3.4.2.1, the ITI <BRK> exception handler, and the storage allocator in Mesa, described in Section 4.2.3. By using ignorable inter-process exceptions, extra features provided by the server can be ignored by a client until a handler is provided. The feature is implemented when the client provides a handler for the inter-process exception.
Ideally, the handler should be almost independent of the client code, so that the client and handler are only loosely-cohesive modules; this is good software engineering practice and it facilitates proofs of program correctness.

Objective C: Achieve Incremental Increase of the Functionality of the System (Level 2)
Implement a basic system with minimal functionality and treat new features as inter-process exceptions detected by the server for its client (or clients) to handle.
- Exploit low cohesion between client and handler by making the exception handler a separate process, e.g. ITI reconnection after network failure, ITI <BRK>, storage allocator in Mesa and in V-kernel.
- For ignorable inter-process exceptions: add a handler to the client whenever convenient, e.g. window-size-change, hash table, slow graph plotter.
- For must-be-handled inter-process exceptions: implement the feature by adding a handler to the client, e.g. ITI <BRK>.
- Exploit cheap unreliable message delivery in a distributed system (such as datagrams), e.g. V-kernel atomic actions implementation.
- Exploit broadcast inter-process exceptions to increase the cooperation between clients of a server, e.g. the Mesa file server, the storage allocator.
Figure 3.6 Model to Increase the Program's Functionality

For example, the author implemented and tested the ITI protocol without a handler for the <BRK> exception -- this was successfully added later, as described in Section 3.4.1.1 below. Other examples of the use of ignorable inter-process exceptions are given in Sections 3.4.2.2-3.4.2.4. In distributed systems where the cost of reliable message delivery is high, it can be advantageous to use a cheaper, but unreliable, message delivery system. An example is the author's implementation of atomic actions, described in Section 5.2. One aim may be to provide increased cooperation between the clients of a server, for which broadcast inter-process exceptions are particularly useful.
An example which achieves this objective through the techniques described is the Mesa file-server, described in Section 3.4.2.5. As the inter-process exception mechanism is system dependent, program paradigms are given in Section 4.2 for a storage allocator problem which uses broadcast inter-process exceptions to increase cooperation between clients, in different languages and operating systems.

3.3. Examples Illustrating Minimising the Average Run-time

First the costs of the program events must be analysed to determine which component costs are the most significant, so the appropriate branch of the model can be followed. A simple technique for uni-process programs has been developed by the author, and a typical analysis of a general program is detailed. This technique has proved very helpful in analysing uni-process programs and systems, and extensions of the techniques to multi-process systems have also been made by the author. In a general event-oriented program, there is a set U of possible events that can occur during program execution, and a probability function p such that p(u) indicates the probability of event u, where u ∈ U. Efficient performance is important in software programs, and so a cost function, C, is also defined, such that the cost C(u) is the cost of executing the program on receipt of event u. Then the expected cost of executing the program, C_T, is given by:

    C_T = Σ_{u ∈ U} C(u) · p(u)

It is assumed that the total cost of executing an event, C(u), is composed of several parts, including the cost of detecting that the event has occurred. By restructuring the program to reduce a component of the execution cost, such as the detection cost, the expected run-time may be reduced.

3.3.1. Partition the Event Set into 2 Groups to Reduce Detection Costs

3.3.1.1. Spell Example

Suppose there is an event-oriented program with events u_1, u_2, ..., u_{n-1}, u_n. There is a simple test to detect the occurrence of each of the first (n-1) events, and the n-th event is assumed to be none of the others. Further, the tests for events 1 through (n-1) must be made, in turn, before the n-th event can be detected, and the n-th event is the most probable. An obvious algorithm to execute this is given by:

    loop
        getevent( event );
        if event = u1 then handle u1;
        else if event = u2 then handle u2;
        ...
        else if event = u(n-1) then handle u(n-1);
        else handle un;
    end loop;

[6] We assume that this cost of executing an event is independent of program state. If this is not the case, then the event set may be increased to include compositions of program state with event, so that the constant cost assumption holds, as shown in the putbyte example in the next sub-Section.

An example of such a program is a spell utility program mentioned previously in Section 3.2, where each event is the next character from a text file, and all alphabetic characters are considered to be normal events [7]. Now if there are (n-1) word delimiters, events u_1, u_2, ..., u_{n-1} correspond to the character event being one of the word delimiters, and event u_n corresponds to the character being an ordinary alphabetic [8]. Note that the word delimiters are peer exceptions in that they are transparent to the caller of spell, and they are handled within the utility program. Assume the probabilities of the exceptional events are all equal [9] to p, as shown in Table 3.1, and the handling cost H is the same for each event. The cost of detection of the events is given in the second column of Table 3.1, assuming that each test takes 1 instruction.
T h e total execution cost, (excluding the constant extra cost of handling each event) is shown as Table 3.1 Execution Costs for a General Event Manager event probability of event cost of detection expected execution cost new detection cost new expected execution cost P 1 lp 2 2p "2 P 2 2p 3 3p «n-2 P n-2 (n-2)p n-1 (n-l)p "n-1 P n-1 (n-l)p n-1 (n-l)p «n l-(n-l)p n-1 (n-l)(l-(n-l)P) 1 l-(n-l)p Total expected costs/event (excluding handling costs) (n-l)-p(n-l)(n-2)/2 l-p+np(n-l)/2 7We assume that a simple test to see if a character is alphabetic cannot be made 8In our analyses we assume events occur independently of each other, so there is a constant proba-bility p(u) of an event occurring. Violation of this assumption merely reduces the strength of the per-formance gains to be made from following our model, and does not affect the nature of the gain. 9 In the spell utility, the character BLANK will appear as a word delimiter much more frequently than other special characters. For simplification in the general example, we assume that all exceptional events are equally probable. 68 column 3 of the Table. To improve this program, the events are divided into 2 sets — normal {«„ } and excep-tional {«!, «2,-"> u(n-i)}- Then a dummy variable is introduced so that all the exceptional events can be mapped onto the one exception, to reduce the detection costs of the normal event10. Then the program could be written as follows: The cost of detection of the events is now given as column 4 of Table 3.1. If we assume as before, that the handling cost is H for each event11 and the probabilities of the exceptional events are all equal to p then new costs are shown in the last column of the Table. 
    loop
        getevent( event );
        if f(event) then HandleExceptions( event );
        else handle un;
    endloop;

    HandleExceptions( event )
    begin
        if event = u1 then handle u1;
        else if event = u2 then handle u2;
        ...
        else if event = u(n-2) then handle u(n-2);
        else handle u(n-1);
    end;

10 The choice of a dummy variable depends on the application. Suppose that all exceptional events can be mapped to return a value TRUE from some function, f(event), and all normal events to return FALSE. In particular, for the spell program, such a function could be a simple look-up table, ODDCHAR, with 256 entries, one for each possible 8-bit character, indexed by the character's bit-pattern. The entry in ODDCHAR for each delimiter is the value TRUE, and for each alphabetic character, the value is FALSE. Thus ODDCHAR[event] corresponds to f(event).

11 Without making further aggregations of constant-cost events.

The last row of the table shows C1 = total expected cost per event in the initial program and C2 = total expected cost per event in the new program. Then, from columns 3 and 5 of the Table,

    C1 = H + (n-1) - p(n-1)(n-2)/2    and    C2 = H + (1-p) + np(n-1)/2.

The expected change in run-time, C1/C2, can be calculated for different values of the combined probability, np, of the exception events. Let C1/C2 = a, the change in run-time; there is a reduction in run-time when a > 1. Now C1/C2 > a when

    H + (n-1) - p(n-1)(n-2)/2 > a( H + (1-p) + np(n-1)/2 ).

Let np = y. Then C1/C2 > a when

    H + (n-1) - y(n-1)(n-2)/(2n) > a( H + (1 - y/n) + y(n-1)/2 ),

i.e. when

    y < 2n(n-1-a+H-Ha) / ( (n-1)(na+n-2) - 2a ).

For the best possible speedup, when H = 0,

    y < 2n(n-1-a) / ( (n-1)(na+n-2) - 2a ).

As n -> infinity, np -> 2/(a+1). Values of y = np against n are plotted in Figure 3.7 for a = 1.25, 1.5 and 2. From Figure 3.7, it is seen that there is a reduction in run time for all cases where n > 2. Significant reductions of more than half, when a = 2 (and H = 0), occur when the combined probability np of the exception events is less than 0.5.
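These cost expressions are straightforward to check numerically. The sketch below (in C, with illustrative parameter values that are not taken from the thesis) encodes the two column totals of Table 3.1 together with the constant handling cost H:

```c
#include <assert.h>

/* Expected cost per event for the original chain of (n-1) tests (c1)
 * and for the restructured two-set version (c2), from the totals of
 * Table 3.1 plus the constant handling cost H.  p is the probability
 * of each of the (n-1) exceptional events. */
static double c1(int n, double p, double H)
{
    return H + (n - 1) - p * (n - 1) * (n - 2) / 2.0;
}

static double c2(int n, double p, double H)
{
    return H + (1.0 - p) + n * p * (n - 1) / 2.0;
}

/* The speedup a = c1/c2; a > 1 means the restructuring pays off. */
static double speedup(int n, double p, double H)
{
    return c1(n, p, H) / c2(n, p, H);
}
```

With n = 16, p = 0.01 and H = 0, for instance, c1 evaluates to 13.95 and c2 to 2.19, a speedup of better than six; as np grows towards 1 the advantage shrinks, in line with Figure 3.7.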
[Figure 3.7: Expected Execution Costs for Different Values of np and n]

To achieve a speedup of two times when H > 0, set a = 2; the numerator in the above expression is then 2n(n-3-H), which must be positive, i.e. H < n-3.

Thus the partition of the event-set U into two disjoint sets N and E, and the mapping of several exception events onto one to reduce detection costs in N, roughly halves the run time when the handling cost for each event is small relative to the number of events(12).

12 Performance gains in testing for the common cases first have long been recognised, as reflected in the FREQUENCY statement in FORTRAN I and II, which enabled the programmer to inform the compiler of the probability of various cases by facilitating flow analysis. However it was dropped from subsequent versions of FORTRAN as the statement required a tremendous amount of analysis at compile time and did not yield enough benefits at run-time. But as Sammet remarks in [Sammet 69], the fact that the statement became unimportant is not a reflection on the value of the initial technological contribution -- that of giving considerable thought to the possibilities of compiler optimisation of object code.

3.3.1.2. Putbyte Example

The author has analysed a more complex example illustrating how exception detection costs have been reduced in a real program, putbyte. In the Verex operating system [Cheriton 80] the routine putbyte outputs a character to a selected output device. Cheriton reduced the average run-time of this routine by mapping several exceptions onto one using a fix-up function, and hence reducing the detection costs. The author analysed the programs in detail, which are described in Appendix A.
The program was redesigned by dividing the events (each byte to be output) into 2 sets -- normal { putbyte called to a valid output device with the in-core buffer not full } and exceptional { putbyte called to an invalid output device, putbyte called to a valid output device with the in-core buffer full, and putbyte called with the END-OF-FILE character }. In the modified program the normal event is detected in just one test, instead of in two tests as in the original version. Because the handling costs are very small, a reduction in expected run-time from 5.75 to 4.75 instructions/event is achieved, which represents a significant saving for such a commonly used utility.

The putbyte function illustrates the principle of exploiting an existing exception-detection mechanism which catches peer exceptions -- namely the check made within putbyte for more space in the output buffer -- to catch a server-client exception occurring at a higher level, namely, there being no selected output. This saves run-time overhead on exception detection in the normal case -- alternative implementations incur the overhead of checking for selected output on every call of putbyte. Furthermore, it saves code.

The author has observed that this practice of exception detection at low levels is used extensively in high-level languages. For example, consider the DIVIDE operator. In most implementations the user does not bother checking if the divisor is zero before trying to divide -- but the underlying hardware does trap the error, and, ideally, some mechanism is provided in the high-level language for the user to take suitable recovery action.
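The single-test trick at the heart of putbyte can be sketched as follows. This is a hypothetical reconstruction in C, not the Verex code; the names (outdev, fixup, and the convention that an unselected device has a buffer limit of zero) are invented for illustration. The point is that the normal case pays for exactly one comparison, and the same rare-case routine then distinguishes buffer-full from no-selected-output:

```c
#include <assert.h>

#define BUFSZ 8

/* Hypothetical reconstruction of the putbyte idea: an unselected
 * device is given limit = 0, so the single buffer-full test in
 * putbyte also catches the no-selected-output exception. */
struct outdev {
    char buf[BUFSZ];
    int  pos;       /* next free slot in buf */
    int  limit;     /* BUFSZ normally; 0 => no output selected */
    int  flushed;   /* bytes pushed out so far (stands in for real i/o) */
    int  rejected;  /* bytes refused because no device was selected */
};

/* Rare-case routine: only here do we distinguish the two exceptions. */
static void fixup(struct outdev *d, char c)
{
    if (d->limit == 0) {     /* server-client exception: no output selected */
        d->rejected++;
        return;
    }
    d->flushed += d->pos;    /* peer exception: buffer full, so empty it */
    d->pos = 0;
    d->buf[d->pos++] = c;
}

/* Normal case: exactly one test per byte, as in the redesigned putbyte. */
static void putbyte(struct outdev *d, char c)
{
    if (d->pos >= d->limit)
        fixup(d, c);
    else
        d->buf[d->pos++] = c;
}
```

With limit = BUFSZ, only the occasional flush falls through to fixup; with limit = 0 every byte does, which is exactly the no-selected-output case being caught for free.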
This raises the issue of what level is most appropriate to catch and handle an exception; this example shows that it is sometimes practical to catch and handle it at the same level, whereas in later examples we show it is sometimes appropriate to catch it at a lower level and handle it at the level above (as in the Newcastle Connection example of Section 3.3.2.3).

3.3.2. Reduce Handling Costs by Restructuring Programs

The division of the events into two groups to increase the efficiency of detection can be exploited further to make programs more efficient. Suppose the total cost in running a program is given by:

    CT = SUM(e in E) C(e)*P(e) + SUM(n in N) C(n)*P(n)

where C(e) = the total cost of executing an exception event, and C(n) = the cost of executing a normal event. This must be minimised to reduce the average run-time cost. Now as many event-driven programs are driven by a small number of events, we assume that P(n) >> P(e). Thus reducing the cost of handling a normal event, C(n), should reduce the total cost, even if the cost of handling another, exceptional, event is correspondingly increased.

3.3.2.1. Os Example

An example of restructuring for the statistically dominant case to reduce average run-time (due to Cheriton) is a utility program os (for over-strike). Os converts a formatted text file containing backspace characters for underlining and boldfacing, to a file achieving the same printed effect using overprinting of whole lines, and containing no backspaces. Like the spell program described earlier, this data-driven utility can be treated as an event-driven program, where the input value of each character read represents an event. In its original structure, the program used multiple line buffers, one for each level of overprinting on the line. Each character except backspace and line terminators was placed in the highest level line buffer that was currently unoccupied in the current character position.
Thus, its structure reflected the logical operation it was performing, translating a character stream into overprinting line buffers. However, its structure also meant that the character it processed with the least overhead was the backspace character. Given that backspaces constitute about 0.5% of the characters in most text files, this program was structured inefficiently. Details of the program, and the author's analysis, are given in Appendix B. From the analysis of the event costs, the expected cost of execution of the program is R + 1.25W + 13.71 instructions/event, where R = the average number of instructions needed to read a byte and W = the average number of instructions to write a byte.

This program does NOT reflect any structuring for the statistically dominant case -- indeed it processes one of the least likely bytes (backspace) most efficiently. Cheriton undertook to rewrite the program. First, statistically speaking, the program is simply a null filter copying its input to its output unchanged. Starting with this view, the exceptions to recognize are END-OF-FILE, a server-client exception, and BACKSPACE, a peer exception transparent to the user. This leads to a different program structure, also detailed in Appendix B, with an expected cost of execution of (R + 1.24W + 3.8) instructions/event. If W = R = 4, a reasonable assumption for UNIX, this new program shows an analytical speed-up of 22.7/12.8, comparable to the speed-up of two times observed.

In the analysis, we find that the processing of normal-case characters changes from 9.85 to W + 3 instructions/event in the new program, and as it is assumed that W = 4, this provides an improvement. Therefore, some of the speed-up is achieved through the more efficient processing of normal characters. The rest is achieved by the much-reduced cost of processing the NEWLINE character -- from (125W + 382) to (W + 3) instructions/event.
Here analysis shows that not only is the normal case processing in the restructured program more efficient, but also one of the so-called exception cases, NEWLINE, has been made much more efficient(13).

This example shows that by restructuring the program to reflect the statistically dominant case rather than the logical flow, it is possible to make major savings in run-time costs(14). This particular example corresponds to one of Bentley's rules in his book Writing Efficient Programs [Bentley 82], called Lazy Evaluation: the strategy of never evaluating an item until it is needed, to avoid evaluation of unnecessary items; in the new version of the os utility, the multiple line buffers for overprinting are not established until they are needed.

A similar technique could be used for a high level language parser. In most high level languages, the most likely statement after an assignment statement is another assignment statement. Therefore the parser should hand control to the assignment statement handler first once an assignment statement has been found; if the parse then fails, the program should be able to back up and recover.

13 But at the increased expense of handling the BACKSPACE exception.

14 But note that if W is greater than 6, it would actually take longer to process normal case characters. Therefore the cost of the system-provided i/o instructions must always be carefully checked to see whether structural changes would in fact be beneficial.

3.3.2.2. Verex Name-server Example

To improve the run-time efficiency of event-driven systems involving multiple processes, the system should be structured to minimise the inter-process communication or message-flow for the statistically dominant case, not necessarily for the logical or functional flow. By doing so, the handling cost of the normal events is reduced. As an example, consider a system design for a name-server.
Any system resource accessed by name is checked by the name-server and passed to the controlling process. Ideally, the name-server would handle access requests to disc files by handing them to the file-server, input-output requests to the appropriate i/o device handler, and so on. However, this is too inefficient for file access (which is the most common resource request by name), so many systems have adopted the solution of two types of naming.

An alternative approach, implemented (by Demco) in the Verex operating system, is to treat named requests for files as normal events and named requests for other devices as exceptional events. It is illustrated in Figure 3.8 below.

[Figure 3.8: Restructuring the Verex Name-server to Reduce Ipc in the Normal Case]

All requests for named services made by a client go first to the file-server, which handles file-name requests as before. The file-server detects the exception NOT-FILENAME and handles it by executing a kernel request Forward, which forwards the message buffer to the name-server as if it came directly from the client. The name-server treats the request as before. In this way only 2 ipc messages are needed to access each file, involving 2 process switches. If 90% of named requests are for files, for every 10 requests there are 22 process switches -- 9 requests are handled by the file server in 2 process switches each, and one request is forwarded from the file server to the name server and then to the required i/o server, which makes the Reply to the client. With the original implementation 30 process switches would be used -- each request passes from the name server to the required i/o server, which makes a Reply to the client.
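The process-switch arithmetic generalises to any traffic mix. A small sketch (illustrative only; the function names are not from the thesis): with a fraction f of named requests being file requests, the original design always costs 3 switches per request, while the restructured design costs 2 for a file request and 4 for a forwarded one:

```c
#include <assert.h>

/* Expected process switches per named request.
 * Original: client -> name server -> i/o server -> Reply = 3 switches.
 * Restructured: file requests are served by the file server in 2
 * switches; anything else is Forwarded on to the name server and an
 * i/o server, costing 4.  f is the fraction of file requests. */
static double switches_original(double f)
{
    (void)f;                          /* cost is independent of the mix */
    return 3.0;
}

static double switches_restructured(double f)
{
    return 2.0 * f + 4.0 * (1.0 - f);
}
```

At f = 0.9 this reproduces the 22-versus-30 figure above, and the restructured design remains cheaper whenever more than half the named requests are for files.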
The client is unaware of this implementation; hence client requests for named resources other than file names are treated as request exceptions by the file server (which previously only received file names from the name server). Thus restructuring has been achieved by adding request exceptions to the file server and removing the common file-name requests from the name server.

3.3.2.3. Newcastle Connection Example

The Newcastle Connection Distributed UNIX Operating System [Brownbridge 82] is another situation where the system could be restructured to improve its efficiency by reflecting the statistically dominant case rather than the logical flow. In this system, before each system call that uses a file is performed, a check is made to see whether the file is local or remote. This check consists of a system call, stat, which returns local or remote. Local calls are placed unaltered to the underlying kernel for service; remote calls are packaged with some extra information, such as the current user-identifier, and passed to a remote machine for service. Assuming at least, say, 90% of requests are for local files, this implementation uses the server-client exception REMOTE-FILE detected by the stat server (in UNIX the system calls stat and open act like servers) to control the client's subsequent actions. It introduces the overhead of a stat call whenever a file is accessed by name; this system call takes 1.6 msecs on a VAX 11/750.

An alternative approach, designed by the author, is illustrated in Figure 3.9 below. It eliminates the overhead in the normal case where local file access is required, by assuming all file accesses are local, and by making the client try a local file access first. In this case, the exception event REMOTE-FILE is represented by an additional request exception to open, the file-access server. The file-access server open detects when the request is of the exceptional form NOT-LOCAL (i.e. possibly remote, or erroneous) and handles it by making a failure return from the client call.

[Figure 3.9: Restructuring Newcastle Connection to Reduce Ipc in the Normal Case]

The client checks the reason for failure, and then re-tries as a remote file access. This saves the overhead of a system call on every local access at the cost of extra overhead for every remote access. This example also illustrates restructuring by adding a request exception, REMOTE-FILE, to a server open which already has to check for the exception of the file not being found locally, and by removing all the common requests for local file detection from the stat server.

Both these examples illustrate how to reduce the context switching for the normal case by adding a request exception for exceptional cases to the server which is used for handling normal cases, and then directing all requests first to that server. It is a particularly good technique in these examples because the normal-case server already had to check for the exceptional cases.

3.3.2.4. The TROFF/TBL/EQN Example

This technique can also be used where a program suite consists of a system of multiple processes, where output from one process is used as input to another via a pipe, such as the UNIX word-processing system which consists of a suite of programs, TROFF, TBL and EQN, for various specialised tasks.

TROFF is a phototypesetting language which formats text documents for a phototypesetter. It accepts lines of text interspersed with lines of format control information, and formats the text into a printable, paginated document having a user-designed style. TBL is a document formatting preprocessor for TROFF. TBL turns a description of a table into a TROFF program that prints the table. It isolates a portion of a job that it can handle (viz.
a table) and leaves the remainder for other programs, by isolating a table with delimiter pairs such as TS and TE.

EQN is a system for typesetting mathematics, and it interfaces directly with TROFF, so mathematical equations can be embedded in the text of a manuscript. This is done by surrounding an equation with delimiters such as EQ and EN. If a text document contains tables and equations, some of which are embedded in the tables, it is processed in UNIX thus:

    TBL <document> | EQN | TROFF

where | is the symbol for a UNIX pipe, meaning that the output from the program on the left of the pipe is sent as input to the program on the right of the pipe. A pipe is implemented as a large buffer in kernel memory, which saves the overhead of creating a temporary file for the program's output. This provides a structured, modular system which reflects its logical action; the process TBL is a module that passes the next input character to the process EQN, which in turn passes the next input character to the process TROFF.

This way of structuring modules has been found to be successful in many applications. For example, a compiler may produce unoptimised code quickly for test purposes, or the output from the compiler may be passed as input to an expensive optimiser before final code generation for production programs. That is, the test compiler output may be passed as input to a code generator (unoptimised), or the test compiler output may be passed as input to a final code generator (optimised). However, in many documents most characters require no processing by TBL or EQN, so there is a performance flaw with this design. We now consider how such systems can be made efficient as well as nicely-structured.

The model proposes that the system can be restructured so the inter-process communication is minimised for the statistically dominant case.
This can be achieved by treating lines with no equations or tables as normal lines, and lines with tables or equations as exceptional. All lines are sent to TROFF first. TROFF detects when the line is of the exceptional form EQN or TBL and handles it by forwarding the line to the appropriate server, where it is treated as before -- which means that after processing by, say, EQN, it is passed back to TROFF for further processing. Thus the EQN-TROFF relationship is complex: TROFF effectively calls EQN which in turn calls TROFF again to process its output. To solve this problem of mutual recursion in a feed-back loop, the UNIX designers invented pipes and filters in the early 1970's(15).

The thesis model proposes that TROFF, EQN and TBL are each structured as simple unbuffered server processes, executing the following loop:

    loop
        Receive(msg, id);              ** from the level above or from EQN or TBL
        process-as-necessary;
        Send(msg, to-server-below);
        Reply(msg, id);                ** to the sender
    endloop;

Normal text is processed only by TROFF. Suppose TROFF encounters an equation. Then TROFF passes it to EQN (in Thoth-like operating systems this may be done by executing a kernel routine Forward, which forwards the message buffer to EQN as if it came directly from the client), and TROFF completes its loop immediately instead of making a Reply to the client.

15 Note that in the os utility the problem was solved procedurally by constructing line-buffers to hold the intermediate output processed in HandleException. The data was processed by writing in-line code, instead of making a call-back procedure. The problem with the text formatting example is that the size of the intermediate buffer for EQN and TBL is not known, and the code of TROFF is too large to write again in-line.

After EQN has processed the data it executes a Send to TROFF for the data to be processed further.
TROFF recognizes that this data must not be forwarded; instead it processes it as usual, finally unblocking EQN by making a Reply to it, so EQN can then unblock the client process. The author's design and analysis of both the pipe system and the restructured system design, for a Thoth-like operating system, are given in Appendix C. Analysis shows that for each 1000-line document there are 12000 context switches in the initial design. In the new structure, assuming 90% of the text is normal, 5% is equations and 5% is tables, there are 4300 context switches, a reduction in process switches of nearly 3:1. If a context switch is equivalent to n instructions, 7700n instructions are saved on processing a 1000-line document, a significant amount in most systems, where context switching may take half a millisecond.

One could argue that the TBL | EQN | TROFF system could be improved by combining all the programs into one module, as the TEX word processor does [TEX 84]. But multi-process structuring is good design; our approach is to develop techniques to optimise it.

3.3.2.5. Point-to-point Network Protocol Example

In multiple-processor systems, further improvements can be made in run-time by distributing the exception handling code so it is situated on physically distinct processors. An appropriate technique is to implement each exception handler as a separate process. This allows the normal case code to continue without interruption, with a maximum saving in run-time equal to the cost of the exception handling. This technique can be used to advantage with handlers for peer exceptions and for server-client exceptions. An application of this principle is found in the design of a point-to-point network protocol.
In some protocol implementations, if the network becomes busy, a busy receiver simply discards incoming data packets, instead of trying to handle them by sending a message such as HOLD, which would take up more processor and network time. As the sender already has exception code to handle lost packets, the sender will use this mechanism to retransmit at some later time. This may help the receiver to reduce the congestion; further, in some dynamic routing algorithms, this induced delay on the channel to the congested node will reduce its preference as a route, further reducing the congestion.

From the view of the model, this example illustrates the principle of minimising the cost of handling a critical real-time exception by mapping one exception (RECEIVER-BUSY) at the receiver onto another (LOST-PACKET), handled at the sender. This shows how peer protocols can map peer exceptions between peer processes. This example also illustrates how an implementor can choose the event (in this case, the exceptional RECEIVER-BUSY event) to be handled most efficiently, so that recovery is possible. It also leads to a general principle of how exceptions may be handled simply, by translating them into exceptions that are already handled by another mechanism. The protocol runs on two distinct processes, and when one, the receiver, becomes busy, the exception handling is effectively forced onto the other, viz. the sender, thus saving run-time at the busy node. However, most programs and algorithms are not structured to exploit such concurrency, and the problems of coordination of exception handling and normal case processing are not yet well understood.

On a single-processor system there are obviously no gains in run-time to be made by executing the exception handler as a separate process, but there are clear logical advantages, viz:
(1) good separation of normal case code from exception code;
(2) context saving of the normal case is locked in the well-defined state of a process;
(3) experience gained in structuring programs this way will be useful in designing distributed systems which exploit true concurrency.

In the operating system Medusa, the exception handling code can be specified as another process: the buddy of the victim which incurs the exception. This feature was described in Section 2.2.6.3, and it was shown to be useful in certain situations, such as in remote debugging.

3.3.3. Reduce Context-saving Costs in Normal Events

The third approach to improving efficiency through dividing event-oriented programs into normal and exception events is to reduce the context saved in the normal case, appreciating that in some critical real-time programs, rapid exception handling may be needed which may conflict with this objective.

3.3.3.1. Read-back Example

For a simple application of the principle of reducing context-saving in the normal case, consider the os program of Subsection 3.3.2.1. In the revised program, the column count on the current line is maintained with every byte read, in order to handle the backspace exception, thus adding the overhead of context saving to the normal-case processing. This amounts to 1 instruction per byte, which equals 6% of the total handling cost of 13 instructions/event. By maintaining this context, the exception handler is quite straightforward to write.

As an alternative, the normal-case context saving could be eliminated by pushing more work onto the HandleException routine, for example, making it read backwards to check the last occurrence of the newline character. This illustrates a tradeoff between maintaining the context up front so that exception handling is relatively simple, or maintaining little or no context and making the exception handling more complex.
This could be automatically and efficiently maintained by the system, if there were a way of specifying it in a program. Such a facility could be used as illustrated in the program fragment given at the end of Appendix B.

3.3.3.2. LNTP Example

A more complex example is now described, showing how improved efficiency may be obtained through reduction of context-saving for exception handling in communication protocols. Communication protocols can be treated as event-oriented programs, each layer of which is typically implemented as a finite state machine making state transitions in response to the messages received from the layers above and below. In communication protocols, much of the code deals with error detection and recovery. In long haul networks (LHNs), characterised by a long network delay (low bandwidth) and a high error rate, there is a high overhead on normal case processing(16) of virtual circuit point-to-point connections, to maintain context so that errors can be handled without a breakdown of the virtual circuit.

For example, the ARPANET protocol TCP/IP [DARPA 81] was designed to provide internet packet transport over error-prone subnets. The services provided include internet address handling, routing, internet congestion control and error reporting, multiple checksums (one at the TCP level and the other at the IP level), segment fragmentation and reassembly, datagram self-destruction mechanisms and service level options to clients.
The large, byte-level sequence numbers used in TCP (32 bits) incur considerable processing overhead [Chanson 85a] but are justified in the context of a LHN on the grounds that a large recycling interval allows easy identification of out-of-order and duplicate segments, and that byte-level sequence numbers facilitate the fragmentation of segments at the local IP layer and the intervening gateways, and the reassembly at the remote TCP entity.

16 We assume the normal case is when there is no lost or erroneous data.

In contrast, a single local area network (LAN) is characterised by a low transmission error rate, by single-hop routing, and by a high channel speed such as 10 Mb/sec, compared with a limiting speed of about 9.6 Kb/sec for a long haul network. In a local area network, the major portion of the packet delay time (the time to process and successfully deliver a packet to its destination) shifts from the transmission delay time of long haul networks to the protocol processing time. Thus the efficiency of the protocols is an important design issue for local area networks. The overheads in maintaining context for correct error handling of a connection-oriented protocol such as TCP/IP are unjustifiably high for a local area network, as the error rates for local area networks are so small (less than 10^-9). The overhead of layered protocol implementation has been described in [Bunch 80] -- adding protocol layers adds overhead.

One solution, adopted in [Chanson 85b], in which a virtual circuit is implemented by a very simple protocol for point-to-point data transfer, applies this principle to reduce normal-case run-time. This protocol, called LNTP (Local Network Transport Protocol), takes into consideration the characteristics of LANs. LNTP runs under 4.2 BSD UNIX, replacing TCP/IP for local communication(17).
17 Since the majority of the packets in a LAN are for local consumption, this scheme greatly improves the network throughput rate as well as the mean packet delay time. When packets are destined for other networks supporting TCP or some other protocol, the protocol can be implemented at the gateway.

The fundamental philosophy in the design of LNTP is simplicity. The objective is to reduce the protocol processing time, in addition to improving understandability and ease of maintenance. Thus we set out to design a new protocol that includes only the features strictly required in a single LAN operating environment. Any functions that are needed only on rare occasions, particularly if they can be easily achieved by the application programs, are not included. Therefore, internet congestion and error control, routing, service options provided to high level protocols and certain functions that handle damaged packets found in a typical LHN protocol are not included, as they are irrelevant in a LAN environment. Even checksumming is specified as an option, since the error rates of the communication medium are negligible. The only mandatory error control feature of LNTP is the selective retransmission scheme to handle packet loss due to buffer overflows.

Consistent with the characteristics of a single LAN environment, the peer address (16-bit address space) does not include an internet component. Furthermore, since the probability of out-of-order delivery and duplicates is negligible, the sequence number space is made small (4 bits) -- thus detection of lost-packet exceptions is achieved using very simple sequence numbers. To reduce the state information required to maintain a connection (which simplifies the control structure, leading to reduced processing overhead), the sender and receiver are logically separated and the send and receive channels are completely decoupled.
A consequence of this decision is that a receiver is unable to piggyback control information on reverse data packets to the sender, a feature commonly supported by L H N protocols to conserve communication bandwidth. However, network bandwidth is not a scarce resource in a L A N . Moreover, our measurement results [Chanson 85a] show piggybacking is rare due to the unavailability of reverse data packet at the right time. T h u s in view of the simplicity and reduced processing overhead, L N T P is asymmetric in send and receive. 87 T h e flow control mechanism in L N T P maximises the parallel operations of the sender and receiver, and minimises the number of control packets that has to be exchanged. T h e concept of threshold in the window space reduces the control traffic (no control before the threshold is exceeded). Because of the deterministic nature of packet delays, a mathematical model can be formulated for a proper threshold to be set as a function of the system parame-ters to maximise the network throughput rate. L N T P implements a single logical timer; in contrast, almost all other protocols specify a separate timer for each outstanding packet (though some implementations cut corners in this). A preliminary implementation of L N T P in the 4.2 B S D U N I X kernel running on a S U N workstation has been made. T h e protocol was tested in a software loopback mode, and its performance compared to T C P / I P . In U N I X 4 .2BSD, the file transfer rate is increased to 450 kbits/sec compared with 360 kbits/sec for T C P / I P running under identical conditions. T h i s improvement is basically due to the simplicity in the control structure resulting in lower pro-tocol overhead for L N T P . T h i s design illustrates the reduction of average run-time by reduction of context mainte-nance specifically for error-handling, and by tuning a protocol's flow control to be most efficient for normal error-free data transfer. 3.4. 
Examples Illustrating Increasing the Functionality of the System

3.4.1. Using Inter-process Exceptions from Server-client

3.4.1.1. ITI <BRK> Example

An example of the technique for providing an incremental increase in a system's functionality was made by the author in implementing the transport layer protocol X.29 [CCG 78], here referred to as Datapac's Interactive Terminal Interface (ITI) protocol. ITI uses the services of the network layer protocol X.25 [CCITT 76] in the Verex operating system. The author designed the ITI protocol as an input/output filter consisting of a reader process and a writer process [Atkins 83b]. This minimal configuration supported normal read/write requests, but did not support the exception event of a <BRK>. The author was able to implement and test this basic system without any exception handling facilities. An exception process for handling <BRK> was then added to the ITI layer and the system retested with the <BRK> exception. This example illustrates implementing a new feature by making the server export an inter-process exception to a loosely-cohesive handler at the client layer. It is described fully in Chapter 5.1. It also shows how concurrency is increased through multi-process structuring.

3.4.2. Ignorable Exceptions

So far, only those exceptions which must be handled for correct operation of the system have been considered. However, certain exceptions, which we call ignorable exceptions, may be used for notification only; if there is no handler to receive the notification, the system still runs correctly, though perhaps with reduced functionality or less efficiently.

3.4.2.1. ITI Reconnection Example

As already mentioned in the previous Section, the author implemented a transport layer protocol ITI which uses the services of the network layer X.25 protocol.
It was observed that on several occasions a remote user would lose his connection and immediately log on again, only to find that the program he was working on had been destroyed because of the broken virtual circuit. It therefore seemed desirable to provide an ITI implementation and session-layer service which would hide a failure in the underlying network from the application layer until either a timeout fired or the same user logged in again. In the latter case, the application program will be available to the user just as if no disconnection had occurred (except for a brief RECONNECTED message).

The author decided to exploit an ignorable inter-process exception from the X.25 server to the client ITI to provide this incremental increase to the system's functionality. As already discussed, the ITI layer had a separate exception process to handle the <BRK> exception. The author decided to use the same exception process to handle the new reconnection feature, by exploiting an ignorable inter-process exception from the X.25 layer below. The author implemented the ignorable inter-process exception as a non-blocking Reply from the X.25 server to the ITI exception process whenever the network connection was lost. The exception process was modified to handle the reconnection. The technique used is detailed in Section 5.1.5. This is a powerful example of increasing the functionality of a client/server system incrementally by using ignorable 1:1 inter-process exceptions.

3.4.2.2. Window-size-change Example

Consider how a window management program could use an exception mechanism to handle a WINDOW-SIZE-CHANGE notification. If no handler exists for this exception, the window manager takes no action for line size truncation or expansion. If an exception handler exists to handle this notification, a more intelligent action such as line wrap-around may be taken.
This illustrates increasing a program's functionality through the use of 1:1 ignorable inter-process exceptions.

3.4.2.3. Hash Table Example

In many software systems, a program may be made more efficient by being able to invoke an exception mechanism which runs an alternative algorithm under certain exceptional situations. If no exception mechanism is present, the program still runs correctly, but more slowly. As an example, consider a program which implements a hash table. If the hash table chains get too long, say greater than 100 entries, an information signal is sent to the caller of the routine. If the caller wishes to handle such a signal, it can call an alternative algorithm (i.e. an exception handler) for rehashing the table entries, which will reduce the chain lengths to a reasonable number, allowing subsequent calls to proceed more efficiently. If no alternative algorithm is available, the notification signal is simply ignored by the caller. This approach could be used in a compiler implementation.

3.4.2.4. Slow Graph Plotter Example

Another example where an alternative algorithm could be useful is a slow graph plotter. If a large plot were requested and there was already a long backlog of work, an exception notification could be made to the caller. The caller could then invoke a time-consuming algorithm to optimise the plotter steps for that job, thus reducing the estimated plotting time. If the caller provided no alternative algorithm, the plot would still be correct, but would take a long time.

This approach could be extended in a distributed system so that alternative algorithms for calculating the plotter steps would all execute concurrently. When the plotter became available to plot the job, the algorithm which had completed with the fewest steps would be chosen, and the other algorithms would be aborted.

3.4.2.5.
Mesa File-server Example

The Mesa file server described previously in Chapter 2 illustrates how a program that provides certain basic default facilities may subsequently be enhanced by using notification to an exception handler. The program runs to a minimal specification without the exception mechanism, and may provide further functionality using the exception mechanism. If the client does not provide a handler for the PleaseRelease notification, the notification is simply ignored. Otherwise, the client provides increased functionality by attempting to release its files. This is an example of increasing the cooperation between clients by using ignorable broadcast inter-process exceptions.

CHAPTER 4

Program Paradigms for Implementing System Dependent Features

This Chapter presents program paradigms employing the design principles described in the general model for two example problems. The program models are given for different operating systems, and in different high level languages. High level languages are discussed because concurrent programming is the basis of both applications and systems, and one wishes to see uniformity through all levels of the system. Many designers favour a language-based approach to distributed operating systems, with the aim of providing uniform access to user- and kernel-supplied resources, both locally and remotely. For example, programmers at the University of York in England have decided to use ADA [USA 80] to implement a distributed UNIX, and they will only allow users ADA-like tasks, not UNIX-like processes.

The model, shown in Figure 3.3, proposes designs for achieving three objectives: viz. A, reducing the average run-time costs; B, reducing exception handling costs; and C, increasing program functionality. For all these objectives, a general-purpose and efficient inter-process exception mechanism is necessary.
For objective A, in a multi-process environment, it is important to ensure that the inter-process exception notification costs do not overwhelm the normal-case computation costs. Minimising the inter-process exception notification costs is a declared part of objective B (see Figure 3.5). And for achieving an incremental increase in a program's functionality, a general-purpose inter-process exception mechanism is essential (see Figure 3.6). Mechanisms to achieve this are system dependent, so a general model cannot easily be described. Instead, a set of program paradigms is presented for achieving inter-process exception notification. The 1:1 inter-process exception notification is illustrated by a problem which often arises in concurrent programming involving exception notification of asynchronous events: that of a window manager process which multiplexes i/o from two or more devices simultaneously, the so-called KEYBOARD-MOUSE problem.

The model also proposes the use of ignorable exceptions for increasing a system's functionality; again, the implementation of ignorable 1:many inter-process exception notifications is system dependent, so a set of program models for exploiting such exceptions is described. The problem chosen is that of a storage allocator client/server system, where the cooperation between the clients is enhanced by the server's ignorable, broadcast inter-process exceptions.

Finally, this Chapter discusses program models for achieving mutual exclusion between processes, as it is often logically advantageous to use a separate exception handler process to receive ignorable signals synchronously, rather than a procedure in the client process, both to increase concurrency and to emphasise the independence of the exception handling code from the normal-case code.
The implementation of mutual exclusion is system dependent, and several program paradigms are given for implementing the mutual exclusion necessary when exception handlers are implemented as separate processes. We show that atomic transactions are helpful for implementing exception handlers as separate processes.

4.1. Program Paradigms for Inter-process Exception Notification

4.1.1. The KEYBOARD/MOUSE Example

Efficient and general inter-process exception notification mechanisms are needed to reduce exception handling costs in multi-process systems, and also to distribute exception handling by using multiple processes, where the invoked process (the exception handler) is not necessarily related to its invoker. The example problem is concerned with synchronising simultaneous input from two or more input devices, such as a window manager process which multiplexes input/output by reading input from a keyboard and a mouse simultaneously. The problem lies in ensuring that an exception such as <BRK> is processed within a critical real time, whilst other i/o events are processed as expediently and fairly as possible. Furthermore, it should be possible for the user to control the arrival of input from the various i/o devices. For example, when no mouse input is required or expected, the window manager does not wish to receive input from the mouse whenever it is accidentally moved, as processing such messages could slow down the system considerably. Thus the keyboard input may be considered as normal-case, and mouse input as exceptional. A new solution for this problem, derived from the model, is described for Thoth-like operating systems.
We then discuss how modern concurrent programming languages allow one to model these designs, within the framework of the three types of concurrent programming languages described by Andrews and Schneider in their survey Concepts and Notations for Concurrent Programming [Andrews 83]; viz. monitor-based languages, message-oriented languages and operation-oriented languages.

4.1.2. Exception Notification in a Synchronous Message-passing System

In the synchronous message-passing environment of Thoth-like operating systems, asynchronous input-output can readily be modeled by a window manager with two workers, KEYBOARD (KB) and MOUSE (i.e. one worker for each kind of asynchronous event). This is shown in Figure 4.1 below. The device server X at the level below demultiplexes the input and makes a Reply to the appropriate worker.

Figure 4.1. Thoth Server Notification Using Two Worker Processes

The window manager receives notification fairly when input is ready on each device, provided the operating system is fair in its scheduling. Thus the problem of scheduling has been pushed down to the level below the server.

For a message-based system, it is not easy to provide an efficient way to control which input devices are selected. For example, if the mouse reader reported each move to the window manager, including accidental moves from the user, there would be two process switches for each move, which would slow the system down considerably. We propose a solution to this problem, in which the window manager provides a switch for its clients to toggle on or off via a simple request. The window manager transmits this request to the level below, say (server) process X. This is illustrated in Figure 4.2.

Let us call the client requests to toggle the switch PleaseNotifyMouse and NoNotifyMouse. The server X must maintain a toggle for each input device.
Whenever X has input, X checks its toggle for that input device (initially all are on). If the toggle is on, X demultiplexes the input and replies or queues it for the appropriate reader as before.

Figure 4.2. Thoth Switch Notification

After the client makes a NoNotifyMouse request to the window manager, the toggle for mouse input in X is changed to off. When the toggle is off, X simply discards its input data from that device. Only when the client makes a PleaseNotifyMouse request to the window manager will the mouse data be sent to the level above. The masking of the mouse data can be pushed as low down as required, with appropriate additions for the toggle at the client-server interface at each level. This solution has been implemented by the author in another context, that of reading <BRK> interrupts from a remote user over the X.25 network. In that context, the toggle had to be reset after every data input, causing an overhead of two extra process switches on every normal-case input. The author deemed this too inefficient a solution, and designed another solution for the problem employing an inter-process exception from server to client, which is described fully in Section 5.1. Thus the toggle switch tool provides an efficient and convenient mechanism for masking asynchronous events, provided that the switching is performed only rarely with respect to the receipt of normal input items.

4.1.3. Exception Notification in Procedure-oriented Concurrent Languages

In procedure-oriented (or, equivalently, monitor-based) languages such as Mesa [Mitchell 79], used to implement the single-user operating system PILOT [Lampson 80], and Modula-2 [Wirth 82], used for developing a real-time operating system called MEDOS on the Lilith computer, monitors were designed to provide a very cheap method of inter-process communication [Hoare 74].
In the Modula and Mesa languages the monitor approach often necessitates that a process finishing with a resource must explicitly notify (or signal) other processes waiting for it, so that they will be awakened. These other processes are said to be waiting on a condition variable. Such use of a condition variable, or global flag, is unstructured and difficult to program correctly. The notification is easily forgotten, as the notify is not an essential part of the releasing process's action; one might even criticise this as a violation of the abstraction of the process: such implementation details should not be the concern of the programmer at that level, but should be hidden by being implemented at the level below. To overcome these difficulties, the Mesa implementors placed a timeout on every condition variable so that missing notify signals would not block processes for ever. An experiment which increased this system-set timeout from 200 msecs to an effectively infinite value revealed that many such notifications were missing [Levin 82]. Further difficulties with the interaction of monitors, processes and exception signals have been described by Lampson in [Lampson 80].

Now a notify signal in a monitor will only wake up processes which are already waiting on that condition variable; it cannot be used for general-purpose inter-process communication. In Mesa, the only signal available for inter-process exceptions is ABORT. Thus there are problems in using monitor-based languages for distributed systems where exception communication is required between arbitrary, unrelated processes. However, a program model to achieve this in Mesa is described in Section 4.2.2.

Wirth discusses the problem of multiplexing asynchronous input from a keyboard and a mouse in his book on Modula-2 [Wirth 82].
He suggests that the user execute a BusyRead which, in contrast to the conventional Read, does not wait for the next keystroke on the keyboard, but instead immediately returns a special value if no character is found. The code for detecting whether input has occurred consists of a loop containing a test to detect whether a character has just been read and the mouse has moved, followed by a call of BusyRead which returns either a character from the keyboard or the special value. Such a busy-wait loop is clearly only practical on a dedicated single-user processor, as cpu cycles are unavailable for other work. Of course, one could then resume a background process after these tests. A better approach is to have a multiplexor which polls for the various events, run once every clock tick. The Macintosh uses this technique of one consolidated event queue, managed by the operating system. The user executes get-next-event and pauses there until an i/o event occurs [Macintosh 84]. Thus we observe that in procedure-oriented concurrent languages, one software tool useful for inter-process exceptions is an event queue managed by the operating system.

4.1.4. Exception Notification in Message-oriented Concurrent Languages

Andrews and Schneider define message-oriented high level languages as those which provide send and receive as the primary means for process interaction, and they assume no shared memory between processes. Message-oriented languages such as CSP [Hoare 78] and PLITS [Feldman 79] are intended for developing multi-user operating systems. The author observed that the survey's treatment of message-passing synchronization primitives did not mention the synchronous Send-Receive-Reply inter-process communication primitives of Thoth; that omission was remedied in a subsequent letter to Surveyor's Forum [Atkins 83c].
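The Thoth-style synchronous Send-Receive-Reply rendezvous mentioned above can be sketched without shared memory, using only message queues. This is an illustrative model under our own naming, not Thoth's kernel implementation: a Send blocks the client until the server issues a Reply, while Reply itself never blocks.

```python
import queue
import threading

class Process:
    """A minimal model of a Thoth process: one mailbox, one reply slot."""
    def __init__(self):
        self.mailbox = queue.Queue()              # incoming Send messages
        self.reply_slot = queue.Queue(maxsize=1)  # where a Reply arrives

    def send(self, dest, msg):
        """Thoth Send: deliver msg, then block until dest replies."""
        dest.mailbox.put((self, msg))
        return self.reply_slot.get()   # sender is now reply-blocked

    def receive(self):
        """Thoth Receive: block until some client's Send arrives."""
        return self.mailbox.get()

    @staticmethod
    def reply(client, result):
        """Thoth Reply: non-blocking; unblocks the reply-blocked client."""
        client.reply_slot.put(result)

# a trivial server process that doubles whatever it receives
server, client = Process(), Process()

def serve_one():
    sender, msg = server.receive()
    Process.reply(sender, msg * 2)

threading.Thread(target=serve_one).start()
answer = client.send(server, 21)   # blocks until the server replies with 42
```

Because all state lives inside each process's queues, the same structure carries over to languages that forbid shared memory, which is the point made in the following paragraph.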
It has already been shown how the synchronous Thoth-like environment may be used to model the problem of multiplexing asynchronous input from two or more devices; however, Thoth processes on the same team may share memory. We must therefore ensure that the problem can be modelled without the use of shared memory if this program design is to be applicable to a high level language such as PLITS or CSP. The main use of shared memory is to save copying data, but data can obviously be transferred by messages if necessary. The second use of shared memory is in implementing synchronization between processes on a team; this can be achieved by a semaphore. Hence the solution outlined above for Thoth will work directly in message-oriented languages where no shared memory is available.

Nelson remarks in his thesis on Remote Procedure Calls [Nelson 81] that the need for asynchronous exceptions between processes has long been recognised and implemented in the message-passing world: PLITS, Medusa and RIG all provide some method of inter-process asynchronous exception. He continues by noting that the RIG implementors, in particular, found this capability essential for designing reliable systems. However, it is shown here that the toggle switch mechanism provides a suitable synchronous inter-process exception notification tool for message-based systems; thus asynchronous notification is not absolutely essential for good performance.

4.1.5. Exception Notification in Operation-oriented Concurrent Languages

Operation-oriented languages provide the remote procedure call as their primary means for process interaction. ADA [USA 80] and SR [Andrews 82], intended for designing multi-processor systems, are examples of such languages. As in procedure-oriented languages, operations are performed on an object by calling a procedure.
The difference is that the caller of an operation synchronizes with the so-called caretaker that implements it while the operation is executed. In ADA this synchronization is called a rendezvous.

A rendezvous presents problems for general inter-process communication, as the server task can only be communicating with one active process at a time, allowing no asynchronous inter-process notification. For inter-process exceptions in ADA only the FAILURE signal can be given, so this mechanism cannot be used for general notification. A server process in ADA can multiplex calls by nesting the accept statements that service the calls. This technique is not yet widely used, and as Andrews and Schneider point out, implementation of some of the concurrent programming aspects of ADA is likely to be hard.

SR (Synchronizing Resources) [Andrews 81,82] also uses the rendezvous form of the remote procedure call. However, both asynchronous and synchronous message passing are supported, and Andrews has achieved successful implementations of servers in SR using these operations. Only time will tell if the lack of specific software tools in ADA for inter-process exceptions will hinder its growth as a software language.

4.1.6. Exception Notification in Distributed Systems

In distributed operating systems, where the user process may be located on a different host from the one to which the terminal is connected (e.g. during a remote login session), a mechanism must be available to divert the asynchronous interrupt to the remote host. One of the special difficulties with asynchronous inputs in a distributed system is the lack of shared memory: a hardware interrupt on one processor cannot force a transfer of control to a process executing concurrently on another processor, so some kind of software interrupt facility is needed.
In communication protocols such as X.25 [CCITT 76], a special out-of-band message (an interrupt packet) is used to transmit such priority information to the remote host, and this must be received as soon as possible by the target process. An interactive terminal interface protocol is used on the remote host to simulate the terminal; however, this may run at a much higher level than the local terminal server. Special facilities may therefore be required to achieve the desired effects on <BRK>. The author designed a mechanism, discussed in the ITI network protocol example in Chapter 5, to solve this problem in both Verex and UNIX.

4.2. Program Paradigms for Ignorable 1:many Exceptions

4.2.1. The Storage Allocator Example

In his thesis, Levin proposes an important and interesting mechanism for inter-process exceptions, by designing into a high level language a means for the owner of a shared abstraction (typically a server) to notify ALL (or some number) of its clients that an exception has occurred. He calls such exceptions on a shared object structure class exceptions. All the processes that have used, and are still using, the service may receive exception signals, as
Yet there is no reason to suspend the current allocation request while the condition is being handled; the two actions are logically independent and should proceed (conceptually at least) in parallel. Hence the exception P O O L - S C A R C E is described as an ignorable asynchronous broadcast exception in our classification. T h e allocator may also find it impossible to satisfy a resource request, and may raise a P O O L - L O W exception with its clients, expressing its urgent need for storage. Naturally, the request will remain pending while this exception is being processed. P O O L - L O W is also classified as an ignorable asynchronous broadcast exception. T h e allocator, even after the above, may not possess adequate resources to satisfy the request. In this case, it must raise a P O O L - E M P T Y exception, signifying its inability to meet the requestor's demands. P O O L -E M P T Y cannot be ignored by the client making the request; thus P O O L - E M P T Y is classified as a 1:1 synchronous exception which must-be-handled. Levin's proposed allocator is shown in Figure 4.3 below 1 . 
Because Levin's so-called sequential-conditional selection policy (described in detail later) could cause deadlock if the pool module is implemented as a moni-1 We have altered the notation from Alphard [Wulf 76] to a Mesa-like notation in which Levin's condition variables are replaced by exception variables to avoid confusion with monitor condition variables 103 StorageAllocator: MODULE = begin P free exception exception pool; integer; POOL-LOW policy sequential-conditional; POOL-EMPTY policy broadcast; end; raises POOL-LOW on pool, POOL-EMPTY on Allocate; Allocate: ENTRY PROCEDURE (p:pool, size: integer) returns (ddescriptor) = begin P(outer); P(inner); if p.free < size then begin V(inner); raise POOL-LOW until p.free > = size; P(inner); if p.free<size then begin V(inner); V(outer); raise POOL-EMPTY; end; end; p.free := p.free - size; < create descriptor d for segment of amount size > V(inner); V(outer); end; Release: ENTRY PROCEDURE (p:pool, d: descriptor) begin P(inner); p.free := p.free + < size of segment referenced by d > < release space associated with d > V(inner); end; Figure 4.3 Levin's Storage Allocator tor, Levin achieves mutual exclusion and condition synchronization using the two semaphores, inner and outer. T h e function of these semaphores is explained in detail by Levin , and also 104 by Black in [Black 82]. The author proposes a new approach which is implementable in exist-ing systems, both monitor and message-based, by following the design techniques specified in the model. Program paradigms are presented for these implementations in Mesa, a language supporting monitors, and in Verex, a message-based operating system. The storage allocator is very simple. Allocate accepts a pool and size as parameters, obtains sufficient storage from the pool and returns a reference to it. If adequate storage is not available to do this, the POOL-LOW exception is generated. 
It is raised in turn with the users of the pool until either adequate resources become available, or there are no more handlers. If the handlers do not succeed in releasing enough store, the POOL-EMPTY exception is generated. The Release operation is straightforward.

A pool user has the form described in Figure 4.4 below. The pool user is prepared to handle POOL-LOW at any instant. Except during the time when segments are being added to m, POOL-LOW is handled by the squeeze procedure. But during critical data-structure updates to m, invocation of squeeze is inhibited by a dummy (masking) handler for POOL-LOW that does nothing; in this situation, the POOL-LOW exception is lost.

    Userpool : MODULE =
    begin
      p : pool;
      m : hairy-list-structure;
      exception NOSOAP policy broadcast;
      raises NOSOAP on somefunc;

      somefunc: PROCEDURE (<info>) =
      begin
        d1, d2 : descriptor;
        d1 := Allocate (p, amt1) [POOL-EMPTY: raise NOSOAP; return];
        d2 := Allocate (p, amt2) [POOL-EMPTY: Release (d1); raise NOSOAP; return];
        < fill in d1 and d2 >
        begin
          add-to(m, d1);
          add-to(m, d2);
        end [POOL-LOW: ]
      end;

      squeeze: PROCEDURE () =
      begin
        < perform data-dependent compaction of m using Release(p,d)
          to release any elements d removed by compacting m >
      end;
    end [POOL-LOW: squeeze() ]

Figure 4.4 A User of Levin's Storage Allocator

Levin notes the differences in synchronization and communication requirements among the structure class exceptions POOL-SCARCE, POOL-LOW and POOL-EMPTY and their handlers, and proposes a new mechanism to solve the allocator problem. He defines a selection policy for each exception, and describes three such policies: broadcast-and-wait, sequential-conditional and broadcast.

Under the broadcast-and-wait policy, all eligible handlers are invoked in parallel. When all handlers thus initiated have completed, execution resumes immediately following the raise statement (like Mesa's SIGNAL). Thus, while all eligible handlers execute in parallel (conceptually at least), the signaller is suspended until they have all completed. We will call this a Reliable Broadcast Send. The server, or the operating system supporting such servers, needs a list of all its clients, as it must delay until all the handlers are done.

In his example, Levin does not use this policy for raising the POOL-LOW exception; instead he uses the sequential-conditional policy. This policy has a predicate associated with the raise statement. The predicate is evaluated and, if TRUE, execution continues following the raise statement. If the predicate is FALSE, an eligible handler is selected and initiated. When it completes, the predicate is evaluated again.² The raise statement terminates when either the predicate becomes TRUE, or the set of eligible handlers is exhausted. Thus in his example the exception is raised only as many times as needed to satisfy the current request; it is therefore more efficient for any single request. However, in many situations, if the pool is low for one request, there is a high probability that the next request will also find the pool low. It would seem more advantageous to use the broadcast-and-wait policy, so that all the clients will squeeze their data.
This may save subsequent shortage, in the same way that page look-ahead attempts to reduce page faults by anticipating subsequent requests. But the server then has to wait for all the handlers to complete before it can continue; this can delay the current request unnecessarily if, say, the first handler to complete released sufficient resources to satisfy the current request.

Levin also proposes a variation of broadcast-and-wait, simply called broadcast, which relaxes the completion requirement. All eligible handlers are initiated in parallel, but the signaller does not wait for any of them to complete. It merely continues execution following the raise statement. Thus all handlers and the signaller execute in parallel. Levin's formal proof rule for this policy requires that the signaller and handlers act nearly independently. These are exactly the semantics he intended: the broadcast policy should normally be used to report to users of an abstraction a purely informational condition. (He notes that one can relax the proof rules somewhat, but then we risk serious interactions with parallelism and synchronization mechanisms.) This selection policy could be used when the storage falls below a certain threshold; the server sends a POOL-SCARCE exception to a broadcast communication channel to which all clients subscribe. On receipt of the notification, each client takes appropriate action to release excess storage. This selection policy, unlike the others, does not require that the server keep a list of client handlers, as the signaller acts independently of the handlers. If the signaller broadcasts to all subscribers and some do not receive the message, the signaller is not directly affected.

² Making sure each time that a different handler is selected.
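The broadcast policy just described can be sketched as a fire-and-forget notification: the signaller posts POOL-SCARCE to each subscriber's queue without ever blocking, and a subscriber whose queue is full simply misses the notification. The queue-based delivery and the names below are our illustrative assumptions about the policy's semantics, not Levin's proposed language mechanism.

```python
import queue

class BroadcastChannel:
    """Fire-and-forget broadcast: the signaller never waits on handlers."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self, maxlen=1):
        q = queue.Queue(maxsize=maxlen)
        self.subscribers.append(q)
        return q

    def raise_exception(self, name):
        for q in self.subscribers:
            try:
                q.put_nowait(name)   # deliver if the subscriber can take it
            except queue.Full:
                pass                 # delivery missed: exception is ignorable
        # the signaller continues immediately; no completion requirement

channel = BroadcastChannel()
clients = [channel.subscribe() for _ in range(3)]
channel.raise_exception("POOL-SCARCE")   # returns at once
channel.raise_exception("POOL-SCARCE")   # queues are full: silently missed
```

Note that nothing here requires the signaller to know whether any handler ran, which is precisely the property exploited next when a truly unreliable network broadcast is substituted for the per-subscriber queues.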
Thus we can use a so-called Unreliable Broadcast to implement Levin's broadcast policy, and the interesting feature is that it can be implemented without a centralised list of subscribers at the signaller. Instead, in a distributed system, all the system needs is to guarantee to try to deliver the message to all hosts in the system. Hence the server does not have the burden of maintaining a list of clients' handlers, nor is it concerned with garbage collection of clients which are no longer interested in the exception. As several local area networks provide an efficient broadcast feature, we would like to exploit it for such broadcast exceptions. An unreliable broadcast has been implemented by Cheriton for the V-kernel [Cheriton 84], and a multicast feature has recently been implemented in UNIX 4.2BSD [Ahamad 85]. But to use such an unreliable broadcast in this problem, we first need to show that we can redesign Levin's storage allocator to exploit a broadcast signal, instead of the broadcast-and-wait signal which he used. We then discuss how we could use the Medusa, Mesa and V-kernel operating systems as virtual machines to provide tools to support this high-level language construct.

4.2.2. Storage Allocator Using Unreliable Broadcast Notification

Black discusses various alternatives to Levin's storage allocator in his thesis (pp 120-129), and proposes a solution in CSP [Hoare 78] which uses a reliable broadcast from the server to a list of its clients. His solution persists, in that if an Allocate request cannot be satisfied immediately it continues to signal the POOL-LOW exception to all the user processes until enough store is available. During this time no further Allocate requests are accepted, but Release requests will be handled correctly.
We consider this an undesirable solution; a timeout is necessary so that the client will not be suspended for arbitrarily long periods should no client be ready to execute Release. We propose implementing the allocator using monitors, condition variables, and their associated WAIT and SIGNAL operators, which were designed [Hoare 74] for handling synchronization and data access in just such situations. Now Levin and Black dismiss the notion of using a monitor to achieve the necessary features because of deadlock. If a call of Allocate generates a POOL-LOW exception and a client handler attempts to call Release, the handler will be suspended awaiting completion of the Allocate request. However, Allocate cannot complete until at least one of the handlers completes. This problem arises specifically because of the semantics of Levin's sequential-conditional policy, whereby a signaller cannot continue until an eligible handler has completed. This problem of managing resources in monitors was recognized earlier, and has been described by Lampson and Redell in their implementation of PILOT [Lampson 80] in Mesa. They resolved it by use of a WAIT on a condition variable, which releases the monitor lock. Our solution requires that the POOL-LOW exception obey the broadcast selection policy, because then a monitor can be used for managing the storage pool. This solution is portrayed in Figure 4.5, where a Mesa-like condition variable MoreAvailable with an associated timeout is used.[3]

[3] Note that the semaphore inner is not needed here, as the monitor protects the critical data from concurrent access.
StorageAllocator: MONITOR =
begin
    p : pool;
    free : integer;
    exception : POOL-LOW policy broadcast;
    exception : POOL-EMPTY policy broadcast;
    condition variable : MoreAvailable, Squeezedone;
    Boolean : REQUEST-PENDING = FALSE;

    Allocate: ENTRY PROCEDURE (p: pool, size: integer) returns (d: descriptor) =
    begin
        if REQUEST-PENDING then WAIT Squeezedone;       | P(outer)
        if p.free < size then
        begin
            REQUEST-PENDING := TRUE;
            raise POOL-LOW;
            while p.free < size do
            begin
                WAIT MoreAvailable;
                if <timeout on MoreAvailable fired> then
                begin
                    raise POOL-EMPTY;
                    REQUEST-PENDING := FALSE;
                    NOTIFY Squeezedone;                 | V(outer)
                    return;
                end;
            end;
        end;
        p.free := p.free - size;
        < create descriptor d for segment of amount size >
        REQUEST-PENDING := FALSE;
        NOTIFY Squeezedone;                             | V(outer)
    end;

    Release: ENTRY PROCEDURE (p: pool, d: descriptor) =
    begin
        p.free := p.free + < size of segment referenced by d >;
        < release space associated with d >
        NOTIFY MoreAvailable;
    end;
end;

Figure 4.5  Storage Allocator Using a Monitor and Broadcast Exceptions

In our program paradigm, the Allocate procedure raises the POOL-LOW exception signal to all its clients when storage is inadequate, and then executes a WAIT on the MoreAvailable condition variable, with an appropriate timeout. The WAIT releases the monitor lock, thus enabling handlers to execute Release. Whenever some store is released to the pool, a NOTIFY signal on the MoreAvailable condition is executed. The Allocate procedure can then try again to satisfy the request. If the timeout fires,[4] the procedure Allocate can raise the POOL-EMPTY exception and return as before. Allocate requests made during squeezing are delayed on the condition variable Squeezedone, which acts like the semaphore outer in Levin's model. Now that we have designed a storage allocator monitor which uses a notification broadcast to all its clients, we develop program models for its implementation in various languages and systems.
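For comparison with Figure 4.5, here is a minimal sketch of the same monitor in Python, assuming `threading.Condition` plays the role of MoreAvailable and `wait_for(..., timeout)` the Mesa timeout. POOL-LOW is modelled as a callback and POOL-EMPTY as an exception, and the Squeezedone serialisation is omitted for brevity; all names are ours, not the thesis's.

```python
import threading

class PoolEmpty(Exception):
    """Raised when the timeout on MoreAvailable fires."""

class StorageAllocator:
    def __init__(self, free, on_pool_low=lambda: None):
        self.free = free
        self.more_available = threading.Condition()  # MoreAvailable
        self.on_pool_low = on_pool_low               # models 'raise POOL-LOW'

    def allocate(self, size, timeout):
        with self.more_available:                    # ENTRY: monitor lock
            if self.free < size:
                self.on_pool_low()                   # notify the clients
                # WAIT releases the monitor lock, so handlers can Release
                if not self.more_available.wait_for(
                        lambda: self.free >= size, timeout=timeout):
                    raise PoolEmpty(size)            # timeout fired
            self.free -= size
            return size          # a real allocator would build a descriptor

    def release(self, size):
        with self.more_available:
            self.free += size
            self.more_available.notify_all()         # NOTIFY MoreAvailable

alloc = StorageAllocator(free=10)
# A peer exception-handler thread squeezes (releases 90 units) on POOL-LOW:
alloc.on_pool_low = lambda: threading.Thread(
    target=alloc.release, args=(90,)).start()
got = alloc.allocate(50, timeout=5.0)
```

The handler thread blocks on the monitor lock until the waiter's `wait_for` releases it, exactly the property that makes the deadlock of the sequential-conditional policy disappear.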
4.2.3. Storage Allocator in Mesa

The allocator example as it stands is almost exactly implementable in Mesa, except for the problem of raising the broadcast exception POOL-LOW. We have already seen one solution in the Mesa cooperating-files example, where each client provides a procedure parameter to the server. The server invokes each client's procedure in turn until enough storage has been released. Unfortunately this requires the server to maintain a list of clients and their call-back procedures. The alternative in our program paradigm is for each client to WAIT on a new condition variable PoolLow, which is signalled by the server, when the exception is detected, with a NOTIFY BROADCAST PoolLow. Unfortunately the clients would be unable to handle any other business while they were blocked on the PoolLow condition variable. This could be resolved by each client providing a peer exception-handler process which waits on the PoolLow condition variable and which executes squeeze synchronously whenever the condition is signalled. This solution also has the effect of increasing concurrency. The implementation of condition variables in Mesa still involves a queue of waiting processes held in the monitor, but this list is maintained by the operating system rather than by the monitor's programmer. Software tools for efficient implementation of this system are required for the cooperating processes, so that when a client process dies, its peer exception handler dies also, and is (eventually) removed from the monitor's queue. This process-death notification should be made only when (or if) the condition variable is notified; otherwise the monitor would receive arbitrary unwanted interrupts from the operating system whenever a client process died.

[4] Remember we do not have a list of clients, so we cannot tell when ALL the handlers have completed.
Furthermore, condition variables as specified by Hoare would not be adequate: the Mesa condition variables are implemented with timeouts, and a signal on a condition variable signifies a notification hint; this implementation of monitors is required for correct operation. It is not clear how efficient such a mechanism could be in Mesa, particularly when some clients are executing on remote machines. Presumably the overheads are quite high, as the implementors of the PleaseRelease cooperating files did not use this approach. However, the machinery is present and could be used. There is just one problem with this scheme for synchronous notification, and that is in ensuring mutual exclusion of access to the data shared by the client and the exception process while it is squeezing the data. Atomic actions, or shared memory and guaranteed progress of a process until it voluntarily relinquishes the cpu (such as for Thoth team-members), would be sufficient to provide mutual exclusion; various designs are discussed later in Section 4.3.

4.2.4. Storage Allocator in Medusa

Medusa defines external condition reports, which are based on Levin's proposed structure-class exceptions, which can be raised on shared objects. An external report is raised by any activity with access to a shared object which incurs an object exception. For each shared object in the system there are 8 classes of object exception, some of which are signalled by the utilities (e.g. parity failure may occur on memory pages), and others may be defined and signalled by user activities. When an object exception is signalled, all other activities with access to the object receive an external report of the exception. These external reports are made using flagbox objects which are managed by Medusa's exception-reporter utility. Each activity can read the flags in the flagbox synchronously when it wishes, or it can request to be interrupted when a particular flagbox becomes non-empty.
In this latter mode, an external report is similar to a UNIX software signal interrupt. The major disadvantage of the Medusa external mechanism is that backpointers are needed from every shared object to its users. As we have already shown, such tight coupling is neither necessary nor desirable; in many applications an unreliable broadcast notification from the server to its (unknown) clients is sufficient.

4.2.5. Storage Allocator in V-kernel

In Thoth-like message-passing environments, an extension of the server-client notification schemes already proposed in Section 2.3.1 requires each user to register an exception process to be notified. The server is responsible for maintaining such a list, and for garbage collection of clients who are no longer interested in the exception. This solution suffers from the same disadvantage as Mesa's and Medusa's: backpointers are needed from the server to its users. Our new solution uses a multicast inter-process communication primitive, which we call a Broadcast (or Multicast) Send (BSend) [Cheriton 84]. This has been implemented in the V-kernel [Cheriton 83c] and in UNIX 4.2BSD [Ahamad 85]. A server can inform clients of a group-id on which exceptions can be sent. The server uses the non-blocking BSend to the group-id to notify the exception, and clients can choose to receive from that group-id either with a separate exception process, or with their main process (if the client is itself a server waiting on a Receive). A program model for a storage allocator server in the V-kernel is shown in Figure 4.6.
pool-server()
begin
    integer id, group-id, CLIENT-ID, AMT, reply, THRESHOLD = 100;
    Boolean REQUEST-PENDING = FALSE;
    message msg[MSG-SIZE];
    pool p;
    repeat
    begin
        id := Receive(msg);
        select (msg.REQUEST-CODE)
        begin
        case Allocate:
            begin
                if REQUEST-PENDING then
                begin
                    queue-request(id, msg);            | P(outer)
                    reply := NO-REPLY;
                    continue;
                end;
Alloc:          if p.free < msg.SIZE then
                begin
                    REQUEST-PENDING := TRUE;
                    BSend(<message POOL-LOW>, group-id);
                    CLIENT-ID := id;
                    AMT := msg.SIZE;                   | Implements WAIT, as
                    reply := NO-REPLY;                 | no reply to client
                    < start timer >
                    continue;
                end;
                p.free := p.free - msg.SIZE;
                satisfy-allocate(id, msg.SIZE);        | makes reply to id
                reply := NO-REPLY;                     | no need to reply
            end;
        case Release:
            begin
                < update p.free >;
                if REQUEST-PENDING then
                begin                                  | Implements NOTIFY
                    if p.free >= AMT then
                    begin
                        satisfy-allocate(CLIENT-ID, AMT);
                        REQUEST-PENDING := FALSE;
                    end;
                end;
                reply := OK;                           | to releaser
            end;
        case Timeout:
            begin
                if REQUEST-PENDING then
                begin
                    Reply(<message POOL-EMPTY>, CLIENT-ID);
                    REQUEST-PENDING := FALSE;
                end;
                reply := NO-REPLY;                     | to timer
            end;
        end;
        if reply ≠ NO-REPLY then
        begin
            msg.REPLY-CODE := reply;
            Reply(msg, id);
        end;
        if ((¬ REQUEST-PENDING) ∧ (requestq ≠ NULL)) then
        begin
            dequeue-request(id, msg);
            goto Alloc;                                | V(outer)
        end;
    end;
end;

Figure 4.6  Pool-server in V-kernel

The server uses BSend to notify any interested clients of the POOL-LOW exception. Clients obtain access to group-id (on which the exception is broadcast) through some initial request to the server, not shown here. The procedure satisfy-allocate(id, size) creates a descriptor for the requested segment of size resource units, and then executes a Reply to the process id. The procedures queue-request(id, msg) and dequeue-request(id, msg) save and restore any Allocate requests which are received during a REQUEST-PENDING. Thus the queue of waiting Allocate requests is explicitly maintained by the server, instead of by the monitor's condition variables.
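The control flow of Figure 4.6 can be re-enacted in a single-threaded Python sketch (a toy model: tuples stand for messages, a deque for the message port, and a log records the BSend and the Replies made by satisfy-allocate; the names are illustrative):

```python
import collections

def run_pool_server(free, messages):
    """Replays the pool-server's event loop: queue-request while a
    request is pending, BSend POOL-LOW on shortage, and satisfy the
    pending client when a Release makes enough storage available."""
    request_pending, pending_client, pending_amt = False, None, 0
    requestq = collections.deque()      # explicit queue of Allocates
    msgs = collections.deque(messages)
    log = []
    while msgs:
        kind, client, size = msgs.popleft()
        if kind == "Allocate":
            if request_pending:
                requestq.append((client, size))      # queue-request
            elif free < size:
                request_pending = True               # implements WAIT:
                pending_client, pending_amt = client, size
                log.append(("BSend", "POOL-LOW"))    # no reply to client yet
            else:
                free -= size
                log.append(("satisfy", client, size))
        elif kind == "Release":
            free += size
            if request_pending and free >= pending_amt:
                free -= pending_amt                  # implements NOTIFY
                log.append(("satisfy", pending_client, pending_amt))
                request_pending = False
                if requestq:                         # V(outer): redo queued
                    msgs.appendleft(("Allocate",) + requestq.popleft())
    return free, log

free, log = run_pool_server(
    free=10,
    messages=[("Allocate", "A", 50), ("Allocate", "B", 5),
              ("Release", None, 100)])
```

Here client A's request waits for the Release, client B's request is queued behind it, and both are satisfied once storage arrives, mirroring the WAIT/NOTIFY behaviour encoded in the server's explicit state variables.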
One advantage of this approach is that the server can make arbitrary scheduling decisions easily, rather than be constrained by the mechanisms provided by the monitor implementation. Another advantage is that the implementations of WAIT and BROADCAST NOTIFY are explicit here; there are no hidden overheads in the operating system's implementation of the condition-synchronisation primitives. However, this event-driven server is longer than the monitor version in Figure 4.5, as it requires extra variables (REQUEST-PENDING and the queue variables) to encode its state and manage its flow control. But this example provides a useful program paradigm for message-based systems which do not provide a monitor construct or separate ports for different types of server requests. A corresponding pool user is shown in Figure 4.7.

An alternative approach is to assume that a new Allocate request occurs only infrequently when the server is in the state REQUEST-PENDING. The server can make an exceptional BUSY reply to the client instead of queueing the request explicitly, and the client must take appropriate action such as delaying before retrying the request. This server does not implement the same scheduling semantics, but by designing the client to maintain the extra state information RETRYING, the server code is simplified. This is the approach advocated in the general model for designing event-driven servers which maintain minimum state, and for message-based systems like Thoth this approach considerably simplifies the server code.

userpool()
begin
    extrn hairy-list-structure m;
    extrn semaphore M-UPDATING;
    integer id, pool-id, reply;
    repeat
    begin
        id := Receive(msg);
        select (msg.REQUEST-CODE)
        begin
        case somefunc:
            begin
                descriptor d1, d2;
                Send(<Allocate-msg>, pool-id);
                if msg.REPLY = POOL-EMPTY then
                begin
                    reply := NOSOAP;                   | to caller
                    continue;
                end;
                d1 := msg.PTR;
                Send(<Allocate-msg>, pool-id);
                if msg.REPLY = POOL-EMPTY then
                begin
                    Send(<Release-msg d1>, pool-id);
                    reply := NOSOAP;
                    continue;
                end;
                d2 := msg.PTR;
                < fill in d1 and d2 from pool-id's message >
                P(M-UPDATING);
                add-to(m, d1);
                add-to(m, d2);
                V(M-UPDATING);
                reply := OK;                           | to caller
            end;
        end;
        if reply = NO-REPLY then continue;
        msg.REPLY-CODE := reply;
        Reply(msg, id);
    end;
end;

exception-handler()
begin
    extrn hairy-list-structure m;
    extrn semaphore M-UPDATING;
    repeat
    begin
        id := Receive(msg);
        select (msg.REQUEST-CODE)
        begin
        case POOL-LOW:
        case POOL-SCARCE:
            begin
                P(M-UPDATING);
                < squeeze by performing data-dependent compaction of m,
                  using Send(<Release d message>, pool-id) to release
                  any elements d removed by compacting m >
                V(M-UPDATING);
            end;
        end;
    end;
end;

Figure 4.7  A User and Exception Handler of the V-kernel Storage Allocator

4.3. Exception Handlers as Processes: Mechanisms for Mutual Exclusion

If systems are designed so that exceptions are handled by a separate process, a problem of synchronization between the exception process and the other processes involved is encountered. This is particularly acute if the exception process executes non-atomically, where a process is defined to execute atomically if it completes in its entirety or not at all, with no visible intermediate states. For an example, consider the V-kernel storage allocator described previously. Mutual exclusion is necessary on the data structure m: while USER is updating it, the exception handler must not be squeezing it.
There are several alternatives for inhibiting squeeze while m is being updated:

(1) reduce concurrency by allowing only the one process, USER, to update m;
(2) use a simple P-V semaphore;
(3) use a hardware test-and-set on a shared variable;
(4) temporarily remove squeeze from the server's list of handlers while USER is updating m;
(5) temporarily mask the action of squeeze in EP while USER is updating m;
(6) rely on non-preemptive process execution: while the exception process executes, USER cannot obtain read/write access to m;
(7) define m to be atomic data which can only be updated or accessed atomically, by so-called atomic transactions.

These solutions are considered in turn below.

(1) First, mutual exclusion could be ensured by making the USER process receive the exception notification from the server, and execute squeeze only when it is free to do so. The server would BSend the POOL-LOW exception, which USER would Receive and act on synchronously. This could be an effective solution when USER acts as a server itself (and thus is usually ready and waiting on a Receive) and when there is no need to gain concurrency by having a separate exception process. To receive the broadcast exception message, the client USER would subscribe to the server's group-id. This would complete a Send-Receive-Reply loop between USER and the server. Although this is considered undesirable for regular Sends, a broadcast Send in the V-kernel, where the server does not expect any Reply, has been implemented, so it is safe to complete such a loop. Reasoning about deadlock in such systems is thus the same as if the broadcast Send were a Reply; the difference is that the client can pick up this kind of Reply with a regular Receive statement.
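Alternative (1) can be sketched as a single USER process that owns m outright and services both normal requests and the broadcast notification from one receive loop (a toy model; the message names are our illustrations):

```python
import collections

def run_user(inbox):
    """One process owns m, so no lock is needed: it squeezes only
    between messages, never in the middle of its own update."""
    m = list(range(20))      # the 'hairy list structure', owned by USER
    released = []
    while inbox:
        msg = inbox.popleft()        # Receive
        if msg == "somefunc":
            m.append(len(m))         # normal update of m
        elif msg == "POOL-LOW":
            while len(m) > 10:       # squeeze: compact m, releasing storage
                released.append(m.pop())
    return m, released

inbox = collections.deque(["somefunc", "POOL-LOW", "somefunc"])
m, released = run_user(inbox)
# The POOL-LOW notification is acted on synchronously, so it can never
# interleave with an update to m.
```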
The remaining solutions all provide a greater degree of parallelism, whereby the process executing squeeze (EP) is distinct from the process updating m (USER).

(2) In Figure 4.7 the P-V semaphore M-UPDATING was used to force mutual exclusion between USER and EP. This technique has the overhead of two extra process switches for each P-V operation. Ideally a solution is required which involves no extra process switches for a normal-case data update (when no other process is accessing the critical data).

(3) Use a shared variable, M-UPDATING, initialised to FALSE, which may be used in an atomic action such as provided by a hardware test-and-set. The USER and EP processes test this before accessing the critical data thus:

    while M-UPDATING do delay(1);
    M-UPDATING := TRUE;
    < critical code >
    M-UPDATING := FALSE;

This has little overhead in the normal case: just two statements in USER and EP to test and reset the boolean flag. In our solution, if an access conflict is found, the second process loops in a delay statement until the resource is freed by the other process. This solution has the obvious drawback that it is neither crash-resilient nor fair. In the exceptional case of a data-access conflict between USER and EP, the simple solution imposes some extra delay on USER and EP, although this may well be balanced by the lack of process switching. This solution therefore follows the principle of minimising the normal-case overhead (by having no context switches), at the expense of some extra delay and cpu cycles in handling the exceptional case of data-access conflict. However, hardware support is required for the indivisible test-and-set operation.

(4) This solution can only be used if a list of clients is held at the server, and a Reliable Broadcast is used to notify the clients.
This solution imposes an overhead of 4 process switches on each normal-case data update (two to notify the server of the change in the list of eligible exception handlers, and two more to reset the list), plus the maintenance of the list.

(5) Masking the effect of squeeze in EP when USER is updating the data structures again requires 4 process switches. First USER executes a Send to EP to set the mask, and then, after the critical update, USER has to Send again to EP to reset the mask. This solution has the advantage over the previous method that the server is independent of its users, so it can be used with the unreliable broadcast.

(6) Exploiting non-preemption between processes on a team to achieve mutual exclusion is available in Thoth-like operating systems, and has been used successfully for many applications. The idea is that a process cannot be interrupted by any other process on the same team; thus if a page fault occurs, only a process on another team can execute, not another process on the same team. This contrasts with the Medusa activities in a task force, which are designed explicitly to execute in parallel. However, when the actions necessary for data updates are complex, involving procedure calls, there is no means of knowing, or specifying, which process actions are atomic (i.e. do not yield the processor by making system calls) and which ones are non-atomic (i.e. do yield the processor, leaving an intermediate state of computation visible to another process). Hence a procedure which happens to yield the processor must not be called. In large systems where modularity is essential, it is not known which procedures yield the processor; thus such restrictions are intolerable and unmaintainable.

(7) Much research has been done on atomic data updates -- all-or-nothing computations [Lomet 77], [Moss 82].
Liskov and Scheifler described a high-level language approach in their paper "Guardians and Actions: Linguistic Support for Robust, Distributed Programs" [Liskov 83]. They promise some high-level language tools for specifying atomic actions in their integrated programming language and system called ARGUS. Their approach has centered on a class of applications concerned with the manipulation and preservation of long-lived, on-line data such as airline reservations. In ARGUS, the user can define atomic data which is accessed by processes called actions (or, equivalently, transactions). An action may complete either by committing or by aborting. When an action aborts, the effect is as if the action had never begun: all modified atomic objects are restored to their previous states. When an action commits, all modified objects take on their new states. Let us see how such ideas could be used in the storage allocator. USER and EP share access to atomic data m. If EP obtains write permission on the data, and then blocks (maybe for some debugging i/o), and then USER tries to execute, USER will be prevented from accessing the atomic data, as EP has a write lock on it. Ideally, USER will be delayed until EP has finished, just like waiting on a P-V semaphore. But in this case, the programmer merely has to specify that the data m is atomic; the implementing system then takes care of the queue on the semaphore. This linguistic tool presents a real advance in concurrent high-level language design, and should certainly be incorporated into new languages. Its implementation in other operating systems has not been studied, as the ARGUS project is an integrated system. The author has designed an implementation of ARGUS's transactions under the UNIX operating system, exploiting the principles of structuring programs for normal-case efficiency, which is described in full in the next Chapter.
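The commit/abort behaviour described above can be illustrated with a toy atomic object whose updates are tentative until committed (locking, isolation, and persistence are omitted; this is our illustration, not ARGUS's actual mechanism):

```python
import copy

class AtomicObject:
    """Holds a value whose updates are tentative until committed."""
    def __init__(self, value):
        self.value = value

class Action:
    """Records undo state for every atomic object it modifies."""
    def __init__(self):
        self._undo = {}

    def write(self, obj, new_value):
        if id(obj) not in self._undo:
            self._undo[id(obj)] = (obj, copy.deepcopy(obj.value))
        obj.value = new_value

    def abort(self):
        for obj, old in self._undo.values():
            obj.value = old          # as if the action had never begun
        self._undo.clear()

    def commit(self):
        self._undo.clear()           # new states become permanent

m = AtomicObject([1, 2, 3])
a = Action()
a.write(m, [1, 2])                   # EP tentatively compacts m
a.abort()                            # m is restored to [1, 2, 3]
restored = copy.deepcopy(m.value)
b = Action()
b.write(m, [1])
b.commit()                           # m now permanently holds [1]
```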
This provides a tool for implementing exception processes, with due regard for process synchronization and serializability. To conclude: for efficient implementation of exception handlers as processes, efficient tools for mutual exclusion are needed, such as those provided by hardware test-and-set or by software atomic actions.

CHAPTER 5

Programs Illustrating the Model

This chapter describes two programs which illustrate the use of the model. First, two structures used for exception handling in multi-process implementations of the X.25 and X.29 protocols in Verex [Lockhart 79] are described. Both these protocols are hierarchic and layered. The X.25 protocol [CCITT 76] is a multi-layer protocol for interfacing computer equipment to a packet-switched network. Each X.25 protocol layer (the physical, link level or packet level) can be described as a finite state machine (FSM) event manager making appropriate state transitions on receipt of packets or timeouts [Bochmann 78]. The CCITT Virtual Terminal Protocol X.29 [CCG 78], here referred to as Datapac's Interactive Terminal Interface (ITI) protocol, establishes a data structure that is viewed as representing the terminal. A remote terminal user may therefore interact with a host-side application program over the Datapac network via the host-side X.25 and ITI protocols. These protocols were studied because, although each protocol layer is implemented as a collection of cooperating processes, the protocols are structured to reflect differences in the type of exceptional situations that may occur. The author implemented the ITI protocol on the Verex operating system [Atkins 83b], following the design principles established in the model. Deering's implementation of the X.25 protocol [Deering 82] is structured to reflect the logical flow of events. Within each of the lower protocol layers of X.25, one process takes the form of a Finite State Machine server.
This process provides all the necessary synchronisation of transmitting and receiving, leading to efficient handling of all transitions. Thus in these lower-level protocols, where it is difficult to identify normal transitions because the probabilities of the events are not known, there are no special exception cases; all events are treated in a similar way. The resulting software is produced quickly and models the protocol descriptions closely, contributing to its understandability and maintainability. For higher-level protocols such as the X.29 protocol, exception events can be identified which occur with low probability. The author has structured the ITI protocol so that transmitting and receiving are performed by two independent filter processes, and exception handling is performed by an auxiliary exception process which is invoked as needed [Atkins 83b]. This implementation of ITI illustrates designing a program to achieve all three objectives of the model (see Figure 3.3).

(1) Objective A, decreasing the average run-time, is achieved by reducing the normal-case handling time (cost b of Level 2 of Figure 3.3). An alternate process configuration is used to reduce inter-process communication for the normal events (read/write requests). This is shown as feature 2 of Level 3 of Figure 3.4.

(2) Objective B, decreasing the exception processing time, is achieved for the <BRK> exception event by reducing the inter-process exception notification time. This is shown as cost b of Figure 3.5.

(3) Objective C, increasing the program's functionality in a modular way, is achieved by structuring the program to use an ignorable exception notification from the server (X.25) to the client (ITI), for network reconnection. This design method is shown in Figure 3.6.
This implementation of ITI as a filter is compared with the multi-process server approach used for X.25, and it is shown that the filter structure matches the functionality of the ITI protocol and results in efficient software.

In Section 5.2, a mechanism for implementing nested atomic transactions to perform atomic data base updates is described. The need for such a mechanism has already been discussed in the previous Chapter, in Section 4.3 concerning mutual exclusion mechanisms. The author's implementation illustrates objective A of the model, reducing the average run-time, shown in Figure 3.4. This is achieved by reducing both cost b (the normal-case handling cost) and cost c (the context-saving cost) of Level 2 of the model. To reduce the latter context-saving cost incurred while handling normal events, the approach is to assume that transactions COMMIT by default, and therefore to treat an ABORT as an exceptional event. Context-saving is thus kept to a minimum while processing normal events, but at the added expense of exception handling when an exception occurs. Another technique to reduce the normal-case handling costs is to use broadcast exceptions to reduce the network traffic over a Local Area Network (feature 3 of Level 3 of the model in Figure 3.4). A hybrid scheme of reliable-unreliable broadcast inter-process communication is used to achieve this. This design gives efficient performance and is easy to program.

5.1. Exceptions and Protocol Implementations -- a New Approach

5.1.1. The ITI Protocol

ITI acts mainly as an i/o filter, transmitting read/write requests from its client (called the application, APPL) through to X.25.
Now if ITI is structured as a finite state machine server like the X.25 server, then the ITIMain process requires two worker processes, a reader (ITIReader) and a writer (ITIWriter), to handle delays in reading from and writing to X25Main, as shown in Figure 5.1.

Figure 5.1  ITI Implementation as a Server

The author has analysed such an implementation, and finds that each read and write request from the client involves 6 more process switches at the ITI layer -- from client to ITIMain to ITIReader to X25Main and back again. Thus for a normal-case read from the network, there is a total of 7 Send-Receive-Reply messages (including the processing within X.25), or 14 context switches. On the TI990 computer, each Send-Receive-Reply sequence takes 3 msecs (though on faster machines with hardware support for context switching it takes about 1 msec). Therefore it takes at least 21 msecs to process the context switches for each read request. At full speed on a 9600-baud Datapac link, 1200 bytes/sec are received, in data packets of 128 or 256 bytes. Thus up to 10 packets must be processed per second; the context switches for this take 210 msecs. If the machine is dedicated to the task, it can easily keep up with the data arriving. Then the argument for readability and understandability of multi-process structuring compensates for the context-switching overhead. However, with three X.25 virtual circuits going at full speed, or with one other user on the system, the computer could hardly keep up. Following the model, it was noted that a large number of context switches were being made for normal-case processing, so the author decided to restructure the ITI protocol to reflect its major function, that of an i/o filter (rather than a general-purpose server). By using a different process configuration, normal-case read/write requests can be handled more efficiently.
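The context-switch arithmetic above can be restated directly (all numbers taken from the text):

```python
# Per-read overhead of the server-structured ITI on the TI990:
srr_per_read = 7                 # Send-Receive-Reply exchanges per read
msec_per_srr = 3                 # one SRR sequence on the TI990
overhead_per_read = srr_per_read * msec_per_srr         # 21 msec

# At 9600 baud, up to 10 data packets arrive per second:
packets_per_sec = 10
overhead_per_sec = packets_per_sec * overhead_per_read  # 210 msec of cpu

# With three virtual circuits at full speed, the message switching alone
# would consume 630 msec of every second: the machine can hardly keep up.
three_circuits = 3 * overhead_per_sec
```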
But some consideration must also be given to handling the exceptional situations which might occur, such as the network failing or a user hitting <BRK>. The X.25-ITI interface must be designed so that inter-process exceptions can be communicated efficiently and conveniently. First, an analysis of the behaviour of the protocol on the <BRK> exception was made.

5.1.2. The Interrupt Protocol

Since their inception, communication protocols have employed asynchronous interrupt signals, but typically require that interrupts be synchronised with the parallel data channel. The X.25 protocol specifies a complex sequence of data transfers over the network to handle the interrupt exception signal <BRK>, which occurs when a remote user presses the interrupt key on his terminal. A total of 5 packets must be exchanged across the network through the link layer of X.25 on a <BRK> signal. These are shown below in Figure 5.2. To transmit an interrupt caused by a remote user pressing <BRK>, an interrupt expedited data (INT) packet is sent by the remote X.25's client, CLREMOTE, which must eventually be acknowledged by an interrupt confirm (INT-CONF) packet from the local X.25's client, CLOCAL. Subsequent interrupts must be refused by CLREMOTE until the acknowledgement arrives. The INT packet may overtake ordinary data packets in transit.

CLREMOTE and Remote X.25                 Local X.25 and CLOCAL

<BRK> from remote user
1. INT                      =======>     CLOCAL notifies its client
                                         of the interrupt
2. Q-DATA pkt (IND-OF-BRK)  =======>     CLOCAL can now confirm,
                                         after restoring output
3. Q-DATA pkt (reset param) <=======     CLREMOTE can now restore output
4. INT-CONF                 <=======     CLREMOTE can now accept
                                         more interrupts
Q - D A T A pkt (param reset done) = = = = = = = > E n d of interrupt handling Figure 5.2 Network Packets to be Exchanged on an Interrupt T h u s C L R E M O T E also inserts a Q-DATA packet wi th the code I N D I C A T I O N - O F - B R E A K ( I N D - O F - B R K ) into the data stream at the point of the < B R K > . C L R E M O T E also has an option to discard all subsequent output, which it may exert after a < B R K > . W h e n C L O C A L receives the I N T packet from X 2 5 M a i n , it takes appropriate action to notify its client that an interrupt has occurred ( possibly by destroying the client). T h e n 129 C L O C A L must read data packets until the I N D - O F - B R K packet. C o m m o n l y , packets passed-over in flight are discarded at the receiver. O f course, if the remote host was output-ting, or was quiescent at the time of the < B R K > , there will be only the I N D - O F - B R K packet to be read. W h e n C L O C A L receives the I N D - O F - B R K , it must send a Q-DATA packet back with a parameter set to a code to tell C L R E M O T E to stop discarding output (if it was), and then the I N T - C O N F packet can be sent. A s the last stage in handling an interrupt, C L R E M O T E sends the Q-DATA packet back to C L O C A L , with the parameters reset to indicate that out-put has been resumed. 5.1.3. The ITI Implementation in a Thoth-like Operating System W e observe that although the network can fail at any time, it is not necessary for the higher levels to be informed of this event immediately (i.e. asynchronously). Instead, the application may continue to execute normally, unaware of any disruption to the i / o service until a read or write operation t o / f r o m the network is requested. W h e n this occurs, the ITI reader process may receive, synchronously, the exceptional information concerning the failure of the underlying network layer. 
This synchronous scheme cannot however be used to communicate a <BRK> interrupt from the remote user, as notification cannot always wait until the user requests an i/o operation. For example, if a user initiates a long compilation, then notices an error and presses <BRK>, a response is required within a short finite period, say half a second. In such a situation, the client is deaf to the server; thus another solution must be found. We have already discussed tools for such inter-process exception notification, such as process destruction (Section 2.3.2.3) and the toggle switch (Section 4.1.1). In order to choose between these alternatives, an analysis of their behaviour on the <BRK> exception was made.

5.1.3.1. Notification of Inter-process Exception Using Process Destruction

The process switches involved in processing an interrupt using process destruction are shown below in Figure 5.3 [1]. Each inter-process SendReceiveReply message exchange causes 2 process switches. The number of SendReceiveReply messages to handle each packet is given in column 1 of Table 5.1 below.

   Network                   Local X25 and ITI
   1. INT         ======>  ... X25Main -> destroys ITI's victim;
                               ITI's vulture => ITIMain -> recreates Timer;
                               X25Main <= ITIMain re-registers its timer victim
   2. IND-OF-BRK  ======>  ... X25Main => ITIReader => ITIMain
   3. Q-DATA      <======  ... X25Main <= ITIWriter <= ITIMain
   4. INT-CONF    <======  ... X25Main <= ITIWriter <= ITIMain
      Subsequent interrupts can now be accepted by CLREMOTE
   5. Q-DATA      ======>  ... X25Main => ITIReader => ITIMain

   where ... = additional process switches within X.25 to handle each packet

Figure 5.3 Process Switches to Handle an Interrupt Using Process Destruction

[1] assuming that the ITI protocol is structured as a server with two workers to handle delays in communication with X.25.
Thus a total of 20 + nX.25p = 33 SendReceiveReply message exchanges [2] are needed to handle the <BRK> exception, which takes at least 99 msecs. It takes even longer for ITI's client to make its response (such as a message <IO-BRK>). When this system was implemented, timesharing with another user, we observed response times of between 1-4 seconds for the <BRK> to be handled. This poor performance arose because the other user could obtain a full cpu quantum between each context switch.

5.1.3.2. Notification of Inter-process Exception Using a Toggle Switch

Here, ITI puts itself in the position of being always ready to receive a message from the lower level, regardless of whether or not the application has a read request outstanding. Thus the client is never deaf to the server below. The author has implemented this by adding another state variable to both the ITI and X.25 servers to act as a toggle switch to exert control on the movement of data from the lower level X.25 to the higher level ITI, as described in Section 4.1.1. When the switch is on, X.25 returns both data and interrupts; when off, X.25 returns <BRK> signals only, and X.25 queues normal data as before until the client above makes another read request. Unfortunately, this scheme incurs the overhead of 4 extra process switches on each read request -- 2 to set the switch on before reading data and 2 to set it off afterwards.

Table 5.1 Number of SendReceiveReply Messages on <BRK>

   packet no.   ITI as a server:          ITI as an i/o filter:
                no. of context switches   no. of context switches
   1            9*2                       2*2
   2            3*2                       3*2
   3            2*2                       3*2
   4            3*2                       2*2
   5            3*2                       3*2

   Total number of
   context switches        20*2                      13*2
   Total number of
   SendReceiveReply
   exchanges               20                        13

[2] nX.25p is the number of additional process switches within X.25 necessary to handle every packet, = 3 for inbound packets and 2 for outbound packets.
This overhead can be reduced to 2 extra process switches for every normal-case read by setting a default for X.25 to set the switch off automatically after satisfying every read request. Then the ITIMain process controls the switch setting so that when there is an application-level read request, it requests data from X.25 by setting the switch on. This default however is not the correct one for application programs such as file readers, which always have an outstanding read to the level below, and which therefore are not deaf to the server. Such programs do not need the toggle switch, which should always be on. The default should enable such programs to run with no normal-case overhead. These two aims are incompatible. Thus, if the ITI overhead is reduced to 2 process switches by using a default of off after every read has been satisfied, then every other client of X.25 must be aware of this default, and must set the switch on after every read request. This was considered to be an unacceptable solution, so the author implemented the former design, which increases the context switches on a normal-case read from 14 to 18, a significant amount. At full speed (9600 baud) over the Datapac network, the computer could no longer keep up with one other user using cpu cycles.

5.1.3.3. ITI as a Filter and an Exception Process

Following the model, the author decided to use an inter-process exception notification from X.25 to a separate exception handler process at the ITI layer above, for notifying exceptional asynchronous events encountered by ITI from the remote client. When the server detects an exception, it checks to see if an exception handler process, EH, exists, and is ready. If so, the server makes a Reply to the EH, renders the EH unready, and continues processing exactly as before.
Thus from the server's point of view, it exports an exception merely by making a non-blocking Reply to a registered, ready EH, and then it unreadies the EH. If there is a registered EH which is unready, the server queues the Reply message until the EH makes a Send to indicate it is ready again. This approach is taken so that if there is no exception process ready, the server only has to queue up a Reply message with minimal information. If the queue becomes full, the server can drop subsequent exception replies. The server therefore needs to make no assumptions about the EH at the higher level, thus preserving modularity requirements. The author observed that the exception process could also aid in structuring ITI as a filter, by allowing the exception process to take control as necessary on a <BRK>, and initiate further read requests (for the IND-OF-BRK packet) and write requests (for the Q-DATA and INT-CONF packets) until the exceptional situation has been dealt with. EH notifies X25Main with a Send when it is ready again to handle further exceptions. Exceptional events occurring across the X.25-ITI interface during this processing are queued up in X.25 until there is a registered exception handler process ready to deal with them. ITI can now be structured as a reader filter (RFIL), a writer filter (WFIL), a timer and an exception process EH, as shown below in Figure 5.4, and the author successfully implemented ITI in this way. This structure reflects the major functionality of ITI, that of an i/o filter, better than the server model, and it is structured more efficiently for the normal case (no interrupts or network failures) by reducing the number of process switches by 2 on each read/write request, from 14 to 12. Here, the run-time overhead for normal cases has been reduced by a significant margin by replacing the conventional i/o server with two i/o filters and an exception process.
In this implementation, on <BRK>, the context switches are as shown in Figure 5.5. Recall that each inter-process SendReceiveReply message exchange causes 2 process switches. The number of SendReceiveReply messages to handle each network packet is now given in column 2 of Table 5.1.

Figure 5.4 ITI as an I/O Filter
   (legend: arrows show the direction of Send message requests)

   Network                   Local X25 and ITI
   1. INT         ======>  ... X25Main => EH
   2. IND-OF-BRK  ======>  ... X25Main => RFIL => EH
   3. Q-DATA      <======  ... X25Main <= WFIL <= EH
   4. INT-CONF    <======  ... X25Main <= WFIL <= EH
      Subsequent interrupts can now be accepted by remote client
   5. Q-DATA      ======>  ... X25Main => RFIL

   where ... = additional process switches within X.25 to handle a packet

Figure 5.5 Process Switches on <BRK> with a Separate Exception Process

Thus a total of 13 + nX.25p = 26 SendReceiveReply message exchanges [3] are used to handle the <BRK> exception, compared with 33 using process destruction with ITI as a server. When the system was timesharing with another user we observed response times of around 1 second for the <BRK> to be handled. Thus the major effect of reducing the context switches lies not only in the actual context-switch time (a difference of just 7 x 3 = 21 msecs), but in the reduction of cpu time slices made available to other users during <BRK> processing.

[3] remember nX.25p is the number of additional process switches within X.25 necessary to handle every packet, = 3 for inbound packets and 2 for outbound packets.

5.1.4.
Problems with Synchronisation

The exception handling on <BRK> is non-atomic in that intermediate states during exception handling are visible; the exception handling involves several inter-process message exchanges which take a finite time, during which other processes may observe and interfere with the only partially completed handling. This causes problems in synchronisation. There are at least 2 alternatives for EH to coordinate the 5 packet exchanges on <BRK> previously mentioned. We could either allow the EH to Send directly to X.25 to read and write the packets, or we could make EH use the services of RFIL and WFIL. The author took the latter approach, as either RFIL or WFIL would probably be interacting with X25Main at the time of the exception, and EH would need to coordinate their actions anyway until the exception handling was completed. If EH were allowed to Send to X25Main directly for the packet exchange, RFIL and WFIL could service concurrently (and incorrectly) other requests from their clients. The concurrency during exception handling causes complications. If the <BRK> occurs during a quiescent period (while X25Main had a deaf client) then the synchronisation appears to be straightforward, as RFIL and WFIL are waiting on a Receive for i/o instructions. The following stages should occur during exception handling:

(1) EH is readied by X25Main executing a Reply to EH.
(2) EH makes a Send to RFIL for the INT packet.
(3) RFIL makes a Send to X25Main .... and eventually a Reply to EH.
(4) EH makes a Send to RFIL for the IND-OF-BRK packet.
(5) RFIL makes a Send to X25Main .... and eventually a Reply to EH.
(6) EH makes a Send to WFIL to dispatch the Q-DATA packet.
(7) WFIL makes a Send to X25Main .... and eventually a Reply to EH.
(8) EH makes a Send to WFIL to dispatch the INT-CONF packet.
(9) WFIL makes a Send to X25Main ....
and eventually a Reply to EH.
(10) EH considers the job done, and makes a Send to X25Main, indicating it is ready to handle another exception.

However, even in this simple scheme complications arise if ITI's application program (APPL) intervenes with a R/W request during these non-atomic message exchanges. For example, if APPL executes a read request to RFIL during stage 3 above, then after RFIL has made its Reply with the INT packet to EH, RFIL executes Receive again, without stopping, and at this time only the message from APPL is in RFIL's message queue, as EH has not yet had a chance to execute since being unblocked by RFIL at the end of stage 3. So RFIL will accept APPL's read request and will (eventually) return the IND-OF-BRK packet to its bewildered client APPL instead of to EH. Thus after a <BRK> exception we need to be able to delay any further interaction of APPL with the filters until the exception handling is finished. This can be achieved in several ways; the easiest is by setting global flags between EH and the filters, as they are all on the same team and share memory. But even with global flags, there are situations which are not easily resolvable without the ability to give EH the highest priority on the team. This can arise, for example, when there is already an outstanding read request from APPL when the exception occurs. The server, X25Main, unblocks EH with a Reply, and also, without stopping, unblocks RFIL with a Reply with its data -- the INT packet. Both EH and RFIL are ready to execute, and if RFIL executes first, it can pass the INT packet to APPL as data; APPL can swallow it, and make another read request of RFIL. RFIL can execute next and Send again to X25Main, which makes a Reply with the IND-OF-BRK packet. RFIL could then return the IND-OF-BRK packet to APPL, before EH even begins to execute.
If we cannot control process priorities, then we must make RFIL aware that an INT packet is special, and that subsequent data packets for exception processing must be handled by EH. Such an approach violates modularity, and is very difficult to program correctly. Whereas if it is possible to specify that EH executes first when more than one process on its team is ready, then the problem can be ameliorated. EH executes first, and makes a read request to RFIL to pick up the IND-OF-BRK packet. Meanwhile, RFIL is unblocked with the INT data packet, which it gives to APPL as before. Now RFIL can accept EH's read request for data, and pass the IND-OF-BRK packet correctly to EH. But another complication arises if both EH and APPL make read requests to RFIL while it is busy fetching data. Unless RFIL accepts any messages from EH in priority, then again RFIL could send the IND-OF-BRK packet to APPL. Recall from Section 4.3 that in Verex a process on a team executes atomically with respect to other processes on the team, till it voluntarily relinquishes the processor, and that a higher-priority process always executes before a lower-priority one on the same team, even through involuntary time-slice and page-fault deschedules. Both process priority and message priority are available in Verex, so the author was able to achieve this complex synchronisation without altering RFIL. A major problem in writing an exception handler as a separate concurrent process arises in synchronisation. Although the technique allows clean separation of normal-case code, and is conceptually modular and well-structured, if non-atomic action is needed during exception handling, special features are needed to write clear, correct programs. Three such features are global (shared) memory, process priority, and message priority for handling the exception process's messages.
This latter point was also recognised by the implementors of RIG; emergency messages run at a higher priority than regular messages. In general, we need to be able to specify a partial ordering on events, so that the client is not serviced during the exception processing. Various mechanisms for synchronization in concurrent programming have been succinctly described in [Andrews 83]. The article states that ensuring the invisibility of inconsistent states during an atomic action is an active research area. Some of the most promising results for our problem have been obtained by Andrews in his language SR (for Synchronising Resources) [Andrews 82], and by Liskov's atomic actions, described in the next Section 5.2. The exception handling could be readily specified in SR by using global Boolean flags such as INTERRUPT-PENDING and READ-REQUEST-PENDING, plus scheduling and synchronizing constraints. Use of global flags, although unstructured, does lead to some easier proofs of program correctness than with message-passing alone, as Schlichting and Schneider have shown [Schlichting 82]. Thus we have shown that the use of a separate exception process to handle asynchronous exceptions in the Thoth message-passing environment is both practical and applicable. To show that there are situations exploiting a separate exception process which are relatively easy to program, we describe in the next Section how an exception process is used to handle another exception: the end-of-file exception which occurs when the virtual circuit is arbitrarily and abruptly terminated.

5.1.5. Recovery after Network Failure

As described in Section 3.4.2.1, we wish to provide an ITI implementation and session-layer service which would hide a failure in the underlying network from the application layer until either a timeout fired, or the same user logged in again.
In the latter case, the application program will be available to the user just as if no disconnection had occurred (except for a brief RECONNECTED message). The major problem with providing reconnection is that the state of the suspended protocol must be saved; yet in order to allow the remote user to log in again for reconnection, a working version of the protocol must be available. The old and new virtual circuits must then be merged after a successful login. The author observed that the exceptional event of failure of the underlying network could also be efficiently handled using an exception handler process. As explained in Section 5.1.2, this end-of-file (EOF) failure is communicated synchronously to ITI on its next read or write request to X.25. The Reply contains the EOF information. The exception process is used as follows. There is one exception handler process, EH, for each X.25 virtual circuit. If a network failure on a virtual circuit is recorded, the EH associated with the failed virtual circuit is notified immediately by a Reply from the server below. EH tears down the virtual circuit and suspends itself by executing a Send request to the process which would be invoked to handle a reconnection. If a timeout occurs before the remote user has reconnected, the timer destroys the suspended session, causing the user to be logged out. If the user attempts to reconnect via a new X.25 virtual circuit within the timeout period, another EH process is created for this new circuit. The reconnection sequence involves a check of the remote user's status, and if it is suspended, the old EH associated with the torn-down virtual circuit is unblocked with a Reply. The old EH links its session to the new X.25 virtual circuit, and wakes up the reader and the writer, completing reconnection before destroying the new EH.
The reader and writer then continue exactly as if the underlying network had not failed, replying with new data to the suspended application. This application was very straightforward to implement and had no synchronisation problems. The author made one minor change to RFIL so that, on receiving an EOF Reply from the server below, RFIL checked if an exception handler process was ready to handle the EOF. If so, RFIL voluntarily blocked, instead of immediately terminating the virtual circuit by committing suicide. This reconnection feature provides a very useful addition to the remote login service, and has been measured as being invoked by 5% of remote login sessions.

5.1.6. Summary

Two mechanisms for handling exceptions in multi-process protocol implementations have been described. The first mechanism reflects the use of combining the role of a server with a finite state machine. This structure is particularly efficient for the lower-level protocols such as X.25, where it is difficult or impossible to define a normal flow of control. For higher-level protocols, where exceptional situations occur less frequently, it is more efficient to structure according to the normal flow of control. The server model is not so efficient in these situations -- instead, the protocol is structured as an i/o filter with a special exception handler process. This structure has the following advantages over the server structure:

(1) it reduces overhead in the normal case by reducing the number of process switches on each read/write request.
(2) the normal-case code is separated from the exception code, leading to good modular programming, so that, for example, the client and server can be debugged without any exception handling, as the normal-case code is completely separated from the exception-handling code.
(3) it provides a clean mechanism for asynchronous server-client notification.
(4) it provides a clean queueing mechanism for the exceptions; they are not lost.
(5) the increased concurrency allows exception handling to be done in parallel where that is possible (though not in this example).
(6) it can be generalised to other operating systems.

The disadvantage is that the synchronisation of the exception process with other processes may be difficult -- the increased concurrency may be hard to handle. The author implemented two program paradigms for non-standard exception notification from server to client for synchronous Thoth-like operating systems:

(1) using a toggle switch for rendering deaf clients able to hear.
(2) using an exception process which registers with the server below.

The advantage of the latter approach in a hierarchy of servers is that the client (itself a server) can be restructured as an i/o filter, with a corresponding increase in performance for normal-case event processing, whereas the former toggle-switch solution actually increases the run-time for normal-case handling in this example, although it can be used to advantage when the client actually wants to be deaf to a device for an extended period, as described in Section 4.1.1.

5.2. A Nested Transaction Mechanism for Atomic Data Updates

5.2.1. Introduction

Mechanisms for achieving mutual exclusion have already been discussed in Section 4.3; one approach is to declare atomic data which is operated on by atomic transactions.
An atomic transaction, as defined by Mueller [Mueller 83], is a computation consisting of a collection of operations which take place indivisibly in the presence of both failures and concurrent computations. Atomic transactions differ from non-atomic transactions in that they are apparently indivisible -- no intermediate states are observable by another transaction. They are also recoverable, in that the effect is all-or-nothing: either all the operations prevail (the atomic action commits), or none of them do (the atomic transaction aborts), so that all the modified objects are restored to their initial state. There is much current literature on atomic transactions and, more recently, on nested transactions [Liskov 83], [Mueller 83], [Moss 81]. Nested transactions provide a hierarchy of atomic transactions, useful for composing activities in a modular fashion. A transaction which is invoked from within a transaction, called a subaction, appears atomic to its caller. Subactions can commit and abort independently, and a subaction can abort without forcing its parent to abort. However, the commit of a subaction is conditional: even if a subaction commits, its actions may be subsequently aborted by its parent or another ancestor. Further, objects modified by subactions are written to stable storage only when the top-level actions commit. Thus a nested transaction mechanism must provide proper synchronisation and recovery for subactions.

The implementation of such a mechanism is discussed here, because the implementation is based on the general model for designing software. Liskov's ARGUS project [Liskov 83] has been taken as the specification for nested transactions. In ARGUS, Guardians and Actions are provided as linguistic support for robust distributed programs.
The ARGUS project was conceived as an integrated programming language and operating system, but we show that the language can be conveniently and efficiently supported by other distributed operating systems which provide certain features. We describe a possible implementation of a nested transaction mechanism, structured to exploit exceptions, for distributed operating system kernels such as the V-system [Cheriton 83]. We compare this with the conventional approach to implementing nested transactions as exemplified in the LOCUS Distributed UNIX project, and show how structuring the program (in this case, the nested transaction mechanism) to exploit the exceptions improves the efficiency.

5.2.2. Overview of the ARGUS Model

5.2.2.1. Atomic Objects

ARGUS provides for atomic abstract data types linguistically declared as stable objects. Only activities which have read or write access to such atomic objects are treated as atomic activities. The implementation of atomic objects is based on simple read and write locks with the usual constraints on concurrent access:

(1) multiple readers are allowed, but readers exclude writers.
(2) a writer excludes readers and other writers.

When a write lock is obtained, a version of the object is made and the action operates on this version. If the action commits, the version is retained; otherwise it is lost. All locks are held till completion of the top-level action, and the commit of a top-level action is irrevocable.

5.2.2.2. Nested Actions

To keep the locking rules simple in nested actions, a parent cannot run concurrently with its children. The rules for read and write locks are extended so that

(1) an action can obtain a read lock if every action holding a write lock is an ancestor.
(2) an action may obtain a write lock provided every action holding a read or write lock on that object is an ancestor.

When a subaction aborts, its locks are discarded.
Liskov also states that when a subaction commits, its locks are inherited by its parent. We will show that this is not necessary for correct operation of the mechanism, and on this point our model differs from Liskov's.

5.2.2.3. Guardians

In ARGUS, a distributed program is composed of one or more guardians. A guardian provides access to atomic objects through a set of handler operations, much like a file server provides access to files. We use the word guardian alone to refer to the location where the atomic object is kept -- equivalent to the Transaction Scheduling Site (TSS) of LOCUS [Mueller 83]. In contrast, guardian handlers refer to subactions, as described below. Liskov states that when a handler is invoked, the following steps occur:

(1) A new subaction, SP1, of the calling action is created.
(2) A message containing the arguments is constructed.
(3) The system suspends the calling process and sends a message to the target guardian. If that guardian no longer exists, subaction SP1 aborts and the call terminates with a failure exception.
(4) The system makes a reasonable attempt to deliver the message, but the call may fail.
(5) The system creates a process and a subaction SP2 (of subaction SP1) at the receiving guardian to execute the handler. Note that multiple instances of the same handler may execute simultaneously. The system takes care of locks and versions of atomic objects used by the handler in the proper manner, according to whether the handler commits or aborts.
(6) When the handler terminates, a message containing the results is constructed, the handler action terminates, the handler process SP2 is destroyed, and the message is sent to the caller SP1. If the message cannot be sent, the subaction SP1 aborts, and the call terminates with a failure exception.
(7) The calling process continues execution.
A guardian makes these resources available to its users by providing a set of operations, called handlers, which can be called by other guardians to make use of these resources. Given a variable x naming a guardian object, a handler h of the guardian may be referenced as x.h. A handler can terminate in either a normal or an exceptional condition. A handler executes as a subaction, so it must either commit or abort -- either a normal exit or an exceptional exit may be committed or aborted. If the guardian crashes, all the active handler calls are aborted, and its atomic data must be retrievable from stable storage. Liskov's nested transactions are similar to the nested transaction model of LOCUS; however, in LOCUS, the only atomic data supported is of type file. To aid in comparison of our approach with that of Liskov and LOCUS, we use atomic data only of type file in our examples.

5.2.3. Implementation Considerations

5.2.3.1. Introduction

In the ARGUS model of nested transactions, and in the LOCUS implementation, the relationship between client and guardian is not just that of a one-off request for data -- instead, close links are retained so that the system can notify guardians immediately a lock-holder commits or aborts. Parents inherit their children's read/write locks on termination -- the list of such locks is called the participant file list. Their implementations are structured so that if a subaction is successful it must explicitly execute a commit at termination. This causes COMMIT messages to be sent to every participant. Thus each action must maintain a list of participants. The list of participants locked by the committing child is then transferred to its parent. The parent transaction thus becomes aware of its children's deeds, which violates modularity concepts, both in the language ARGUS and in the supporting operating system. We claim to preserve these boundaries in our implementation.
Further, their implementations are structured for functionality rather than efficiency, as there is a high overhead in the case of no aborts and no R/W conflicts. This is counter to the principles for exception handling, and our implementation, which follows the model, is considerably more efficient. It is assumed that

(1) subactions commit more frequently than they abort.
(2) all atomic data is frequently available for R/W access; conflicting R/W requests are rare.

The principle followed in our implementation is to structure the system to minimise the run-time by reducing the inter-process communication in the statistically dominant case, and also to reduce the context saved in the normal case, by maintaining the minimum context necessary to achieve correctness. Thus when exceptions do occur, the exception handling is costly. However, by making the normal case more efficient we reduce the chance of conflicting data accesses, and this in turn may further reduce the number of aborts. We now describe the minimum context necessary to achieve correctness, before presenting an overview of the features of our implementation. We then make a comparison of the efficiency of our implementation with others.

5.2.3.2. Requirements for Correct Implementation

To handle aborts, each guardian needs the ability to detect whether any of its atomic data has been updated by the aborted process, and if so, to restore a valid version of the atomic data, prior to that changed by the aborted process. To handle R/W conflicts, each guardian needs the ability to detect whether its atomic data are free, and if not, whether any write-lock owners are not ancestors of the requesting action. To handle aborts correctly, each guardian of atomic data needs to keep a stack of versions and a list of actions holding each lock. For conflict resolution, each guardian needs to keep a current version and a list of actions and their ancestors holding each R/W lock.
We assume that each transaction is uniquely identified in the network by its transaction unique identifier (Tid). We assume also that it is possible to determine from a transaction's Tid both the home site of the transaction (the site where the transaction begins executing) and the Tids of all the transaction's superiors. We thus encode the transaction's ancestors in its identifier, and can merge the version list with the ancestor list, so that each guardian needs just one volatile data structure, called a t-lock (after LOCUS), to hold the locking and recovery information for each atomic object. The t-lock structure is as follows:

Tlock = Struct[ReadRetainers, Top]
ReadRetainers = List[Tid]
Top = Struct[Vstack]
Vstack = Stack[version, Tid]

Another requirement for correct operation is the existence of a communication facility which can be used by the guardians to receive ABORT and TOP-COMMIT messages reliably. We discuss mechanisms for achieving this in Section 5.2.4. However, we show in our implementation that guardians need not be immediately informed of subaction commits for correct operation, provided the guardian can determine whether a transaction holding a R/W lock is still alive or not. So although the guardian's version-owner may be out-of-date (this can occur when the committed subtransaction T dies), the guardian can still operate correctly, in that it does not use out-of-date versions. This is achieved because the guardian keeps the current version on the top of the version stack, together with what it thinks is the current owner. The version stack is updated whenever there is need to resolve a potential R/W conflict, with the youngest living ancestor of T replacing T in any owner's list. Alternatively, the guardian can use a background process to update its t-lock periodically.

5.2.4.
Features of the Implementation

A transaction group and a transaction group leader (Gpld) are used to implement nested transactions correctly and efficiently. Each new top-level transaction creates a new process, called the group leader, and all its subtransactions use the same Gpld. Any guardian of atomic data accessed by any transaction notifies its group leader, so as to receive ABORT and TOP-COMMIT messages. Such messages are sent by transactions to the group leader, who in turn reliably notifies all the guardians in that group. The group leader may use a simple 1:1 scheme for notification, or alternatively may use a multicast or broadcast feature. We digress briefly to describe such a mechanism which can be used advantageously. An unreliable multicast feature is available in the distributed V-kernel [Cheriton 84]. The multicast function is achieved through a Group-id. A process which subscribes to the Group-id will receive messages sent to it. No group membership list is maintained, and so members of a group are only loosely coupled, and messages are not reliably delivered. We show that this may be extended simply to a reliable multicast which fulfills the requirements. The algorithms used by the group leader, and their performance, are discussed fully in Sections 5.2.4.7-5.2.4.9.

The implementation is structured so that parents do not inherit their committed children's locks automatically. Indeed, when a child commits, which it can do only on its death, the waiting parent is merely informed of that fact, and the participating guardians are not informed of the child's death. However, if a subsequent attempt is made to access such a guardian's atomic data, the guardian must then be able to detect the status of its lock-holders, and update its t-lock data accordingly.

To handle the commit of a top-level transaction, the action must send a TOP-COMMIT message to the group leader.
All guardians to which the transaction group has, at any time, held a lock, then receive the TOP-COMMIT message. Each guardian must then commit to stable storage any current versions held by that transaction group. In the event of network partitioning, the problem is the same as that discussed in the LOCUS paper, and no top-level transaction can commit. We provide a user override for these cases, as any failed component can prevent commitment. Orphan transactions will not commit; they should eventually detect the decease of their top-level ancestor and should duly abort.

Many message-based operating systems provide the software tools necessary for efficient and correct implementation; namely that transactions may be implemented as processes, and that parent transactions may await replies from their children subactions. Processes should be cheap to create and destroy dynamically so that each invocation of a guardian handler can be efficiently implemented as a new process, as specified in ARGUS. Further, context switching between processes should not be too expensive. These criteria are all fulfilled by Thoth-like operating systems, and even UNIX 4.2 processes communicating over datagram sockets can be used for functional correctness, if not efficiency. A synchronous Send-Receive-Reply inter-process communication facility is needed -- again, this is available in the Thoth derivatives, and can be simulated with UNIX datagrams, as we show in our examples, details of which are now given.

5.2.4.1. Transaction Invocation

The calling process P may invoke another transaction T either as a nested top-level action (only if P is also top-level), or as a subaction, usually by making a handler call to a guardian, viz. g.h. The following algorithm is employed:

(1) At the site of P, a new process, pid P1, is created for the new transaction T. (If P1 is to be run at a remote site, an appropriate remote pid is generated.)
If the invoked transaction is a top-level transaction, it creates a new group leader, whose unique process-identifier is Gpld.

(2) P1 is added to the list of P's children.

(3) The complete ancestor list, and the Gpld, is passed to P1 and P1 is readied.

(4) The parent P awaits completion of P1 by executing a Receive(specific) from P1, or, if several children are initiated concurrently, the parent must arrange to be notified of each one's death and its status (COMMITTED or ABORTED).

5.2.4.2. Atomic Data Access

Any transaction T wishing to obtain R/W access to atomic data must perform the following algorithm. Note that this algorithm will usually be executed by the atomic data's guardian upon receipt of an OPEN request.

(1) If a t-lock for the data does not exist, one is created, and a top version stack entry is made from the state maintained in non-volatile storage.

(2) If the request is for Read and Write access, the request is denied if any other living transaction holding a lock is not an ancestor. If the request is granted, a new version stack entry is made for T containing a copy of the top of the version stack.

(3) Otherwise if the request is for Read access, the request is denied if any other transaction holding a write lock is not an ancestor of T. If the request is granted, an entry for T is added to ReadRetainers.

(4) For successful requests, the guardian checks to see if this transaction has a Gpld different from the Gpld of any other reader or writer. Note that all writers must have the same group leader, by the locking rules, but readers may be in different groups. If this is a new Gpld, the guardian notifies the group leader so it can receive ABORT and TOP-COMMIT messages.

5.2.4.3. Subaction Commit

No special action is taken. The parent receives information that the child died committed, and the parent continues normal execution.

5.2.4.4.
Top-level Commit

The committing top-level action, T, must send a TOP-COMMIT message to the group leader. The group leader executes an algorithm described below in Section 5.2.4.8 to send COMMIT messages to each guardian which has joined the group. All guardians to which any transaction of the group has, at any time, held a lock, receive the COMMIT message (eventually). In the case of network partitioning, not all guardians may immediately receive the COMMIT message. This problem may be solved as in the LOCUS description, and will not be considered further here. As parents cannot run concurrently with children, all children must by now be dead (or unable to receive the message). Each guardian must then commit any current versions held by that group. The group leader uses a standard 2-phase commit protocol such as described by Moss [Moss 81]. After writing data to stable storage and committing, the guardian removes the t-lock structure together with all the version lists.

5.2.4.5. Transaction Aborts

To handle a transaction ABORT, all the guardians in the aborting transaction's group must receive the ABORT message and the transaction-id P of the aborting process. The mechanism for the receipt of ABORT is via the group leader, as for TOP-COMMIT. The guardian then performs the following algorithm for each atomic data item.

(1) If P is a ReadRetainer of a version, or in the ancestor list of a ReadRetainer, P is removed from the list; go to step 4.

(2) Otherwise, the guardian checks, starting at the bottom of its version stack, whether P is a WriteRetainer, or in the ancestor list of a WriteRetainer of a version.

(3) If a match is found with a version, say v_i, the oldest version prior to v_i is restored, and the owner is set equal to P's ancestor.

(4) If there are no remaining R/W retainers, the t-lock is removed.
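Steps (1)-(4) can be sketched as follows. This is a hypothetical Python sketch (invented names), following the t-lock structure of Section 5.2.3: the version stack holds (version, owner-Tid) pairs, bottom first, with owners recorded as Tids that encode their ancestor chains.

```python
# Sketch (invented Python names) of the guardian's per-object ABORT handling.
# A Tid such as "T2-T1-T0" encodes the owner's chain of superiors.

class TLock:
    def __init__(self, stable_version):
        self.read_retainers = []                 # Tids holding read locks
        self.vstack = [(stable_version, None)]   # bottom..top; None = stable store

def ancestors(tid):
    parts = tid.split("-")
    return ["-".join(parts[i:]) for i in range(1, len(parts))]

def handle_abort(tlock, p):
    """Apply steps (1)-(4) for aborting transaction p; return True iff the
    t-lock can be removed."""
    # (1) Remove p, or any descendant of p, from the ReadRetainers.
    survivors = [t for t in tlock.read_retainers
                 if t != p and p not in ancestors(t)]
    if len(survivors) < len(tlock.read_retainers):
        tlock.read_retainers = survivors
    else:
        # (2) Scan bottom-up for the oldest version written by p or a descendant.
        for i, (_, owner) in enumerate(tlock.vstack):
            if owner is not None and (owner == p or p in ancestors(owner)):
                # (3) Restore the oldest version prior to v_i, recording p's
                # parent as its owner.
                version, _ = tlock.vstack[i - 1]
                anc = ancestors(p)
                tlock.vstack = tlock.vstack[:i - 1] + \
                    [(version, anc[0] if anc else None)]
                break
    # (4) With no remaining R/W retainers, the t-lock may be removed.
    has_writer = any(owner is not None for _, owner in tlock.vstack)
    return not tlock.read_retainers and not has_writer
```

Aborting T2 (Tid "T2-T1-T0") against a stack like that of the example in the text collapses everything above V1 and records T2's parent as the restored version's owner; the bottom-up scan is what guarantees the oldest of several versions written by T2 is the one restored.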
To clarify this, an example of the formation of a version stack with nested transactions, before and after a transaction aborts, is shown in Figure 5.6 below. Thus, ABORT causes the version stacks and the ReadRetainers and WriteRetainers to be updated.[4]

[4] Note that we must check to see if the owner held several versions -- if so, the oldest must be restored. Therefore the version stack must be checked from the bottom up for completeness, even though the most likely ABORT is the current version holder.

Version stack before T2 ABORTs:      Version stack after T2 ABORTs:
  version   owner-ancestors            version   owner-ancestors
  V6 (top)  T2--T1--T0
  V5        T4--T2--T1--T0
  V4        T2--T1--T0
  V3        T3--T2--T1--T0
  V2        T2--T1--T0
  V1        T0                         V1 (top)  T1
  V0 (bottom) stable store             V0 (bottom) stable store

Figure 5.6 Formation of a Version Stack with Nested Transactions

5.2.4.6. Transaction Read/Write Conflicts

To handle conflicts correctly, the guardian must check, for each atomic data access attempt, if there is a R/W lock on the data.

(1) If no R/W locks are held, access is permitted and a normal data access is performed as described above.

(2) If a R/W lock is held on the data, we search to see if the current version owner is an ancestor of P, by checking whether the owner is still alive. In doing so, the top of the version stack is updated. If there is a conflict, access is denied.

5.2.4.7. The Group Leader's Role

Each group leader maintains a list of participants by accepting information from a guardian whenever a transaction in a new group receives access to some of the guardian's atomic data (i.e. by gaining a R/W lock). The group leader also accepts TOP-COMMIT and ABORT messages from transactions in the group. The structure of this communication is shown in Figure 5.7 below. On receipt of an ABORT or TOP-COMMIT message[5] from a transaction, the group leader has to notify the participants in its list.
For this purpose, the GL uses a worker process on the same team, whose role is to provide two-way communication as described previously in Section 2.3.2. The worker has access to the participant list maintained by the GL, and exploits this fact to handle the TOP-COMMIT protocol, unaided by the GL, thus avoiding extra context switches and ipc between the worker and the GL.

Figure 5.7 Group Leader Communication Structure

[5] Note that ABORT messages must always take priority over TOP-COMMIT.

Only when the TOP-COMMIT is resolved does the worker re-Send to its GL. The worker thus acts like an exception process described earlier. There are no synchronisation problems between the worker and its GL because the TOP-COMMIT message should be the last from the group. Similarly, during the processing of ABORT the worker only re-Sends to its GL when the ABORT is completed. If more ABORTs for the group are received by the GL while its worker is processing an earlier one, the GL must save them up. To avoid multiplexing and possible delays, we specify one GL per transaction tree group. Usually it is created on the same host as the top-level transaction, although for efficiency in execution of the 2-phase COMMIT protocol the GL should be on the host which has most participants. Thus all group leaders execute the same code, first creating their worker to manage the 2-way communication between participants and the GL.

5.2.4.8. Group Leader's TOP-COMMIT Algorithms

The GL executes the following algorithm on receipt of a TOP-COMMIT message from the group's top-level transaction T.

(1) Reply TOP-COMMIT(T) to worker.

(2) Wait to receive next message.

The worker executes the following algorithm when there is no multicast or broadcast ipc mechanism.

(1) Send to GL for instructions.

(2) On receiving the Reply TOP-COMMIT(T), obtain the list of participants.

(3) For each participant, the worker must Send a PREPARE(T) message.
(4) If the participant replies ABORT(T), then the whole commit must fail, and all the transactions must be aborted; go to step 8. If the participant does not reply within a timeout period, the worker can either ABORT (go to step 8), or retransmit the PREPARE(T) message. If the participant responds PREPARED(T), then go to step 3 unless each participant has been notified.

(5) All the participants have responded PREPARED(T) at this point. Record in permanent memory a list of the participants, along with a notation that transaction T is now COMPLETING.

(6) For each participant, Send a COMMIT(T) message. The participant must respond COMMITTED(T). If there is a participant which has not responded after a timeout period, the message COMMIT(T) is retransmitted.

(7) When all participants have responded, erase the list of participants and associated memory from permanent memory. Reply COMMITTED(T) to GL. Done.

(8) Send the message ABORT(T) to each participant. When all participants have responded, reply ABORTED(T) to GL. Done.

The participants act as follows:

(1) If a PREPARE(T) message is received, and T is unknown at the participant (it never ran there, was locally aborted, or was wiped out by a crash), then respond ABORT(T). Otherwise, prepare the transaction locally as follows. Write the identity and the new state of any objects write-locked by T to the permanent memory. However, the old state of the objects is not overwritten -- the point is to have both the old and the new state recorded in permanent memory, so that the transaction can be locally committed (by installing the new states) or aborted (by forgetting the new states) on demand, regardless of crashes. This write to permanent memory must be performed as a single atomic update. When a transaction is prepared, its read locks may be freed immediately, but write locks must be held until the transaction is resolved. The participant must then reply PREPARED(T) to the GL.
(2) If an ABORT(T) message is received, then locally abort T. If T is PREPARED, erase from permanent memory the potential new states of objects modified by it (using a single atomic write), and then perform the usual version restoration associated with transaction abort.

(3) If a COMMIT(T) message is received, then T must have been locally prepared, and is either still prepared or already committed. If it is prepared, install the tentative new states of objects modified by T in both permanent and volatile memory, discarding the old states of these objects.[6] Finish committing the transaction by releasing all its locks. If T is no longer locally prepared, then it has already been committed, and no special action is required. In either case (T is locally prepared or not), respond COMMITTED(T) to the GL when done.

(4) If a transaction has been prepared for a long time and nothing has been heard from the GL, then ask it for the state of the transaction. The GL replies PREPARE(T) if it is still preparing, COMMIT(T) if it is committing. If the GL is not running, or there is no record of the transaction, then the node where the GL ran should respond ABORT(T).

[6] Again, ensure that the update is achieved atomically with a single write.

Moss argues [Moss 82] that this algorithm is correct and will always work eventually. If there is a crash at a participant after the first phase, the permanent memory contains the relevant information for recovery. If a transaction was not yet completed, then the crash will wipe it out, and the transaction will be ABORTED. Further scenarios are explained in Moss, but only the case of the coordinator failing may cause some participants to lock data forever. A manual override must be used in such cases -- the data is, however, correctly preserved on permanent storage.

Alternative algorithms may be used by the worker if a multicast or broadcast ipc mechanism is available, which reduces the ipc and message traffic still further.
(1) The GL broadcasts or multicasts (Bsends) a PREPARE(T) message, and waits till all the replies have been picked up.

(2) If the GL receives an ABORT reply, then go to step 7.

(3) If the GL receives all PREPARED(T) replies within the timeout period, it records in permanent memory a list of participants, along with a notation that transaction T is now COMPLETING. Otherwise it can retransmit PREPARE(T) messages on a 1:1 basis as in step 4 of the previous algorithm, and if the timeout still expires, the transaction must be aborted (go to step 7).

(4) All the participants have responded PREPARED(T) at this point. Now Bsend a COMMIT message.

(5) As step 6 before.

(6) As step 7 before.

(7) BSend an ABORT(T) message. When all participants have responded, reply ABORTED(T) to GL. Done.

5.2.4.9. Group Leader's ABORT Algorithms

The GL executes the following algorithm on receipt of an ABORT message from any transaction T of the group. It is similar to the COMMIT algorithm described above.

(1) Reply ABORT(T) to worker.

(2) Wait to receive next message.

The worker executes the following algorithm when there is no multicast or broadcast ipc mechanism.

(1) Send to GL for instructions.

(2) On receiving the Reply ABORT(T), obtain the list of participants.

(3) For each participant, Send an ABORT(T) message.

(4) If the participant does not reply within a timeout period, the message is retransmitted. If the participant responds ABORTED(T, NONE), then that participant has indicated there are no more transactions in that group holding R/W locks, so the worker can remove that participant from its list. If the participant responds ABORTED(T, MORE), then that participant has indicated there are more transactions in that group holding R/W locks, and the worker cannot remove that participant from its list. Go to step 3 until each participant has been notified.

(5) All the participants have responded ABORTED(T) at this point. Reply ABORTED(T) to GL.
Done.

Extension of the algorithm may be made if a broadcast or multicast feature is available, as for the 2-phase COMMIT described above.

5.2.5. Efficiency of Mechanisms

A comparison of the LOCUS and V-kernel implementations is shown in Table 5.2. In LOCUS, for each subaction opening n atomic files and later committing, there are 4+2n messages to the parent and the TSS. For top-level commit, a 2-phase protocol involving 4n messages (to the TSS) is executed. Thus if T -> T1 -> T2 -> T3 -> n atomic files, there are 3*(4+2n) messages between T, T1, T2, T3 and the TSS, plus 4n messages for the 2-phase top-level commit. This is shown in column 1 of Table 5.2. Thus, contrary to the claim in [Mueller 83], the 2-phase commit does not dominate at 3 or more levels of nesting of transactions.

The number of message events, where a message event is defined as a message being sent or received, is equal to 2 * (no. of messages) in this case, as each message is 1:1. As process context switching is often a dominant time-cost, we use this measure, which is equal to the number of message events. So in the above example, in the case of no aborts or data conflicts, the LOCUS system requires 24 + 20n context switches.

Table 5.2 A Comparison of LOCUS and V-kernel Nested Atomic Actions (no. of ipc messages)

                                       LOCUS      V-kernel         V-kernel
                                                  (no broadcast)   (broadcast)
NO ABORTS
Level m subtrans. opening n files      m*(4+2n)   4+2n             4+2n
Level 3 subtrans.                      3*(4+2n)   4+2n             4+2n
TOP COMMIT                             4n         2+4n             2+2n
TOTAL messages                         12+10n     6+6n             6+4n

ABORTS
Level m subtrans. aborts               2n         6+4n             7+3n
Level 1 trans. aborts                  12+8n      6+4n             7+3n

For our scheme, for each subaction commit there are no context switches. Therefore the only messages for the above model are to create the GL, to establish the participant list, and to execute the 2-phase commit protocol. The cost of establishing the GL is the overhead of 2 process creations (for the GL and its worker) = 4 ipc messages in the V-kernel.
Each participant in the group, P, does a Send to the GL. Therefore there are 2n messages for n participants, shown in column 2 of the Table. For top-level commit (no broadcast), the 2-phase protocol involving 2 messages from the GL to its worker, and 4n messages between the worker and the participants, is executed. Thus, after the 2-phase top-level commit for successful update of n files (without broadcast), (4+2n) + (2+4n) = 6+6n ipc messages, requiring 12 + 12n context switches, have been executed, as shown in the Table. This compares favourably with LOCUS' 24 + 20n context switches.

If a multicast algorithm is used to broadcast the TOP-COMMIT message in the V-kernel, only 2+2n messages instead of 2+4n are required, as shown in column 3 of Table 5.2.

Let us consider what these figures mean in real time. If n = 5 and the context switching time is 10 msecs (as it is for a Send-Receive-Reply between two UNIX 4.2BSD processes on a VAX 11/750), LOCUS takes 124*10 msecs = 1240 msecs to commit (ignoring message transit time), whereas our system takes 10*(12+12*5) = 720 msecs (no multicast), and with multicast takes 10*(6+20) = 260 msecs. Thus our algorithm is considerably faster in the normal case of no aborts or conflicts.

We now consider the exceptional case of a transaction aborting. In the LOCUS system, if a subtransaction ABORTs, a message is sent to each of the aborting subtransaction's participants (including those inherited from its committed children). If the transaction T has n participants, and if the lowest subtransaction ABORTs, LOCUS uses 2n messages. If the highest subtransaction ABORTs, LOCUS uses 3*(4+2n) + 2n = 12 + 8n messages. In our system, the subtransaction sends an ABORT message to the GL, which in turn sends ABORT to each participant before replying to the aborting transaction.
Thus in our system, if the lowest transaction ABORTs we have the following: on ABORT, 2 + either 2n (no multicast) or 1+n (if multicast) messages. So the TOTAL messages (1 ABORT, no timeouts) = 6 + 4n (no multicast) or 7 + 3n, as shown in the Table. This is the same for any level of transaction aborting.

Thus our system is faster than LOCUS at handling an ABORT if the aborting transaction is more than 2 levels higher than the file user. This situation would commonly occur in practice: the file-opener would be the lowest-level transaction, and it might well be aborted by a grandparent. Thus even the exception handling can be faster in our system, and the normal case always is faster.

CHAPTER 6

Conclusions

This thesis is another step towards the time when concurrent distributed programming is commonplace. Its contribution is in identifying exceptional events in the context of systems software, and in the development of a design model, illustrated by many examples and program paradigms, which shows how systems programs may be improved by exploiting exception mechanisms -- either by making the programs more efficient at run-time (according to certain criteria), and/or by making the program code easier to write through employing incremental program design. These software engineering techniques are applicable to all event-driven systems where the relative probability of the events is known a priori; such systems include hierarchic systems servers, communications protocols and utility programs. Thus the model is an extremely useful tool for solving a wide range of systems software design problems.

6.1. Summary of results

The thesis establishes a set of general principles for exploiting exceptions in systems programs by proposing a general model to achieve improved performance and functionality according to desired objectives.
The thesis also describes program models to carry out these objectives, and gives two extended examples which show the practicality of the design methodology. It is appropriate here to summarize these results, in the light of achieving the goals of the thesis, which are to show how to design efficient and modular systems programs exploiting exception mechanisms. The goals are elaborated as follows: to characterize the nature of exceptions in operating systems, to develop a model for designing efficient and modular software in a multi-process environment, and to provide program paradigms which employ the design principles by exploiting exception mechanisms. The goals are examined separately in the following subsections.

6.1.1. Characterise the Nature of Exceptions

An event-oriented view of systems has been taken, where an event manager responds to events, which may be externally or internally generated. This thesis shows that there are many systems programs which may be treated as event driven. Systems servers (or monitors), i/o library routines, and, less obviously, some data-driven utility programs come under this description. The exceptions encountered by an event manager may be handled either by the manager itself -- called peer exceptions -- or by the client at the level above -- called server-client exceptions. In multi-process systems, where a client is a process separate from the resource which it is using, exceptions from the event manager (called a server) to the client are called inter-process exceptions. Such exceptions, where the handler is at a different level to the detector, have been classified along three dimensions: viz. ignorable v. exceptions which demand attention, broadcast v. 1:1 exceptions, and asynchronous v. synchronous exceptions. These distinctions occur because the exception mechanisms may be different for the various classes of inter-process exceptions.
For example, notification of ignorable exceptions may be different to that for exceptions which must be handled, in a distributed system where a cheap unreliable message delivery may be exploited for ignorable exceptions. Similarly, broadcast exceptions from a server to several clients may be used to develop further cooperation between the client processes, and may therefore be handled differently to the exceptions which are 1:1 from server to client. This classification of exceptions therefore provides a very useful basis for research into exception mechanisms, and for multi-process systems design exploiting exception mechanisms.

6.1.2. Model for Systems Design

For event-oriented systems in which the relative probabilities of the events are known a priori, an appropriate design objective is chosen. The thesis model considers three objectives, namely: minimising the average run-time, minimising the exception handling time, and (less obviously) increasing the program's functionality. This latter objective was chosen as a new approach to systems design, which is particularly useful and appropriate in a multi-process environment.

Programs which are designed to minimise the average run-time should use the following design principles:

(1) The events are divided into 2 groups: the so-called normal events, and the other, exceptional events.

(2) Cost-benefit trade-offs may be made in the detection, handling, or context-saving of the normal-case events, depending on the number and probabilities of the events.

(3) The program should be designed to minimise the run-time by either reducing the detection cost of normal events, reducing the handling cost of normal events, or reducing the context saved while processing normal events.

This design may mean that the program is structured according to functionality rather than its logical purpose. In other words, it may pay to cut the program into vertical slices of function, rather than horizontal slices of modularity.
One approach is to structure the inter-process communication to minimise the message-flow in the statistically dominant cases, by recognizing the most common events in the system, and ensuring that they are processed with the minimum number of context switches. This can be achieved by choosing the event manager process which eventually handles most common events to handle all the events first, by adding request exceptions. This event manager processes the common requests quickly without any extra context switches, and it forwards the exception requests to other processes for handling. Another approach is to execute exception-handling code in parallel where possible, by distributing the exception-handling over multiple processes.

A general principle which leads to better design in a multi-process or distributed system is to consider exception handlers as separate processes. The notion of ignorable exceptions, in the sense that if they are not handled they have no effect on the correct operation of the program, may also be used to improve program functionality. These two tools used together enable a system to be developed without an exception handler, and then its functionality can be increased by the addition of a handler. Further, broadcast exceptions can be used to enhance cooperation between the clients of a server, and cheap unreliable message delivery can be safely used for ignorable exceptions.

The many examples illustrate that applying the model enables multi-process programs to be structured simply and in a modular way. Thus the model provides a very powerful software engineering tool for systems design methodology.

6.1.3. Program Paradigms for Systems Design

The design principles have been extended into usage models where the design alludes to a system-dependent feature, such as in the inter-process exception notification scheme.
One example is given in several operating systems and high-level languages, for multiplexing of asynchronous i/o events, where a server needs to notify a client of asynchronous input from more than one input device. Another example shows how cooperation can be achieved between the clients of a storage allocator server, using ignorable, broadcast messages, which may be unreliably delivered for the most efficient communication between processes. Finally, system-dependent mechanisms are given for handling the mutual exclusion problems which may arise if the exception handler is implemented as a separate process, including the use of language-level atomic actions.

In this way, the properties are isolated which high-level languages should have for implementing the design ideas of the model. Hence the systems and language approaches to systems design have been unified.

6.1.4. Program Examples

Two programs are described which were designed by the author to test the applicability and effectiveness of the model. The programs have improved structure and function by exploiting exceptions. Both programs run more efficiently than their counterparts which do not exploit exceptions. These examples show that the principles and tools are both practical and effective.

6.2. Further Work

The trend towards distributed systems continues -- to loosely-coupled systems without shared memory, tightly-coupled multiprocessor systems with shared memory, and towards hybrid systems, reflecting the move towards computer systems as concurrent cooperating modules executing an algorithm. New techniques for managing these distributed systems are required. The principles enunciated in this thesis for multi-process systems extend naturally into the domain of distributed systems.

One open question concerns the techniques for exploiting unreliable messages in networks of computers.
The use of broadcast (or multicast) messages is very convenient in local area networks where all the nodes receive the same message, thus reducing network traffic over many 1:1 messages. However, in long-haul networks, characterised by multiple hops between source and destination machine, the network traffic is usually increased for broadcast messages, because of the routing algorithms. Yet it is in precisely these networks that reliable message delivery is most costly to provide (in the sense of elapsed time for acknowledgements and in protocol processing overhead). The tradeoff between broadcasting unreliable messages and providing reliable 1:1 message delivery should be examined in these contexts.

The use of broadcast/multicast inter-process communication is not yet well developed; there are open questions on the best semantics of a broadcast Send; e.g. should the sender be able to specify whether to wait for one, many or all replies? The use of broadcast Send for exception mechanisms impacts these semantics, as the idea behind exception mechanisms is to take the load off the normal case processing, which can then be made as efficient as possible. Hence an unreliable broadcast Send in which the sender cannot specify, or need to know, the number of replies is considerably cheaper than a broadcast Send where a list of all recipients is needed for checking that all replies have been received. Performance estimates of these alternatives would be valuable in deciding the protocols and semantics of broadcast communication.

Another open question concerns the implementation of transaction scheduling sites (TSS) for atomic actions. The idea behind a TSS is to protect atomic data (which is shared by several processes), and is crucial for cooperation of access to shared data between exception handlers and other processes.
Shared data between the processes should be specifiable as atomic for simple synchronisation, and such atomic data should be efficiently managed by a TSS. Further work on this could include enhancements to high-level languages such as Modula-2, in which cooperating systems are written, to provide easy methods for specifying exceptions and exception handling in this way. Efficient algorithms for expedient and fair selection of events should be developed for event managers; the task of deciding which ready event to handle next cannot be left to the operating system in a distributed environment. New mechanisms are needed for distributed highly parallel computations, to reflect needs in both systems software and applications software written in high-level languages.

A higher-level programming problem which could employ our principles for good design, by the use of exception handling facilities in multi-process and/or distributed operating systems, is the SMALLTALK system [Goldberg 84]. SMALLTALK is an important step towards message-based concurrent programming in AI applications. If it could be distributed over a local area network by enhancing the language with unreliable exception notification, its programming power might be considerably enhanced.

References

1. Ahamad, M. and Bernstein, M.J., "Multicast Communication in UNIX," Proceedings 5th International Conference on Distributed Computing Systems, pp. 80-87, May 1985.
2. Andrews, G.R., "Synchronizing resources," ACM Transactions on Programming Languages and Systems, vol. 3(4), pp. 405-430, 1981.
3. Andrews, G.R., "The Distributed Programming Language SR -- Mechanisms, design and implementation," Software P&E, vol. 12(8), pp. 719-754, August 1982.
4. Andrews, G.R. and Schneider, F.B., "Concepts and Notations for Concurrent Programming," ACM Computing Surveys, vol. 15(1), March 1983.
5. Atkins, M.S. and Knight, B.J., "Experiences with Coroutines in BCPL," Software P&E, vol. 13(8), Aug.
1983a.
6. Atkins, M.S., "Exception Handling in Communication Protocols," Proceedings 8th Data Communications Conference, Oct. 1983b.
7. Atkins, M.S., "Notations for Concurrent Programming," Surveyors' Forum, ACM Computing Surveys, vol. 15(4), Dec. 1983c.
8. Beander, B., "VAX DEBUG: An Interactive, Symbolic, Multilingual Debugger," Proceedings ACM SIGSOFT/SIGPLAN Software Engineering Symposium on High-Level Debugging, Aug. 1983.
9. Bentley, J.L., Writing Efficient Programs, Prentice-Hall Software Series, New Jersey, 1982.
10. Black, A., "Exception Handling: the case against," University of Washington Technical Report, 1982.
11. Bochmann, G.V., "Finite state description of communication protocols," Computer Networks, vol. 2(4/5), pp. 361-372, Oct. 1978.
12. Bourne, S.R., Birrell, A.D., and Walker, I., "ALGOL68C Reference Manual," University of Cambridge Computer Laboratory, 1975.
13. Bron, C., Fokkinga, M.M., and De Haas, A.C.M., "A Proposal for dealing with abnormal termination of programs," Twente University of Technology, Mem. Nr. 150, The Netherlands, Nov. 1976.
14. Brownbridge, D.R., Marshall, L.F., and Randell, B., "The Newcastle Connection or UNIXes of the World Unite!," Software P&E, vol. 12(7), pp. 1147-1162, 1982.
15. Bunch, S.R. and Day, J.D., "Control Structure Overhead in TCP," Proceedings IEEE Trends and Applications: Computer Network Protocols, Gaithersburg, Maryland, pp. 121-127, May 1980.
16. CCG, "Datapac Interactive Terminal Interface (ITI) Specification," Trans Canada Telephone System, Ottawa, Ontario, Oct. 1978.
17. CCITT, Interface between DTE and DCE for terminals operating in the packet mode on public data networks, March 1976.
18. Chanson, S.T., Ravindran, K., and Atkins, M.S., "Performance Evaluation of the Transmission Control Protocol in a Local Area Network Environment," Canadian INFOR -- Special Issue on Performance Evaluation, vol. 23(3), pp. 294-329, Aug. 1985a.
19.
Chanson, S.T., Ravindran, K., and Atkins, M.S., "LNTP -- An Efficient Transport Protocol for Local Area Networks," UBC Computer Science Technical Report 85-4, University of British Columbia, Feb. 1985b.
20. Cheriton, D.R., Malcolm, M.A., Melen, L.S., and Sager, G.R., "Thoth, a portable real-time operating system," CACM, vol. 22(2), pp. 105-115, Feb. 1979a.
21. Cheriton, D.R., "Multi-process Structuring and the Thoth Operating System," UBC Computer Science Technical Report, University of British Columbia, March 1979b.
22. Cheriton, D.R. and Steeves, P., "Zed Language Reference Manual," UBC Computer Science Technical Report 79-2, University of British Columbia, Sept. 1979c.
23. Cheriton, D.R., "Distributed I/O using an object-based protocol," UBC Computer Science Technical Report 81-1, University of British Columbia, Jan. 1981.
24. Cheriton, D.R., The Thoth System: Multi-Process Structuring and Portability, Elsevier North-Holland, New York, 1982.
25. Cheriton, D.R., "Local Networking and Internetworking in the V-System," Proceedings 8th Data Communications Conference, Oct. 1983a.
26. Cheriton, D.R. and Zwaenepoel, W., "The Distributed V-kernel and its Performance for Diskless Workstations," Proceedings of the 9th Symposium on Operating Systems Principles, Oct. 1983b.
27. Cheriton, D.R., "One-to-Many Interprocess Communication in the V-System," Proceedings Fourth International Conference on Distributed Computing Systems, May 1984.
28. Clark, D., Proceedings SIGCOMM '83, 1983.
29. DARPA, "Internet program protocol specifications -- Internet Protocol, Transmission Control Protocol," Information Sciences Institute, USC, CA, RFC 791, 793, Sept. 1981a.
30. DEC, "BLISS-11 Programmer's Manual," Digital Equipment Corporation, Maynard, Mass., 1974.
31. Deering, S.E., "Multi-process structuring of X.25 software," UBC Computer Science Technical Report 82-11, University of British Columbia, Oct. 1982.
32.
Dijkstra, E.W., "Guarded Commands, nondeterminacy, and formal derivation of programs," CACM, vol. 18(8), August 1975.
33. Feldman, J.A., "High-level Programming for Distributed Computing," CACM, vol. 22(6), pp. 363-368, June 1979.
34. Gentleman, W.M., "Message passing between sequential processes: The Reply primitive and the administrator concept," Software P&E, vol. 11(5), May 1981.
35. Gentleman, W.M., "Using the Harmony Operating System," National Research Council Canada, Division of Electrical Engineering, Ottawa, Ont., NRCC/ERB-966, Dec. 1983.
36. Geschke, C.M. and Satterthwaite, E.H., "Exception Handling in Mesa," XEROX PARC report, Palo Alto, 1977.
37. Goldberg, A., SMALLTALK-80: The Interactive Programming Environment, Addison-Wesley, 1984.
38. Goodenough, J.B., "Exception Handling: Issues and a Proposed Notation," CACM, vol. 18(12), pp. 683-696, Dec. 1975.
39. Herbert, A.J., CAP Operating System Manual, University of Cambridge Computer Laboratory, 1978.
40. Hoare, C.A.R., "Monitors: An operating system structuring concept," CACM, vol. 17(10), pp. 549-557, Oct. 1974.
41. Hoare, C.A.R., "Communicating Sequential Processes," CACM, vol. 21(8), pp. 666-677, Aug. 1978.
42. IBM, "PL/I(F) Language Reference Manual," Form GC28-8201, IBM Corporation, 1970.
43. Joy, W., UNIX 4.2BSD Operating System, 1983.
44. Lampson, B.W., Mitchell, J.G., and Satterthwaite, E.H., "On the Transfer of Control Between Contexts," Lecture Notes in Computer Science, vol. 19, B. Robinet (ed.), Springer-Verlag, N.Y., pp. 181-203, 1974.
45. Lampson, B.W. and Redell, D.D., "Experience with Processes and Monitors in Mesa," CACM, vol. 23(2), pp. 105-117, Feb. 1980.
46. Lantz, K.A., "Uniform Interfaces for Distributed Systems," PhD Thesis, University of Rochester, 1980.
47. Lauer, H.C. and Needham, R.M., "On the Duality of Operating System Structures," Operating Systems Review, vol. 13(2), pp. 3-19, April 1979.
48. Leach, P.A.
and Levine, P.H., "The Architecture of an Integrated Local Network," IEEE Journal on Selected Areas in Communications, vol. SAC-1(5), pp. 842-856, Nov. 1983.
49. Lee, P.A., "Exception Handling in C Programs," Software P&E, vol. 13(5), pp. 389-405, 1983.
50. Levin, R., "Program Structures for Exceptional Condition Handling," PhD Thesis, Carnegie-Mellon University, 1977.
51. Levin, R., Personal Communication, May 1982.
52. Liskov, B., "CLU Reference Manual," Computation Structures Group Memo 161, MIT Laboratory for Computer Science, July 1978.
53. Liskov, B. and Snyder, A., "Exception Handling in CLU," IEEE Transactions on Software Engineering, vol. SE-5(6), pp. 546-558, November 1979.
54. Liskov, B. and Scheifler, R., "Guardians and Actions: Linguistic Support for Robust, Distributed Programs," ACM Transactions on Programming Languages and Systems, vol. 5(3), July 1983.
55. Lockhart, T.W., "The Design of a Verifiable Operating System Kernel," UBC Computer Science Technical Report 79-15, University of British Columbia, Nov. 1979.
56. Lomet, D.B., "Process structuring, synchronization and recovery using atomic transactions," Proc. ACM Conf. Language Design for Reliable Software, SIGPLAN Notices, vol. 12(3), pp. 128-137, March 1977.
57. Luckham, D.C. and Polak, W., "ADA Exception Handling: An Axiomatic Approach," ACM Transactions on Programming Languages and Systems, vol. 2(2), pp. 225-233, April 1980.
58. Macintosh, Inside Macintosh, Apple Computers, 1984.
59. Malcolm, M., Bonkowski, B., Stafford, G., and Didur, P., "The Waterloo Port Programming System," Technical Report, Department of Computer Science, University of Waterloo, Waterloo, Ontario, Canada, January 1983.
60. Metcalfe, R.M. and Boggs, D.R., "Ethernet: distributed packet switching for local computer networks," Communications of the ACM, vol. 19(7), pp. 395-404, July 1976.
61. Mitchell, J.G., Maybury, W., and Sweet, R., "Mesa Language Manual, version 5.0," Rep.
CSL-79-3, Xerox PARC, April 1979.
62. Moss, J.E.B., "Nested Transactions: An Approach to Reliable Distributed Computing," PhD Thesis, Massachusetts Institute of Technology, MIT/LCS/TR-260, April 1981.
63. Mueller, E.T., Moore, J.D., and Popek, G.J., "A Nested Transaction Mechanism for LOCUS," Proceedings of the 9th Symposium on Operating Systems Principles, Oct. 1983.
64. Nelson, B.J., "Remote Procedure Call," PhD Thesis, Rep. CMU-CS-81-119, Department of Computer Science, CMU, May 1981.
65. Ousterhout, J.K., Scelza, D.A., and Sindhu, P.S., "Medusa: an experiment in operating system design," CACM, vol. 23(2), pp. 92-105, Feb. 1980.
66. Parnas, D.L., "On a Buzzword: Hierarchical Structure," Proc. IFIP Congress, North-Holland Publ. Co., 1974.
67. Peterson, J.L. and Silberschatz, A., Operating System Concepts, Prentice-Hall, 1983.
68. Popek, G., Walker, B., Chow, J., Edwards, D., Kline, C., Rudisin, G., and Thiel, G., "LOCUS: A Network Transparent, High Reliability Distributed System," Proceedings of the 8th Symposium on Operating Systems Principles, ACM, pp. 169-177, Dec. 1981.
69. Randell, B., "System structure for Fault Tolerance," Proc. Int'l Conf. on Reliable Software, Los Angeles, 1975.
70. Randell, B., Lee, P.A., and Treleaven, P.C., "Reliability Issues in Computing Systems Design," Computing Surveys, vol. 10(2), pp. 123-165, June 1978.
71. Redell, D.D., Dalal, Y.K., Horsley, T.R., Lauer, H.C., Lynch, W.C., McJones, P.R., Murray, H.G., and Purcell, S.C., "PILOT: an operating system for a personal computer," Communications of the ACM, vol. 23(2), pp. 81-92, Feb. 1980.
72. Reed, D.P., "Implementing Atomic Actions on Decentralised Data," ACM Trans. Computer Systems, vol. 1(1), pp. 3-23, Feb. 1983.
73. Reid, L.G. and Karlton, P.L., "A File System Supporting Cooperation between Programs," Proceedings of the 9th Symposium on Operating Systems Principles, Oct. 1983.
74. Richards, M.
and Whitby-Strevens, C., BCPL -- The Language and its Compiler, Cambridge University Press, 1979.
75. Richards, M., Aylward, A.R., Bond, P., Evans, R.D., and Knight, B.J., "TRIPOS -- a portable operating system for mini-computers," Software P&E, vol. 9, pp. 513-526, 1979.
76. Ritchie, D.M. and Thompson, K., "The UNIX Time-Sharing System," CACM, vol. 17(7), pp. 365-375, July 1974.
77. Ritchie, D.M., Johnson, S.C., Lesk, M.E., and Kernighan, B.W., "The C Programming Language," The Bell System Technical Journal, vol. 57(6), pp. 1991-2021, July-Aug. 1978.
78. Sammet, J.E., Programming Languages: History and Fundamentals, Prentice-Hall, Inc., N.J., 1969.
79. Scotton, G.R., "An experiment in High level protocol design," MSc Thesis, Department of Computer Science, University of British Columbia, Dec. 1981.
80. Schlichting, R.D. and Schneider, F.B., "Using Message Passing for Distributed Programming: Proof Rules and Disciplines," Technical Report 82-491, May 1982.
81. Sindhu, P.S., "Distribution and Reliability in a Multiprocessor Operating System," CMU Technical Report CMU-CS-84-125, 1984.
82. Spector, A., "Performing Remote Operations Efficiently on a Local Computer Network," Communications of the ACM, vol. 25(4), April 1982.
83. US Department of Defense, ADA Programming Language Military Standard, MIL-STD-1815, Washington, D.C., Dec. 1980.
84. van Wijngaarden, A., "Revised Report on the Algorithmic Language ALGOL 68," Acta Informatica, vol. 5, Fasc. 1-3, 1975.
85. Wilkes, M.V. and Needham, R.M., The Cambridge CAP Computer and its Operating System, Elsevier North-Holland, 1979.
86. Wilkes, M.V. and Needham, R.M., "The Cambridge Model Distributed System," Operating Systems Review, vol. 14, pp. 21-29, 1980.
87. Wirth, N., Programming in Modula-2, Springer-Verlag, New York, 1982.
88.
Wulf, W.A., London, R.L., and Shaw, M., "An Introduction to the Construction and Verification of Alphard Programs," IEEE Transactions on Software Engineering, vol. SE-2(4), pp. 253-265, Dec. 1976.
89. Zimmerman, H., "OSI Reference Model -- The ISO Model of Architecture for Open Systems Interconnection," IEEE Trans. Commun., vol. COM-28, pp. 425-432, April 1980.

Appendix A. The putbyte program.

This appendix describes an example of minimising the average run-time by reducing the normal case detection costs in the putbyte utility program in the Verex i/o system [Cheriton 81]. This system provides, at the application level, a similar model of selecting input and output as is used in several implementations of BCPL [Richards 79a]. The Verex Z-language i/o library [Cheriton 79c] provides the routines selectoutput and putbyte to output a character. In Z, as in many C-like languages, the i/o library routines share global data. Putbyte and selectoutput share the global variable selected-output, from which the variables selected-output.ptr and selected-output.bufcnt are accessed. The routine putbyte may conveniently be considered to have states UNINITIALISED and RUNNING. The events putbyte has to deal with are the request for initialisation, initialise, and the request to put a byte to the selected output, putcharacter. Initialisation must occur before any other calls to putbyte are accepted. The user must call selectoutput to cause initialisation of the global variable selected-output, which influences the state of putbyte.¹ If the programmer tries to output a character when there is no selected output device, an exception occurs. This exception is a server-client type, which is detected by the server routine (in this case, putbyte), which may perform some handling before passing the exception on for handling by the client, putbyte's caller.
¹In more modular languages, the global data and the routines manipulating it would be encapsulated in a module, and invocations of selectoutput and putbyte would be treated as request events upon which appropriate action must be taken: some events would cause state transitions and others would not. In the unstructured C-like languages, relevant global variables may similarly be treated as state variables, and calls to routines which manipulate these, such as selectoutput and putbyte, as events which cause state transitions.

Now the cost of handling a byte when the local buffer is full is much more than when it is not full. Thus the execution cost of the routine depends on the buffer state as well as the event, so violating the assumption of constant-cost events. The event set is increased to reflect constant cost, by dividing putbyte's state of RUNNING into two new states, BUFFER-FULL and BUFFER-NOT-FULL, so there is a constant cost of handling each event-state pair. Thus there are 3 states, UNINITIALISED, BUFFER-FULL, BUFFER-NOT-FULL, and two external events, initialise and putcharacter. By combining these into constant-cost state-event handling, the events s1-s4 defined in Table A.1 below are obtained².

Table A.1 Execution costs for the putbyte routine

event                               probability | handling | detection | expected  | new       | new expected
                                                | cost     | cost      | exec cost | detection | exec cost
s1 = initialisation                 0.00025     | 3000     | 0         | 0.75      | 0         | 0.75
s2 = putbyte-uninitialised          10^-6       | 1000     | 1         | 0.001     | 2         | 0.001
s3 = putbyte-with-buffer-full       0.001       | 1000     | 2         | 1.002     | 2         | 1.002
s4 = putbyte-with-buffer-not-full   0.999       | 2        | 2         | 3.996     | 1         | 2.997
Total expected costs/event                                 |           | 5.75      |           | 4.75

A regular implementation of putbyte might be of the following form: the routine error handles the server-client exception NO-SELECTED-OUTPUT by printing a message and making a synchronous error return to the client.
The routine emptybuf handles the peer exception PUTBYTE-BUFFER-FULL, which is transparent to the user (except that the call to putbyte may take longer).

²For simplicity, the rare combinations of initialise with states BUFFER-FULL and BUFFER-NOT-FULL are ignored.

putbyte(ch)
{
    if selected-output = NULL then
        error(NO-SELECTED-OUTPUT);
    selected-output.ptr++ = ch;
    if selected-output.bufcnt++ = 1000 then
        emptybuf();
}

emptybuf()
{
    write( selected-output.buf, selected-output.bufcnt );
    selected-output.bufcnt = 0;
}

This program is now analysed to see where its efficiency can be improved. To obtain the probabilities of each event, assume initialisation (i.e. a call of selectoutput) is performed just once for each set of data, and its cost is independent of the previous state of putbyte. If the average length of the data to be written is 4000 characters, then p(s1) = 0.00025. Assuming that a request to putbyte with no selected output occurs very rarely, say 1 in 10^6 bytes written, then p(s2) = 10^-6; and if the buffer is 1000 characters, then p(s3) = 0.001, so the remaining event, putbyte-with-buffer-not-full, occurs with probability 0.999. Note that no kind of division is made here between the events.

Assuming that the routine initialise which handles the request exception call of selectoutput takes approximately 3000 instructions, the routine emptybuf takes 1000 instructions, and the routine error also takes 1000 instructions, and say just 2 instructions are needed to insert the byte into the buffer and update the counter for event s4, then the handling costs for each event are as described in column 2 of Table A.1. Now the cost of detecting event s1 in putbyte is zero, because initialisation is handled in a separate routine. Assuming each test costs one instruction, the detection costs are as shown in column 3 of Table A.1. Hence the expected average run-time cost = 5.75 instructions/event.
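The arithmetic behind Table A.1 can be checked mechanically. The sketch below is not part of the thesis's Verex code; it simply recomputes the expected cost per event as the sum of p_i * (handling_i + detection_i), using the probabilities and instruction counts assumed above, for both the original and the remapped detection schemes.

```c
#include <assert.h>
#include <math.h>

/* Probabilities and instruction costs for events s1..s4 from Table A.1. */
static const double p[4]       = { 0.00025, 1e-6, 0.001, 0.999 };
static const double handling[4]= { 3000, 1000, 1000, 2 };
static const double det_old[4] = { 0, 1, 2, 2 };  /* original detection tests  */
static const double det_new[4] = { 0, 2, 2, 1 };  /* after mapping s2,s3 together */

/* Expected run-time cost per event: sum over i of p_i*(handling_i + detection_i). */
double expected_cost(const double det[4])
{
    double cost = 0.0;
    for (int i = 0; i < 4; i++)
        cost += p[i] * (handling[i] + det[i]);
    return cost;
}
```

Evaluating both schemes reproduces the table's totals of 5.75 and 4.75 instructions/event (to rounding), confirming the 17.4% saving claimed below.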
To minimise the run-time cost, the detection cost of the commonest event, s4, should be reduced to just one test. Unfortunately this event cannot be tested explicitly; it is assumed to be none of the others. The approach is to follow the general principle of mapping several exception events onto one to reduce detection costs. Here, performance can be improved by mapping s2 and s3 together (as the initialisation exception s1 is already separated out through a call to a separate routine).

The detection of the server-client exception PUTBYTE-UNINITIALISED can be mapped onto the detection of the peer exception BUFFER-FULL by arranging that the uninitialised global variable selected-output points to a dummy file which has a full buffer. The two exceptional events s2 and s3 are then detected in just one test. Individual exceptions s2 and s3 are separated in the handler emptybuf. An appropriately modified program reads as follows:

putbyte(ch)
{
    if selected-output.bufcnt++ < 1000 then
        selected-output.ptr++ = ch;
    else
        emptybuf(ch);
}

emptybuf(ch)
{
    if selected-output = dummy-file then
        error(NO-SELECTED-OUTPUT);
    selected-output.ptr = ch;
    write( selected-output.buf, selected-output.bufcnt );
    selected-output.bufcnt = 0;
    selected-output.ptr = selected-output.buf;
}

The new program's detection and execution costs are given in the last two columns of Table A.1. The average run-time cost is reduced by 1/5.75 = 17.4%, achieved by dividing the events into two sets -- {s4}, corresponding to a normal event, and {s2, s3}, corresponding to exceptional events -- and by modifying the program so that the normal event is detected in just one test.

Appendix B. The os program.

This appendix describes an example of minimising the average run-time by restructuring a utility program, os, to reflect the statistically dominant case rather than the logical flow.
Os converts a formatted text file containing backspace characters for underlining and boldfacing, to a file achieving the same printed effect using overprinting of whole lines, and containing no backspaces. This data-driven utility can be treated as an event-driven program, where the input value of each character read represents an event.

In its original structure, the program uses multiple line buffers, one for each level of overprinting on the line. Each character except backspace and line terminators is placed in the highest-level line buffer that is currently unoccupied in the current character position. Thus, its structure reflects the logical operation it is performing, translating a character stream into overprinting line buffers. However, its structure also means that the character it processes with the least overhead is the backspace character. Given that backspaces constitute about 0.5% of the characters in most text files, this program is structured inefficiently. An initial version of the program is given in Figure B.1.

Using the same approach as in the putbyte example, the program is divided into several state-event pairs, each with constant cost over execution of the event. The main external event is reading of the next character from the text file. With the program structure described above, the cost of the event depends on what the character is -- NEWLINE, BACKSPACE, END-OF-FILE, and all the others. If the program has one state, RUNNING, the constant-cost event-state combinations for analysis are as shown in Table B.1 below.
while ( byte != END-OF-FILE )
begin
    read( byte );
    if byte = NEWLINE then
        writeline()³;
    else if byte = BACKSPACE then
        cc--;
    else begin
        for (i=0; i < MAXOVERPRINTS; ++i )
        begin
            line = Lineptr[i];
            if line[cc] = BLANK then
            begin
                line[cc] = byte;
                cc++;
                Lastx[i] = max{Lastx[i], cc};
                break;
            end;
        end;
    end;
end;

writeline()
begin
    for (i=0; i < MAXOVERPRINTS; ++i )
    begin
        last = Lastx[i];
        if last = 0 then break;
        line = Lineptr[i];
        for (j=0; j<last; ++j )
        begin
            put( line[j] );
            line[j] = BLANK;
        end;
        Lastx[i] = 0;
    end;
    put( NEWLINE );
    cc = 0;
end;

Figure B.1 The Initial Version of Os.

³Note that writeline() would be expanded as an in-line macro.

Table B.1 Execution costs for the os utility

event            probability | handling | detection | expected    | new      | new       | new expected
                             | cost     | cost      | exec cost   | handling | detection | exec cost
BACKSPACE        0.005       | 1        | 3         | 0.02        | 50W+160  | 2         | 0.25W+0.8
NEWLINE          0.01        | 125W+382 | 2         | 1.25W+3.84  | W+1      | 2         | 0.01W+0.03
END-OF-FILE      10^-5       | 0        | 1         | 10^-5       | 0        | 1         | 10^-5
REMAINING CHARS  0.985       | 7        | 3         | 9.85        | W+1      | 2         | 0.985W+2.955
Total expected costs/event               |           | 1.25W+13.71 |          |           | 1.24W+3.785

Assuming BACKSPACE occurs say 1 in 200 characters, p(BACKSPACE) = 0.005; similarly, if the average length of the line is 100 characters, then p(NEWLINE) = 0.01. Assuming the average text file is 10000 characters, then p(END-OF-FILE) = 10^-5. Furthermore, it is assumed the system function read(byte) takes R instructions, which are required to handle every event. Now the cost of handling NEWLINE depends on how many backspace characters there are in that line, so either more events should be generated to represent this, or the average handling costs over all lines should be taken. The latter approach is used here, noting that there is approximately one backspace character on every other line. The average handling cost per newline character is the average of a line without a backspace and a line with a backspace character half way along.
This costs approximately ((100(W+3)+5) + (150(W+3)+10))/2 = 125W + 382, where the system function put(byte) takes W instructions, and the cost of executing a for-loop is 2 instructions per iteration⁴. From Table B.1 the expected cost of execution of the initial version = (1.25W + R + 13.71) instructions/event⁵.

⁴Note that more characters are written than read, because of the extra blank characters in the overstrike buffers.

This program does NOT reflect any structuring for the statistically dominant case -- indeed it processes one of the least likely bytes (BACKSPACE) most efficiently. The program can be rewritten to reflect the statistically dominant case of normal characters by writing it as a null filter copying its input to its output unchanged. Starting with this view, the exceptions to recognize are END-OF-FILE, a server-client exception, and BACKSPACE, a peer exception transparent to the user. This leads to the following program structure:

while ( byte != END-OF-FILE )
begin
    read( byte );
    if byte = BACKSPACE then
        HandleException();
    else
        put( byte );
end;

In the HandleException routine, characters are read and stored till the next newline into line buffers for overprinting, similar to the initial program which used buffers for all lines. This normal-case processing is not sufficient as it stands to handle the BACKSPACE exception, because it is necessary to know the current character position on the line when a backspace is encountered. The final program is structured as above with this additional context-saving code, which takes the form of the insertion of the following lines in the else clause above:

if byte = NEWLINE then
    col = 0;
else
    col++;

This program was measured as twice the speed of the original on most text files, and, as Cheriton pointed out, this is contrary to the expectation that the gain would be small because the fixed file access overhead would dominate.
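The restructured main loop can be sketched as runnable C. This is an illustrative sketch only, not the thesis's os program: it reads from and writes to in-memory buffers rather than files, and handle_exception is a stub that merely records where the backspace occurred, standing in for the overprint-buffer processing described above. The function name os_filter and the use of '\0' to play the role of END-OF-FILE are assumptions of the sketch.

```c
#include <assert.h>
#include <string.h>

/* Records the column at which the last BACKSPACE exception occurred;
 * the real HandleException would gather the line into overprint buffers. */
static int last_exception_col = -1;

static void handle_exception(int col) { last_exception_col = col; }

/* Null filter with context-saving column count: copies src to dst,
 * diverting only the rare BACKSPACE event to the exception handler.
 * Returns the number of bytes written. */
int os_filter(const char *src, char *dst)
{
    int col = 0, out = 0;
    for (const char *p = src; *p != '\0'; p++) {  /* '\0' plays END-OF-FILE */
        if (*p == '\b') {
            handle_exception(col);                /* exceptional, rare case */
        } else {
            dst[out++] = *p;                      /* dominant case: copy through */
            col = (*p == '\n') ? 0 : col + 1;     /* the context-saving code */
        }
    }
    dst[out] = '\0';
    return out;
}
```

The sketch shows the shape of the fast path: normal characters cost one test plus a copy, and all line-buffer machinery is confined to the handler.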
⁵For simplicity the constant cost R for each event is excluded from the Table.

An analysis of this program reveals that the new execution cost is composed of three parts: the handling cost, the detection costs and the context-saving cost (= 1 instruction). These are shown in Table B.1, where the context-saving cost is merged with the handling cost. In this version the cost of handling the BACKSPACE exception in HandleException includes the cost of reading and storing characters till the next newline -- which means that the cost of handling an ordinary byte depends on whether a backspace has already occurred on a line, or not. To preserve comparison with the first os program, all the extra costs in processing the ordinary bytes on a line with a backspace are ascribed to the BACKSPACE character. The cost of handling every ordinary character is then W + 1, and the cost of handling BACKSPACE is the extra cost in handling the line buffers. Assuming as before that a BACKSPACE appears half way along a line, the expected cost of executing HandleException is the cost of reading and writing one line buffer of half the average length = 50(W+3) + 10. Then the total expected cost of this version of the program is approximately (R + 1.24W + 3.8) instructions/event.

This final program fragment illustrates the use of a readback facility, as described in section 3.3.3.1, which would reduce context-saving costs in the processing of normal events, by enabling the last NEWLINE character to be located in the input document, thus obviating the need to preserve a column count.

while ( byte != END-OF-FILE )
begin
    read( byte );
    if byte = BACKSPACE then
        HandleException();
    else
        put( byte );
end;

HandleException( byte )
begin
    col = 0;
    oldbyte = NULL;
    while oldbyte != NEWLINE do
    begin
        readback( oldbyte );
        col++;
    end;
    newread( byte );
    <process-chars-as-before>;
end;

Appendix C. The TROFF/TBL/EQN system.
In a Thoth-like operating system, a UNIX-like pipe system could be implemented using multiple processes, by using a pipe worker process to connect two system processes, as shown below in Figure C.1.

Figure C.1 Pipe Worker Processes

The TBL, EQN and TROFF processes act like servers, waiting on a Receive statement for either data from the client, or for the WORKER-READY message from their pipe worker. After processing a client request (usually a text line or less), if the worker is free, the server makes a Reply to the worker⁶ before making a Reply to its client and completing its loop by awaiting another request. If a client request cannot be serviced immediately because the worker is busy, the server queues it (for later processing by the worker), and makes a Reply to its client, as before. The capacity of the queue can be set to the capacity of a UNIX pipe. If the queue becomes full, the server withholds the Reply to the client until the worker empties the queue. The worker processes the request by executing a Send to the server at the level below. When the worker becomes free it executes a Send to its master for more data.

Thus in the above system, structured as a hierarchy of clients and servers, there are 6 x 2 = 12 process switches for each data item in the source file -- from CL to TBL to PW1 to EQN to PW2 to TROFF to WRITER -- and back again. Suppose an average document has 5% of its text as tables to be processed by TBL, 5% as equations to be processed by EQN, and 90% as normal text to be formatted only by TROFF. Then this system structure does not reflect the statistically dominant case.

The system can be restructured so the inter-process communication is minimised for the statistically dominant case by treating lines with equations or tables as exceptional. All lines are sent to TROFF first.
The user's document is read by a client process, CL, which executes a Send to TROFF for each unit of data. TROFF executes a Receive and processes normal text, then executes a Send to the level below (say to process WRITER) before executing a Reply to CL, which then proceeds synchronously with the next item of data from the user's document. Thus for each item of normal text there are 4 process switches: from CL to TROFF to WRITER to TROFF to CL. This is illustrated below in Figure C.2.

⁶ Care must be taken to ensure that the data is copied into a safe place before the server makes a Reply to its client.

[Figure C.2: Message Traffic for Handling an Equation]

TROFF detects when a line is of the exceptional form EQN or TBL and handles it by forwarding it to the appropriate server, where it is processed as before and then re-sent to TROFF for further processing. Thus 7 process switches are incurred for equation text: from CL to TROFF to EQN to TROFF to WRITER to TROFF to EQN to CL. TROFF forwards subsequent input to EQN until the terminating .EN line occurs. A similar scheme can be used for TBL text and for mixed EQN and TBL text. If an average document has 5% of its text as tables to be processed by TBL, 5% as equations to be processed by EQN, and 90% as normal text to be formatted only by TROFF, then in a 1000-line document there are 4*900 + 7*50 + 7*50 = 4300 context switches, compared with 12*1000 = 12000 for the pipe system.
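The comparison can be reproduced directly, taking the line counts and per-line switch counts from the text above:

```python
def restructured_switches(normal_lines, eqn_lines, tbl_lines):
    # Restructured system: 4 switches for a normal line,
    # 7 for a line routed through EQN or TBL.
    return 4 * normal_lines + 7 * (eqn_lines + tbl_lines)

def pipe_switches(total_lines):
    # Pipe system: every line costs 12 switches regardless of content.
    return 12 * total_lines

# 1000-line document: 90% normal, 5% equations, 5% tables
restructured = restructured_switches(900, 50, 50)
pipe_system = pipe_switches(1000)
```

The restructured system thus saves roughly two thirds of the context switches precisely because the dominant (normal-text) case takes the cheap path.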
