A dynamically reconfigurable and extensible operating system Veitch, Alistair Craig 1998

A Dynamically Reconfigurable and Extensible Operating System

by Alistair Craig Veitch
B.Sc., University of Waikato, 1988
M.Sc. (Hons), University of Waikato, 1990

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES, Department of Computer Science

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
July 1998
© Alistair Craig Veitch, 1998

In presenting this thesis in partial fulfillment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Computer Science
The University of British Columbia
2366 Main Mall
Vancouver, BC Canada V6T 1Z4

Abstract

Operating systems are constantly getting more complex in the functionality they support, due to the increasing demands made by modern hardware and software innovations. Basing the kernel design on co-operating and modular services incorporating a flexible communications infrastructure with run-time binding makes the operating system dynamically configurable and extensible. These features aid in the management of system complexity, while also resulting in several software engineering and performance benefits. Configurability gives the operating system designer and implementor the freedom to build a large number of components, which can be composed into different configurations depending upon the final system requirements.
System components can be built and debugged in a user address space, and then transparently migrated into the kernel address space for performance once they have been demonstrated correct. This removes one of the major obstacles to developing kernel services: the necessity to reboot the system after each change to the service code. The system administrator can also reconfigure the system, providing similar advantages, and allowing dynamic system upgrades to be made, reducing system downtime.

Extensibility lets new functionality be integrated into the operating system. This can be done on an application-specific basis. This enables the development of in-kernel applications in cases where high performance is required, such as for dedicated file servers. It is also possible for applications to interpose specialised kernel services, allowing them to dramatically increase their performance and aggregate system throughput when the default system policies are ill-matched to their behaviour.

The Kea operating system has been designed and implemented to be dynamically configurable and extensible. The design of the system features that make this possible is described. Experimental results are shown that demonstrate that Kea offers comparable performance to a traditional operating system on the same hardware, and that extensibility can be used to increase performance for selected applications.

Table of Contents

Abstract
Table of Contents
Acknowledgements
Chapter 1 Introduction
  1.1 Operating System Challenges
  1.2 Problem Definition
  1.3 Research Contributions
  1.4 Design Overview
    1.4.1 Overall Systems Structure
    1.4.2 Inter-Service Communications
    1.4.3 Extensibility
    1.4.4 Application Specificity
  1.5 Thesis Organisation
Chapter 2 Related Work
  2.1 System Structure
    2.1.1 Multics
    2.1.2 Hydra
    2.1.3 Microkernels
    2.1.4 Chorus
    2.1.5 Mach
    2.1.6 Lipto
    2.1.7 Spring
    2.1.8 Single Address Space Systems
    2.1.9 Protected Shared Libraries
    2.1.10 The Flux OSKit
    2.1.11 Conclusions on Structure
  2.2 Application Specific Extensibility
    2.2.1 Single User Systems
    2.2.2 SPIN
    2.2.3 Vino
    2.2.4 Exokernels
    2.2.5 Meta-Object Systems
    2.2.6 Synthetix
    2.2.7 Spring
    2.2.8 Other Approaches and Systems
    2.2.9 Conclusions on Application Specific Extensibility
  2.3 Summary of Related Work
Chapter 3 The Kea Architecture
  3.1 Kernel As Services
  3.2 Domains
  3.3 Events
  3.4 Names
  3.5 Threads
  3.6 Inter-Domain Calls
    3.6.1 Portals
    3.6.2 Portal Remapping
  3.7 Services
  3.8 Service Acquisition
  3.9 Internal Service Structure
  3.10 Service Manipulation
    3.10.1 Service Migration
    3.10.2 Service Replacement
    3.10.3 Service Interposition
  3.11 User/Kernel Unification
  3.12 Protection & Security
    3.12.1 General Security
    3.12.2 Protection in a Decomposed System
    3.12.3 Global Reconfiguration and Security
    3.12.4 Local Reconfiguration and Security
  3.13 Scheduling
  3.14 Implementation Summary
  3.15 Service Performance
    3.15.1 Kea IDC Performance
    3.15.2 Comparisons With Other Systems
  3.16 Architecture Summary
Chapter 4 High Level Services
  4.1 Disk Driver Service
  4.2 Buffer Cache Service
  4.3 Filesystem Services
  4.4 File and Mount Service
    4.4.1 The Mount Service
    4.4.2 The File Service
  4.5 File Service Composition
  4.6 Compressed Filesystem Service
  4.7 Networking Services
  4.8 Other Services
    4.8.1 Syslog Service
    4.8.2 Unique Identifier Service
    4.8.3 Console Service
    4.8.4 Keyboard Service
  4.9 High Level Service Summary
Chapter 5 Performance and Reconfigurability
  5.1 Experimental Overview
    5.1.1 Experimental Hardware
  5.2 FFS Performance
    5.2.1 Reconfiguration and Performance
    5.2.2 FreeBSD Comparison
  5.3 FAT Performance
  5.4 Service Migration and Replacement Costs
  5.5 Service Interposition
  5.6 Conclusions on Reconfiguration Performance
Chapter 6 Application Specificity Evaluation
  6.1 In-Kernel Applications
  6.2 Application-Specific Extensions
    6.2.1 The Read-Ahead Service
    6.2.2 Buffer Cache Management
  6.3 Conclusions on Application Specificity
    6.3.1 Performance Issues
    6.3.2 Security Issues
    6.3.3 Policy Issues
    6.3.4 Application Specific Summary
Chapter 7 Conclusions and Future Work
  7.1 Conclusions
  7.2 Future Work
    7.2.1 Further Development
    7.2.2 Finer Grain Decomposition
    7.2.3 Improved State Migration
    7.2.4 Dynamic Code Generation
    7.2.5 IO Buffer Manipulation
    7.2.6 Service Format
    7.2.7 Application Specific Extensions
Bibliography
Appendix A Kernel Service Interfaces
  A.1 Domain Interface
  A.2 VM Interface
  A.3 Event Interface
  A.4 Name Interface
  A.5 Thread Interface
Appendix B High-Level Service Interfaces
  B.1 IDE Interface
  B.2 Bcache Interface
  B.3 File Interface
  B.4 Mount Interface
Appendix C Scheduler Interface

Acknowledgements

As with any effort that takes almost five years, a lot of people have contributed in one way or another. It feels as if I could write another thesis just on those who have helped along the way, whether it be academically, socially, or just by being there. Certainly one page isn't enough to do you all justice. Most important to the thesis was my supervisor, Norm Hutchinson.
Norm was a never-ending source of advice and encouragement, and has become not only a mentor, but a friend.

Several people have helped in implementing Kea. Andrew Agno, Geoff Burian, Davor Cubranic, Peter Smith and Christian Vinther each contributed something. Peter Smith deserves extra mention as office-mate, wild ideas listener and proof-reader.

My parents, Ron and Isobel Veitch, deserve special thanks for always encouraging me in learning. My brother John and his wife Anita, and my in-laws, Alan and Gael Jellyman, Jeremy and Andrea Blackmore, and Spencer Jellyman, have always been supportive.

I've shared some of the best times away from campus with my various climbing partners. For the opportunity to escape, and the joy provided, thanks to Adrian Burke, Kory Fawcett, Patrick King, Lee Purvis and Murray Rudd.

Many, many other people have provided help, a positive environment, or a chance to escape the thesis for a while. In alphabetical order, you are: Don Acton, Kelly Booth, Dave Brent, Ian Cavers, Terry Coatta, Richard Dearden, Mike Donat, Adrienne Drobnies, Brad Duska, Mike Feeley, Dave Finkelstein, Jean Forsythe, Alain Fournier, Murray Goldberg, Mark Greenstreet, Scott Hazelhurst, Iris Koch, Andrew Lesperance, Dwight Makaroff, David Marwood, Lucy Meagher, Roland Mechler, Holly Mitchell, Gerald Neufeld, Raymond Ng, Margaret Petrus, Carlin Phillips, George Phillips, Peter Phillips, Frank Pronk, Jackie Purvis, Jennifer Ryan, Mike Sanderson, Chris Simpson, George Tsiknis, Alan Wagner, Sue and Wally Walker and Carol Whitehead.

Last, but definitely not least, my wife, Dallas, deserves thanks for always being there for me.

ALISTAIR C. VEITCH
The University of British Columbia
July 1998

Chapter 1 Introduction

1.1 Operating System Challenges

The operating system is the interface between a computer's hardware and the applications that run on that computer.
As a consequence, operating system designers must continually respond to demands made by new and faster hardware, the applications which require access to that hardware for increased performance, new application technologies and the various classes of people who will be using the system.

Since the development of the first electronic computers, computer hardware technology has advanced rapidly. Modern computer systems have more and faster processors, larger and faster memories and disks, faster and wider buses, and higher network bandwidths with lower latencies. In the past five years alone, the speed and/or size of all of these technologies has grown by at least an order of magnitude, and this growth shows no sign of slowing for at least another decade. As a consequence, the assumptions under which an operating system was designed and developed may become less valid, or even false [Rosenblum et al. 95]. Differing rates of development in computer subsystems may also lead to a shifting of performance bottlenecks. As examples, consider that inter-process communication (IPC) speed, while still very important, has been shown to be increasingly dominated by cache design [Bershad 92], or that many filesystems restrict the maximum size of files or the filesystem itself to some size (e.g. 4 GB) which is smaller than the size of modern disk drives.

The original development of timesharing systems necessitated software techniques through which the hardware could be not only used, but also managed. In particular, the need to share hardware resources among multiple users, and provide security to those users, resulted in a great increase in system software complexity. New hardware may also require the operating system to undergo redesign. The transition from 32 to 64 bit processors required extensive redesign of several operating systems.
Initial versions of Unix were designed to accommodate only one sort of filesystem, necessitating a revision when several filesystem types needed to be supported [McKusick et al. 96]. Currently, gigabit networking technology is forcing reevaluation of the design of low level network layers.

The demands made on operating systems by modern applications are at least as important a consideration as those made by hardware. Multimedia applications require large amounts of storage, high-bandwidth access to that storage, and fast, low latency network access. Database systems require efficient access to extremely large storage. An increasing dependence on distributed applications requires the operating system to support new computing paradigms. Mobile computing is forcing operating systems to deal with a dynamically changing operating environment and disconnected operation.

Applications may also require services that have not been foreseen by operating system designers, or that have been designed in such a way as to be incompatible with the application's needs. As examples, consider that database systems often implement, or reimplement, services normally performed by the operating system, such as threads, memory management, or filesystems. Real-time applications demand specialised scheduler support and performance guarantees that most general purpose operating systems cannot supply. The page-replacement algorithms of most operating systems interact badly with the memory access patterns of applications that must do significant garbage collecting, such as Java, Lisp or Emerald.
The disparate types and requirements of applications essentially force the operating system designer to make a choice between trying to satisfy the needs of a specific class of application, and consequently producing a system that may not perform well on other applications, or trying to build a general purpose system that will work with a large number of application classes, but will not provide optimal performance for any of them. It is impossible for any group of designers to foresee all possible application demands.

Other challenges to an operating system designer relate to the conflicting needs of various classes of user. Generally speaking, application developers do not, on the whole, care about the underlying system structure. Their only demands are that the system provide them with the abilities (programming interfaces) to develop applications, and that the system be efficient. This can be contrasted with system developers, who want to be able to easily reason about a system, and simplify the process of developing and debugging code to be incorporated into the system. Often these two viewpoints are in conflict - a more modular system structure may benefit designers, but may not support the efficiency requirements of developers. Other users also have different requirements of an operating system. System administrators want to be able to control the configuration of their systems, and in particular want to be able to easily change that configuration when necessary, e.g. when new software releases become available or a machine has new hardware added. Researchers wish to be able to experiment with totally new components which may incorporate new concepts, but do not wish to have to change the entire system to accommodate this experimentation.
1.2 Problem Definition

An operating system design - the paradigm which dictates how the various components making up the system are structured, the dependencies between them, and the means through which they interact - should be such that the operating system itself is flexible, is able to accommodate new hardware and software technologies, and is able to meet the demands made by specific applications and users. Given the demands made by new hardware and software technologies, it is desirable that operating systems be capable of evolving in a timely manner. In particular, the operating system developers must be able to quickly and easily design, implement, debug and install new operating system services, or modify existing services. System administrators also need to be able to easily reconfigure systems. Application developers require a system that lets them extend the system in order to give their application the best performance possible. Building a system that satisfies all of these requirements was the goal of the research described by this thesis. In particular, the thesis describes the design, implementation and evaluation of a reconfigurable and extensible operating system. For the purposes of this thesis, these terms are defined as:

Reconfigurable: A system is reconfigurable if the components comprising the system can be rearranged in some manner. With respect to operating systems, this is restricted to mean the replacement of services, or the migration of services between address spaces. These operations may be done to enhance the system's performance, to replace components which have been found to contain errors, or to provide richer functionality.

Extensible: A system is extensible if new functionality can be added to that system. With respect to operating systems, this concept is further subdivided.
General or global extensibility refers to the ability of system administrators or developers to add new services, which then become available to all users of the system. This is closely related, but not identical, to reconfigurability. Application specific extensibility is the ability of application developers to insert code that replaces or modifies the behaviour of a service, for a single application only, with other applications continuing to use the default system provided service. This allows applications to increase both their own performance and the total system throughput.

It should be noted that an important property of both reconfigurable and extensible systems is that these actions can be performed dynamically at run-time. A reconfigurable and extensible system has many tangible advantages. For developers, it eliminates the need to rebuild and reboot a system when changing any portion of the code. Instead, new services can be developed, compiled and downloaded directly into the kernel, replacing older versions. When services are still unreliable, and need debugging, they can be installed into their own address space, isolating them from other parts of the system. Once debugged, they can be incorporated into the kernel itself. This saves time previously needed to recompile, reinstall and reboot a system. Administrators can also install services in order to support new hardware, without having to recompile an entire kernel, or can reconfigure a system in order to meet changed operating requirements. Costly downtime can be avoided by the dynamic installation of system upgrades. Application developers can modify the system, so that their application can increase its performance.

There are several properties that are desirable for a reconfigurable and extensible system:

• Fine grain construction - system components must be built on as fine a grain as possible.
This lets the system develop in an incremental manner, and restricts the scope of changes to those strictly necessary for the desired extension.

• Flexible communications infrastructure - the various components of the system must be able to communicate with each other, and the communications system used must be capable of supporting changes to the components, including those made at run-time.

• No performance degradation - compared to conventional operating systems offering similar functionality, there must be little, or no, performance degradation. As performance is often the yardstick by which operating systems are measured, this is a critical property.

• Run-time service replacement - for the resulting system to be truly flexible, it must be possible to dynamically replace parts of the system. This is also important for rapid development, as dynamic replacement removes the necessity for a system rebuild and reboot with every change to a service's code.

• Application transparency - changes made to kernel services should be transparent to both the applications using the system, and to other kernel services (even in the case where they are using the service which is being changed).

• Security mechanism - there must be some means through which arbitrary users or applications are prevented from modifying vital system components, but which also allows them to make changes that are deemed to be "safe" by the system's administrators.

• Protection mechanism - there must be some means through which the system is protected from damage by dynamically inserted code, particularly when that code has been provided by an application.

• Ease of use - the interface to install and modify system components should be relatively easy to use. Additionally, there should be support for determining the current system configuration.

1.3 Research Contributions

An operating system, Kea¹, incorporating the properties listed above has been developed.
During implementation, it was also observed that the structure of the kernel itself strongly encouraged the development of system services in a highly modular and structured fashion. Based on these results, the following thesis statement is made:

An operating system based on modular system services and incorporating a flexible communications infrastructure with run-time binding can be built without sacrificing performance. Kernels employing this paradigm are fully reconfigurable and extensible, as services can be dynamically installed, migrated and replaced, both globally and on an application specific basis.

Kea is the first operating system that satisfies all of the properties listed in the previous section. The important innovations that make Kea unique are its communications structure, which allows transparent run-time binding, the capabilities for dynamic modification of the system's structure, and the unification of the kernel and user programming environments, which further enhances Kea's flexibility, as well as conferring substantial software engineering benefits. Kea was also designed with the goal of application specificity in mind, and demonstrates that this can be accomplished with this type of system architecture. The thesis also defines some of the inherent problems associated with any form of application specificity, due to the conflict between an application's local resource requirements, and the need of the operating system to balance these with both the global resource availability and the demands of other applications. The thesis determines that some types of system changes only make sense when they are global, and then only for highly specialised applications which will be the primary load on the machine in question.

1. The Kea is a native New Zealand bird, with several interesting properties of its own [Temple 94].
Another thesis contribution is the specification of one possible modular decomposition of operating system structure, and the identification of those parts of the system which "cut across" all other such boundaries, or are fundamentally tied to the system in such a way that it is impossible to replace them without major changes to all the other parts of the system. The most important system components with this property are the scheduler and protection mechanisms. In the case of scheduling, the thesis specifies a unique low-level interface that enables the construction of many types of scheduler. This interface is sufficiently complete that development of a new scheduler is relatively easy. It also reduces the number of functions that depend on the properties of any individual scheduler to two, which can be easily found and modified in any code depending on scheduling properties.

With regard to protection and security¹, the Kea architecture proposes that these concepts are separate, and are in fact largely orthogonal both to each other and to the modularity of the system as a whole. While it has been successfully argued that modularity and protection should be separate [Druschel et al. 92a], as the partitioning of the system into modules is a matter of configuration, not design, Kea is the first design that also tries to isolate the security model of the system. Instead of implementing a set security policy, Kea isolates security information into a small number of structures that are defined by the system designer. By only providing base functions that treat these as anonymous structures, designers can implement as much (or as little) security checking as they desire. Hooks have been set in the code wherever security decisions need to be made, and, as proof of concept, a simple Unix-like protection system has been implemented and used where deemed appropriate.
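The idea of anonymous security structures with designer-installed hooks can be sketched in C. This is a minimal illustration, not Kea's actual API: the names `secdata`, `sec_check` and `sec_set_hook` are invented here, as is the trivial uid-based policy.

```c
#include <assert.h>
#include <stddef.h>

/* Opaque security structure: the base system never inspects its
 * contents, it only threads the pointer through to the hook. */
typedef struct secdata secdata;

/* Hook installed by the system designer; a NULL hook means no
 * security checking at all. */
typedef int (*sec_hook)(secdata *subject, secdata *object, int op);

static sec_hook check_hook = NULL;

void sec_set_hook(sec_hook h) { check_hook = h; }

/* Base function called wherever a security decision must be made. */
int sec_check(secdata *subject, secdata *object, int op)
{
    if (check_hook == NULL)
        return 1;               /* permit everything by default */
    return check_hook(subject, object, op);
}

/* Example designer-defined contents and policy: a trivial
 * Unix-like owner-or-root check. */
struct secdata { int uid; };

static int uid_policy(secdata *s, secdata *o, int op)
{
    (void)op;                   /* this toy policy ignores the operation */
    return s->uid == 0 || s->uid == o->uid;
}
```

The point of the structure being opaque is that `sec_check` callers compile against no particular policy: swapping `uid_policy` for something richer (or for nothing) changes no call sites.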
Kea also directly supports, or substantially eases the development of, several other capabilities, including, but not limited to, interposition agents [Jones 93], operating system emulation [Golub et al. 90, Malan et al. 91], shared libraries [Gingell 89, Sabatella 90, Orr et al. 93], and continuous operation [Anderson et al. 92].

1.4 Design Overview

There are three important facets of the Kea design. The first of these is the overall system structuring, i.e. how the individual system components are coalesced into the whole. The second is the communications structure that binds these components. The third is the specification of the functionality for extensibility and application specificity. Each of these is examined in the subsequent sections.

1. Protection is regarded as the hardware enforced means by which parts of the software system are separated from each other, typically implemented using address spaces. Security is the set of policies used to determine user access to various parts of the system, and may be implemented using either hardware protection or software.

1.4.1 Overall Systems Structure

Before discussing the structure of the Kea system, it is necessary to briefly review some of the existing paradigms for operating system structuring. The issue of how best to design, configure and build an operating system kernel continues to be a somewhat contentious issue in the research community. While many paradigms are possible, the primary two are the monolithic and microkernel designs.

Traditionally, operating system kernels have been constructed as a single monolithic image, compiled from a large body of kernel code. This code is often many hundreds of thousands (or even millions) of lines and implements many disparate sets of functionality, such as virtual memory, process management, network protocols and file systems, many of which interact in subtle ways.
Additionally, kernel code is often inherently more complex than application code, due to synchronisation, scheduling and timing constraints, and the need to manipulate privileged processor state. For the same reasons, kernel debuggers are also more difficult to develop and use. Because of this complexity, kernel code requires a proportionally higher level of maintenance and development than ordinary, or application, code.

Partly in order to combat the cost of monolithic kernel development, another kernel design technique, microkernels, has been pursued by researchers. As the name implies, microkernels are designed to be "small", where the kernel itself provides only the needed functionality to implement "servers" - programs that run in their own address spaces in user mode, and provide the system with those services not provided by the microkernel. The microkernel design philosophy offers the developer advantages in modularity, as different services, e.g. file systems and network protocols, can be implemented in separate servers, and can often be debugged as user level programs, using standard debuggers.

Despite their advantages, microkernels incur a performance penalty that, to a large extent, renders them unpalatable to many developers, for whom performance is all-important. This performance penalty arises out of the means through which different servers communicate with one another. In a monolithic kernel, this is done with procedure calls, but this is impossible within a decomposed system, where various parts are running in different address spaces. Instead, microkernels must use some form of inter-process communication (IPC) between servers and kernel. This usually takes the form of message passing, often with some type of remote procedure call (RPC) [Birrell 89] layered on top.
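The layering described above can be made concrete with a toy sketch of message-based RPC in C. Everything here is illustrative (the message layout, `rpc_send` and the doubling operation are invented, not taken from any real microkernel); in a real system the copy and dispatch steps would be cross-address-space copies and context switches.

```c
#include <assert.h>
#include <string.h>

/* A toy request message.  In a real microkernel this buffer would be
 * copied between address spaces, with a context switch on send and
 * another on reply. */
struct message { int op; int arg; };

/* The server's dispatch routine: decode the message, perform the
 * requested operation, produce a reply value. */
static int server_handle(const struct message *m)
{
    switch (m->op) {
    case 1:  return m->arg * 2;   /* some service operation */
    default: return -1;           /* unknown request */
    }
}

/* Stand-in for the kernel's send/receive path: copy the message
 * "across" the boundary, then dispatch to the server. */
static int rpc_send(const struct message *m)
{
    struct message copy;
    memcpy(&copy, m, sizeof copy);    /* models the cross-space copy */
    return server_handle(&copy);      /* models switch + dispatch */
}

/* Client-side RPC stub: marshal the arguments, send, return reply.
 * All of this machinery sits between the caller and the procedure
 * that does the actual work. */
int double_it(int x)
{
    struct message m = { 1, x };      /* marshal */
    return rpc_send(&m);              /* send and wait for reply */
}
```

Even in this toy form, the caller pays for marshalling, a copy, and a dispatch step on every call, which is the overhead the surrounding text contrasts with direct procedure calls.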
The disadvantages imposed by this architecture - where passing messages often means marshalling arguments, composing the message, copying the message between address spaces and context switching between those address spaces - limit the performance to be strictly less than that of an equivalent monolithic system, even for very highly tuned implementations [Hartig et al. 97].

Microkernels do, however, have the promise of increased flexibility, due to the decomposition of a single kernel into multiple servers. In reality, this promise is seldom realised, as a typical system only includes one server, providing all the standard operating system functionality not provided by the microkernel. Compared to the equivalent monolithic systems (from which they are normally derived), these "single-server" systems are no more flexible or extensible, have the same internal complexity, and often consume more resources. Furthermore, when the systems are decomposed, the performance penalties due to IPC costs become more important, because of the greater amounts of cross-address space communication.

The architecture proposed in this thesis resolves much of this conflict between modularity, system decomposition, and performance, by implementing a hybrid scheme. The system is composed of a set of services, which together implement a complete operating system. A service is essentially a well defined interface through which some entity can be accessed and asked to perform a task for the caller. There are several ways to view a service. As seen by the programmer, there are two basic parts to each service, an interface and an implementation. The interface describes the procedures, constants and data types that the service offers to clients, while the implementation refers to the compiled code which implements the interface. A reasonable comparison to make is that the interface is analogous to a C header (".h") file, while the implementation is like a library. As an example, the set of procedures that manipulate the virtual
memory of an address space might be grouped into a single service, the "vm" service. The service guarantees to provide the given procedures, with specified semantics. From the application's viewpoint, a service appears as a number of pointers to functions - one for each of the service entry points. By dereferencing these pointers and calling the specified function, the service can be accessed. To both the programmer and application, this appears as a local function call - the underlying function pointer implementation is effectively invisible¹. In the case where the services are in separate address spaces, the functions pointed to are automatically generated stubs, which call the underlying communications system. Where the services are co-located (that is, located in the same address space), the destination function address is called directly. The mechanisms for this are discussed in sections 1.4.2 and 1.4.3, on the communications structure and extensibility features, respectively.

Services are built up in a highly modular fashion, and each can be run in a separate address space. However, for performance purposes, services can be co-located into any other address space, including the kernel. In the case where a service is loaded into an address space in which another service is already resident, the underlying communications system optimises any inter-service interactions into the appropriate procedure calls, thus ensuring optimum performance. This is transparent to both the services and their developers. While some other systems [Rozier et al. 92, Lepreau et al. 93, Condict et al. 94] have allowances for co-location, Kea is the first to make this completely transparent. Additionally, Kea makes this co-location dynamically available. At any time, services can be migrated between address spaces, transparently to their clients.
The same facility also allows services to be dynamically replaced, or removed from the system entirely (although the latter may not be advisable if other services depend on the one being removed). The dynamic run-time binding of services is what gives the Kea system its extensibility, as at any time services can be extended, or new services added, in order to increase the system's functionality or capabilities. It is also highly beneficial to system developers, as the operating system no longer needs to be rebuilt and rebooted every time a service's code is changed.

[1] This is at least true for the C language, in which Kea has been developed.

1.4.2 Inter-Service Communications

As described, services essentially appear as a set of procedures, which can be invoked by other services or applications in order to accomplish some task. This argues strongly for a communications paradigm that directly supports procedure calls. This can be contrasted with typical microkernels, which use message passing, usually with an additional RPC layer. Although message passing is arguably more general, there are several advantages to directly supporting procedure calls:

• Familiarity - procedure calls are more familiar to the average programmer. Directly supporting them allows for easier development.

• Efficiency - while many systems can hide the underlying message passing mechanism using automatically generated stubs that marshall and unmarshall procedural arguments, this also introduces a layer of unnecessary inefficiency. By making "procedure call" the IPC mechanism, some cost is incurred due to the more complex operation (as opposed to passing a simple contiguous message), but an ultimate gain in efficiency is made by getting the kernel to perform tasks normally done (and duplicated) in every server in a conventional microkernel.
Consider the section of pseudo-code below (Figure 1.1), that could come from a server in such a system:

    while (true) {
        receive message;
        decode message;
        call procedure to perform operation;
        send reply message;
    }

Figure 1.1: Server Pseudo-code

In the Kea system, this loop, and its receive/decode/send components, are entirely eliminated, as these operations are carried out in the kernel, with the service procedures being invoked directly.

• System throughput - When a call is made, the thread making the call effectively transfers into the destination address space. This is accomplished through a mechanism akin to the "migrating threads" model used in Mach 3.0 [Ford & Lepreau 94] and Spring's "doors" [Hamilton & Kougiouris 93]. These papers describe several performance advantages of this idea, due to the lack of scheduler interaction, simpler code paths, and reduced thread interactions. This model also removes any artificial barriers to the parallelism inherent in the service. Instead of only processing one request at a time (as in Figure 1.1), any service can have a potentially unlimited number of threads executing in parallel within itself, allowing greater system throughput in many cases. This natural parallelism also goes some way towards reducing priority inversion problems caused by high priority threads having to wait on low priority threads whose request is currently executing in the server. With Kea the high priority thread can run in parallel with the low priority one, and is not blocked from entering the service.

An IPC mechanism that directly supports procedure call also considerably simplifies the construction of the Kea system as a number of independent services. As each of the services is specified by an interface, which in turn consists of a number of procedural entry points, the generation of client-side stubs is simple.
More importantly, the stubs themselves are very short, and therefore efficient[1].

[1] Stub generation and structure are discussed in detail in Sections 3.8 and 3.10.1.

Kea's IPC mechanism also considerably eases the implementation of service co-location. When being built, each service is compiled into a single object file (the implementation file). When loaded into a new address space, the implementation file is dynamically linked against any necessary libraries. When the service is loaded into, or migrated to, an already extant address space, it is first dynamically linked against the existing application code and/or service implementations in that address space. This prevents any duplication of routines within address spaces, while keeping service code separate. To accomplish this, the kernel must keep symbol information for each address space and service available, but this has been found to be only a minor cost - for example, the storage space needed for the symbols in Kea's standard C library is only 5 Kb.

As a performance optimisation, the kernel detects when services are co-located, and optimises the service invocations into a single indirect procedure call, eliminating the kernel trap overhead completely. This is easily accomplished because, as described, the access points to each service are kept in function pointers. Because the kernel has access to the symbol table which specifies these pointers, and knows the names of the procedures comprising the interface, it can adjust the pointers to point to the service directly, rather than to the client stub. This alleviates one of the standard problems with microkernel systems, namely that the performance overheads imposed by IPC/RPC are too costly.
1.4.3 Extensibility

The structure of the system as a number of independent services, combined with some extensions to the service loading mechanisms described above, allow the implementation of a number of features which make Kea extensible. These features are referred to as service migration, replacement and interposition.

Service Migration

As the system is used, it may be desirable to move a service into another address space. The primary reason to do this is performance. If a service can be migrated into the address space of the most frequent client of the service, then the time needed for each service invocation can be reduced by the time taken for changing address spaces, which is a potentially expensive operation. Alternatively, invocations into the kernel are far cheaper than those into another address space. Thus, for trusted services such as device drivers, filesystems, and network protocols, it is desirable to move, or migrate, these into the kernel once they have been debugged. This offers several advantages to the service developer, particularly if they are developing kernel services. Firstly, the strictly defined service interfaces enforce modularity, which makes the system as a whole more maintainable. Secondly, it is not uncommon for newly developed kernel services to cause a system crash, necessitating a tedious and time-consuming program/crash/reboot development cycle. Within Kea, services can initially be developed within a private address space, which restricts the crash (if any), or consequences of a programming error, to that address space. Finally, since the service is running in a separate address space, it can be easily debugged with standard tools.

Supporting service migration between user and kernel address spaces has the interesting consequence of unifying the user and kernel programming interfaces.
This is required, as to be executable in both environments, the environments themselves must both conform to the same interfaces. In particular, the behaviour of synchronisation, memory allocation, and asynchronous event handling have to be identical. This unification has the beneficial side effects of reducing the complexity of the kernel programming environment (at the one-time cost of increasing the complexity of the low-level kernel code), and simplifying the software engineering process, particularly in documentation and testing.

Service Replacement

Given the ability to migrate services between address spaces, it is simple to also support the dynamic replacement of services, as almost exactly the same operations need to take place. Replacement is actually slightly easier, as there is only one address space involved. There are three reasons why a developer or system administrator might wish to replace a service:

• Performance - compared to the original, the replacement service may offer improved performance. This could be as simple as a reduced memory footprint or smaller CPU usage, or as complex as a changed trade-off in efficiency between various procedures in the service (e.g. a faster speed for some of the frequently invoked service procedures, at the cost of slower times on some of those that are used less often).

• Correctness - the replacement service may fix a bug in the original service. Replacement allows the system administrator to ensure system correctness, without potentially expensive down-time. It would also be possible for system vendors to ship upgrades to the system, without requiring the system administrator to perform a full system rebuild and installation.

• Testing - During development, services must be tested and debugged. Service replacement enables this process to proceed in a more timely manner, as faulty services are easily replaced.

Figure 1.2 illustrates service migration and replacement.
Figure 1.2: Service Migration and Replacement. Dotted boxes indicate address spaces. Thick arrows are service operations, thin arrows are calls on services.

In this figure, several services and an application (solid boxes) are shown as existing in several different address spaces (dashed boxes). Thin arrows indicate calls made on services. Calls that cross address space boundaries will be accomplished using a form of RPC. Other calls, within address spaces (such as that shown between services B' and C) will be optimised to direct procedure calls. Two of the basic service operations that make Kea extensible are also shown (thick arrows). The first, service migration, is when a service is moved transparently between address spaces. In this case, service A can reside in either its own or the kernel's address space. The second, service replacement, shows service B being replaced by B' (i.e. service B will no longer exist). These changes are transparent to both the service clients and the services themselves. Calls between services are optimised when possible. Thus, after replacement, service B' will now make upcalls [Clark 85] to C, but will have its calls to D optimised.

Service Interposition

Service interposition refers to the ability to interpose a new service between any existing pair of services. Interposition can be usefully applied in many areas. At the interface between the operating system and applications alone, [Jones 93] identified several possible uses of this type of facility:

• Call tracing/monitoring - an interposed service can record or monitor an application's use of the underlying service.

• Operating system emulation - a service can be used to translate calls made by an application written for a "foreign" operating system into those used by the native system.
• Protected environments - a service wrapper can be developed that limits the actions of an untrusted binary.

• Alternate/enhanced semantics - a service could offer an interface that extends that of the underlying service, e.g. a transaction-based file system.

By interposing services at different points in the service hierarchy (not just at the upper layer), and allowing this to occur dynamically, the set of applications to which interposition applies is greatly expanded. In particular, the primary uses envisaged for service interposition are in interposing on system services that implement policies. For example, a service that, when interposed on Kea's buffer cache, implements preferential caching of specified file blocks has been developed.

The first parts of Figure 1.3 show an example of a simple service interposition. Various service configurations are shown from left to right. The initial configuration, (a), shows two applications (A and B), using a chain of services S1 through S3. In (b), a new service, S4, has been inserted between the applications and S1. S4 offers the same semantics as S1, and the applications will not be able to tell that the underlying service structure has changed, except for a possible performance increase. As an example, if S1 was a filesystem, S4 could offer compression services, transparently uncompressing and compressing the data for read and write calls respectively. Another example could be adding an encryption layer onto a standard network protocol.

Figure 1.3: Service Reconfigurations

1.4.4 Application Specificity

Kea also specifies that each of the above operations can take place on an application specific basis.
That is, if an application wishes to provide its own service, equivalent to one already existing in the system, but having different performance characteristics or enhanced semantics, it can specify the use of that service for its computations only. This ability can be applied both to services on which the application directly depends, and also to services further down the application's call chain. This capability lets the application effectively install its own code into the system, in order to enhance its own performance, while leaving other system clients unaffected.

Part (c) of Figure 1.3 shows an application specific interposition. Here a service S5 has been inserted, taking the place of S2 for all calls from application B. Any calls that originate in application B (shown by dotted lines), as well as those made on its behalf by other services, are redirected through S5. This is transparent to all other clients. In particular, there is no way for a service to tell that its calls will be directed to another service - it continues to make the same calls, and the underlying communications infrastructure handles the destinations.

1.5 Thesis Organisation

The remainder of this thesis is organised as follows:

• Chapter 2 is a detailed survey of related work, and contrasts this work with Kea. In particular, the evolution of some of the important concepts is discussed, together with a comparison with their expression in other systems.

• Chapter 3 describes the low-level details of the Kea kernel architecture. It focuses on the organisation of the system as a whole, and provides details on the design and capabilities of the core services. In particular, the IPC mechanism and service manipulation functionality is discussed in detail. The special facilities needed for the unification of the user and kernel programming environments are also described. A section deals with the problems of protection, both in general terms, i.e.
how it can be provided in a decomposed system, and in terms of the special capabilities needed for the extensible and application specific features. Another section describes scheduling functionality, and how different scheduling code can be made, if not dynamically replaceable, at least easy to develop and configure into the system. The chapter concludes with a discussion of the performance of the base services.

• Chapter 4 describes some of the higher level services built using this architecture, concentrating on the design of the Kea filesystems and their supporting services. It then discusses how the design of these services both demonstrates and facilitates Kea's flexibility.

• Chapter 5 describes several experiments which measure the performance of the system for each of the developed services. These experiments allow an evaluation of the system reconfigurability to be made.

• Chapter 6 discusses the application of the Kea architecture to the problem of application specificity. Several experiments and their results are described and discussed. The chapter concludes with a discussion on the limitations of application specificity both for Kea specifically, and operating systems generally, and considers how these might be overcome.

• Chapter 7 concludes the body of the thesis, summarising the experimental results and main contributions of the research. The possibilities for future work arising out of the thesis are also discussed.

• Appendix A provides details on the interfaces to selected low level Kea services.

• Appendix B describes the interfaces to selected high level services.

• Appendix C details Kea's internal scheduler interface.

Chapter 2

Related Work

Several of the concepts on which Kea is based are similar to, or have been extracted from, a number of related systems.
These can be (roughly) grouped into two different classes - systems that influenced ideas on operating system structure and general extensibility, and those that were explicitly designed for application specific extensibility. This chapter describes influential systems in both areas, and examines where they differ from Kea.

2.1 System Structure

Since the advent of computer systems, the question of how best to structure the operating system has been of interest to systems architects and researchers. As described in the introduction, the default, monolithic kernel approach, in which all kernel code is built and configured into a single image, does not satisfy all the needs of modern systems for flexibility and extensibility. The key structural ideas that Kea is based on can be identified as:

• Fine grain modular construction

• Procedural IPC

• IPC optimisation when services are co-located

• Transparent service migration and replacement

• The orthogonality of protection, modularity and security.

The following sections examine other systems that incorporate some of these, or similar, ideas and compare those systems with Kea, pointing out both similarities and differences in approach or structure.

2.1.1 Multics

The Multics system [Organick 72, Corbato et al. 72] was extremely influential on systems design. Using special hardware, the GE 645, Multics was a system that depended heavily on segmentation. A Multics process consisted of a collection of segments, each of which had its own protection attributes, and could be shared with other processes. Thus, the Multics system let designers build system software as a set of segments, which were linked into applications as necessary. Every segment belonged to a protection ring, with inner segments (low ring number) protected from outer segments. The boundary between user and kernel mode was blurred, with only a few "supervisor" segments at ring 0 able to execute privileged instructions.
Other system segments provided functionality such as filesystems, and the application segments were composed with these.

There are two aspects of Multics that are pertinent to Kea. The first is the construction of the system as a set of segments. This is superficially similar to the Kea philosophy of system construction as a set of cooperating services. The major difference is in the binding between system components. In Multics, this binding is completely static - the number and types of segments comprising the system are fixed at the time the system is compiled and linked. In Kea, service binding is completely dynamic. Services can be manipulated in a number of different ways at run-time. Additionally, the protection in Kea is active, being based on address space location, and the current protection given to a service. Multics segments are statically configured with a protection ring, which cannot be changed. Finally, Multics relied on specialised hardware to support calls between segments and enforce segment protection boundaries, whereas the Kea system is architecture neutral.

The second major contribution of Multics was that there was no real differentiation between user and kernel modes, and in particular, the programming environment was the same for all developers. "It is worth reemphasizing that the only differentiation between Multics systems programmers and user programmers is embodied in the access control mechanism which determines what on-line information can be referenced; therefore, what are apparently two groups of users can be discussed as one" [Corbato et al. 72]. This simple, yet powerful, concept has been revitalised in Kea, where it offers significant development advantages. Concurrently with Kea, this idea has also been developed in the Rialto system [Draves & Cutshall 97], where significant software engineering advantages have also been observed.

2.1.2 Hydra

Hydra [Wulf et al. 74, Levin et al.
75, Wulf et al. 81] is, in many ways, the ancestor of both object-oriented and microkernel systems. In particular, Hydra was one of the first systems to experiment with the modular construction of operating systems, the "small kernel" idea, and the separation of policy and mechanism. This was accomplished by separating functionality into different "objects", and providing a capability based call mechanism between these objects. The call mechanism was very costly however, and, like Multics, Hydra was statically configured, which severely limited both its flexibility and extensibility.

2.1.3 Microkernels

While the "microkernel" description can cover a wide variety of systems, the term is generally presumed to mean a small kernel, exporting a minimal set of functionality, on which various higher-level "servers" can be built. Servers are developed in separate address spaces, and communicate using RPC. While the microkernel design facilitates the construction of a modular system, there are two important limitations with such a system with respect to extensibility. The first is that even the most aggressively optimised systems still incur a performance overhead with respect to the equivalent monolithic systems [Hartig et al. 97]. The second is that, typically, the servers implemented are few in number, and of large granularity. For instance, typical Mach based systems only implement a single server, which provides the entire range of normal operating system functionality [Golub et al. 90]. In general, the greater the number of servers, the worse the performance, due to the increased number of IPCs between servers. This may account for the fact that, to date, only one microkernel based system, QNX [Hildebrand 92], has achieved commercial success. However, several microkernel based systems, primarily Mach and Chorus, implement techniques that partially overcome these problems.

2.1.4 Chorus

The Chorus system [Guillemont et al.
91, Rozier et al. 92, Walpole et al. 92] supports supervisor tasks that can be co-located in the kernel address space. IPC between these tasks can be optimised to procedure calls, but this is not transparent to the code, and must be explicitly specified when the system is configured. While co-location gives Chorus some of the advantages inherent in Kea, the static nature of the system reduces both its flexibility and extensibility. Additionally, supervisor tasks have access to additional kernel interfaces that are not visible to other tasks. Coupled with the lack of transparency for IPC optimisation, the location transparency of Chorus servers is far less than that of Kea services.

2.1.5 Mach

Several versions of Mach support kernel co-location of privileged subsystems. In-kernel servers [Lepreau et al. 93] allow servers to be loaded into the kernel at run time. When this is done, calls between client and server are optimised from the high overhead RPC mechanism to use trap-based calls, which can execute substantially faster. There are several differences between this system and Kea. Firstly, and most importantly, there is no optimisation of server-server or server-kernel interactions. Secondly, this work concentrates on improving the efficiency of the existing Mach system, unlike Kea, where the focus is on providing a complete infrastructure for extensibility.

Another project [Condict et al. 94] also extends the Mach system, allowing server co-location in the kernel, and using a form of optimised RPC (although not direct procedure calls, as in Kea). Also, server to kernel calls are not optimised, as the Mach kernel continues to require certain calling conventions, unlike Kea, where all kernel interfaces have exactly the same calling semantics as other (non-kernel) interfaces. A further difference is in the RPC semantics.
This system builds RPC out of one-way messages, and optimises for this, whereas Kea assumes a round-trip procedure call, and provides this facility directly. Finally, while the issue of optimising server-server interactions is addressed, this facility has not been tested, as the system implemented uses only the OSF/1 single server.

Neither of the systems described above supports the arbitrary co-location and grouping of services; both are only concerned with locating servers in the kernel space. While this supports some common configurations, it may not suit others, where a system developer or administrator may want to place certain groups of servers in independent address spaces for reliability or testing purposes. Finally, these systems do not have any of the dynamic placement or migration features of Kea.

2.1.6 Lipto

Lipto [Druschel et al. 91, Druschel et al. 92a, Druschel et al. 92b, Druschel 93] is perhaps the system most like Kea in its design. Lipto is an object-oriented system, and the servers implementing objects can reside in any configuration of address spaces, including the kernel. When object servers are co-located, the object invocation mechanism detects this, and optimises the object invocations to use a local stub and indirection mechanism, resulting in near procedure call efficiency. Other than the minor difference of the RPC mechanism being object based, rather than procedural in nature, there are two major differences between Kea and Lipto. The first is the level of granularity at which RPC optimisation takes place. In Lipto, the unit on which sharing takes place is the object. For each object reference obtained by a client, the system has to provide an appropriate (either local or remote) proxy object. Kea optimises not on objects, but at the interface level - that is, one layer of abstraction higher.
This eliminates some overhead due to proxy objects, and also allows the system to optimise directly on the interface's procedural entry points. The second major difference is that Lipto's configuration is statically determined at boot-time, and is not dynamic like Kea's. It should also be noted that the Lipto project only progressed to the implementation of a prototype of the object invocation method. Kea is the first system to apply the principle of fine-grain decomposition within a fully functional operating system. Despite these differences, Lipto's authors deserve recognition for first observing the orthogonality of modularity and protection in operating system construction.

2.1.7 Spring

Spring [Hamilton & Kougiouris 93, Khalidi & Nelson 93a] is a distributed operating system that represents all system resources as objects, and supports a highly efficient IPC mechanism that supports object invocation. Like Kea, the system is microkernel based, with higher level services implementing functionality such as filesystems and networking. These services can be loaded into any address space, dependent on boot-time operating system settings. The Spring IPC mechanism has semantics very similar to Kea's, and a very efficient implementation. For small arguments (16 bytes or less of scalar data) it takes advantage of the SPARC calling mechanism, where parameters are passed in registers. For larger arguments (up to 5 Kb of data and capabilities), the "vanilla" path is used. A third, "bulk", path uses virtual memory remapping to transmit large data objects. The Kea IPC mechanism was largely based on the semantics of Spring's, but Kea provides only a single method of data transmission. While supporting a bulk data copying method within Kea IPC would considerably enhance the performance for some applications, this possibility has currently been relegated to future work.
The major limitation of Spring with respect to Kea is that Spring only supports a limited form of service co-location. While objects can be created in the same address space, the IPC mechanism is still used for invocation. While this does eliminate the cost of an address space switch, the full performance available from the automatic short-circuiting of calls is not available. This facility is vital to the Kea philosophy of giving the system designer the choice of safety or performance. Also, the Spring IPC mechanism does not directly support true procedure calls, but still relies upon argument marshalling, which incurs a further performance penalty. Finally, Spring doesn't support the migration and replacement of services, features which are necessary for dynamic system reconfiguration. Spring also has support for system extensibility through object composition. This is discussed in section 2.2.7 on page 34.

2.1.8 Single Address Space Systems

Encouraged by the increasing availability of 64-bit processor architectures, a number of systems have been implemented that investigate the possibilities of supporting an operating system and its applications within a single address space [Chase et al. 93, Bartoli et al. 93, Heiser et al. 93]. While these systems promote code sharing and modularity, and consequently enhance the flexibility of the operating system itself, none of the existing systems have used this for the investigation of application specificity. Also, these systems are, to date, all statically configured with regards to kernel/user boundaries and services.

An ancestor of the single address space systems, Psyche [Scott et al. 88] is similar to Kea in that it is composed of a set of modules. Psyche uses a single virtual address space to facilitate sharing between modules, trading protection for performance in much the same way that Kea does.
However, Psyche is statically configured, and does not provide a means for arbitrary reconfiguration or module replacement.

2.1.9 Protected Shared Libraries

Protected shared libraries (PSLs) [Banerji et al. 97] have been proposed as a means of efficiently enhancing the modularity of operating systems. PSLs extend the notion of shared libraries [Gingell 89] to include protected state (i.e. information that can only be changed from within the library code, not from the library client) and data sharing across protection boundaries. When calling a library routine within a PSL, a "partial address space switch" is performed, which maps (or unmaps, as appropriate) data shared between the library and the caller. It is argued that by using PSLs, operating systems can be made more modular, while increasing system protection. While this may be true, there is no essential difference between PSLs and any other structuring mechanism that offers data protection (such as those described above), and no facility for co-location when the library code is trusted. The latter ability is important in order to obtain the performance demanded by system clients, especially since calls into a PSL are just as expensive as any other IPC mechanism that must change address spaces.

2.1.10 The Flux OSKit

The Flux OSKit [Ford et al. 97] is a project that is designed to enable researchers to easily build operating systems - it does not necessarily define the structure that the operating system has to take. The OSKit provides a number of OS components, such as file systems, device drivers and virtual memory systems, that can be used in a "building block" fashion to build a complete OS. While not explicitly defining[1] an OS structure, the OSKit is relevant to Kea in that its components are analogous to Kea services, albeit statically configured.
The OSKit, in parallel with Kea, is one of the first systems to explicitly explore OS construction as independent2 subsystems, rather than dependent modules.

1. However, we would argue that a structure is implicitly defined by the nature of the components provided (all derived from Unix). Unfortunately, this is hard to avoid, and is partially true for Kea as well.
2. Or as independent as possible. Whenever a service is used by another, there is a dependency.

2.1.11 Conclusions on Structure

All of the systems discussed above share some concepts with Kea. In particular, the necessity for the modular decomposition of operating systems has long been recognised, as has the trade-off between modularity and protection, which different systems have addressed in different fashions. While not unique in its construction as a set of cooperating services, Kea is the only system developed to date that fully optimises the IPC between services when they are co-located. Kea is also the only system that supports true procedural IPC and the transparent migration and replacement of services at run-time, features necessary for full extensibility. The final comparison to be made with other system structures is that, as far as possible, Kea explicitly avoids a set security policy, recognising that this is orthogonal to both the protection and modularity of the system. To summarise, while Kea does borrow heavily from the best ideas inherent in other systems, particularly Spring and Lipto, it also has additional functionality that makes it uniquely suited to the construction of a reconfigurable system.

2.2 Application Specific Extensibility

How best to build a system that explicitly allows applications to insert code into the kernel has been an active research topic in recent years, with several different methodologies explored.
Equally important to the question of how code is inserted into the kernel is the problem of assuring that the code can be trusted, and will not compromise either the system security or performance. During the development of Kea, it was decided that this question was being adequately solved by other researchers (the systems demonstrating this are discussed in sections 2.2.2 through 2.2.4). Consequently, a final decision on which method(s) were appropriate for Kea was left open. The remainder of this chapter examines the techniques developed for code injection, and some of the representative systems using these methods, and compares them to Kea. Kea security is discussed later in section 3.12 on page 60.

2.2.1 Single User Systems

Many personal computer operating systems, e.g. MS-DOS or Windows, run both applications and the operating system in a single address space, which provides good performance, and gives applications free rein to arbitrarily modify all system code and data [Schulman et al. 92]. There are many disadvantages to such systems. There is no well defined interface through which modifications can be made, the system is not constructed so as to make arbitrary replacement of a particular service obvious, and the system has absolutely no protection from a buggy or malicious application, and no way of testing for this before it is used.

2.2.2 SPIN

SPIN [Bershad et al. 95, Pardyak & Bershad 96, Sirer et al. 96, Hsieh et al. 96] has been developed to explicitly support application extensibility. SPIN allows applications to install low-level system services into the kernel, so that they can receive notification of, or take action on, certain kernel events. These events may be as disparate as a page fault, the reception of a network packet destined for the application, or context switches affecting the application.
Kernel extensions are written in Modula-3 [Nelson 91], a type-safe, modular programming language. Due to these language properties, the compiler can enforce the safety of extensions at compile time. In particular, the compiler can guarantee that any extension will not reference any memory outside the module boundaries, except that for which it has been granted explicit access through an interface. By performing compilation at run-time, it also becomes possible to perform various optimisations on the code produced, further enhancing the efficiency of the system [Chambers et al. 96]. Modula-3 and compile-time checks enable the use of pointers as capabilities, which avoids the expensive run-time security checks inherent in other capability systems, whether hardware [Carter et al. 94] or software [Wulf et al. 81, Black et al. 92] based. SPIN's developers have demonstrated that this architecture is both efficient and practical to use in a number of cases. A possible problem with the SPIN system is with the event system. There may be semantic difficulties when multiple handlers are installed on a single event. While SPIN has features that can control whether handlers execute asynchronously or synchronously, which order they execute in, and which ones can return a result, it is certainly conceivable that multiple applications might install conflicting handlers. There are several differences between the SPIN and Kea systems. Firstly, the approach to code installation is fundamentally different. Where Kea specialisation is based around the idea of interfaces, and uses interposition and/or composition on those interfaces for specialisation, SPIN effectively uses single procedure remappings (events can be associated with any procedure invocation).
This makes SPIN extensibility even more finely grained than Kea's, but may also make some extensions more difficult to develop, where the functionality of several parts of a single interface needs to be managed as a single extension; this can only be evaluated after many different types of extensions have been developed. As an application extensible system, SPIN is definitely more mature, and arguably more functional, than Kea. SPIN does not, however, have any functionality supporting dynamic system reconfiguration or global extensibility (although it is possible that the same technology could be extended in these directions).

2.2.3 Vino

The Vino operating system [Seltzer et al. 96] has also been explicitly designed to support application specific extensions. Vino is object-oriented, and allows applications to dynamically insert code that replaces the implementation of a method on any object. Alternatively, code can be installed to run when a specified kernel event (such as a network packet arrival) occurs. The granularity at which the system can be extended is dependent upon the objects and events exported by the system. Like other systems, Vino is concerned with the safety of the downloaded code, and takes two measures to prevent this code from damaging other applications or the system. The first safety technique used is software fault isolation [Wahbe et al. 93, Small & Seltzer 96], or "sandboxing". Software fault isolation is a method whereby binary code is checked, and possibly modified, to ensure that any memory references generated by that code will be within certain boundaries (the sandbox). This can be done at either compile-time or load-time, and ensures that an object can, at worst, only affect its own operation.
The performance overhead associated with the generation of sandboxed code is relatively small, typically on the order of a few percent, although it is arguable that even this may be significant for some systems. To prevent denial of service problems (caused, for example, by an extension that obtains a lock and doesn't release it), Vino also employs a transaction-like system of extension invocation. Effectively, all extensions are wrapped with stub code, which treats all invocations of that extension as a transaction, and can "abort" the extension if it performs dangerous actions. This requires an expensive invocation service (over two orders of magnitude more costly than a procedure call), and also mandates that all kernel functions which change kernel state must have an equivalent "undo" operation. The complexity and performance overheads associated with the Vino transaction mechanism prevent it from being as efficient as either SPIN or Kea. Also, like SPIN, Vino does not include any features for either dynamic system reconfiguration or global extensibility.

2.2.4 Exokernels

Another approach to extensibility is taken by the Exokernel project [Engler et al. 95, Kaashoek et al. 97]. Exokernel design relies on the implementation of a very low level kernel, which does nothing except export the base abstractions provided by the underlying hardware [Engler & Kaashoek 95]. The abstractions provided by an exokernel permit applications running on the system to protect their hardware resources, and if necessary, share them with other applications. The remainder of the operating system functionality is implemented within libraries (libOS's), which are linked into an application's address space, and which provide and manage higher level abstractions such as filesystems. Extensions to the system are made by applications changing libOS code.
Exokernel designers have found it difficult to design systems that can safely multiplex physical resources, and, in the latest version of their designs, have found it necessary to provide a means by which small pieces of code, written in a restricted language, can be downloaded into the kernel, in order to specify the types and protection attributes of disk blocks. There are two major failings with the exokernel approach to extensibility. Firstly, the means by which applications specialise the system is static (extensions must be compiled into, or with, the appropriate libOS and/or application). Secondly, the extension mechanism is not well specified. In particular, there appear to be no well defined interfaces through which extensions may be carried out. If application designers do not wish to develop directly on the base exokernel interface (i.e. develop a new libOS), then they need full knowledge of, and code for, the existing libOS. In contrast, Kea relies on explicitly exposing the service interfaces comprising the system, and making explicit the mechanisms through which they can be manipulated.

2.2.5 Meta-Object Systems

Several systems, e.g. Pi [Kulkarni 93] and Apertos [Yokote 92], have proposed the use of an object-oriented operating system that uses the principles of reflective metacomputation [Maes 87, Kiczales et al. 91, Kiczales et al. 92] to provide a means through which the objects composing the system can be modified. These systems are constructed from fine-grain objects, representing fundamental system resources, and use the composition of these objects to build larger services. By providing a metasystem for manipulation of objects by applications, the developers argue that they can build incrementally modifiable systems. While the goals, and ultimate result, of these efforts may be quite similar to those of Kea, the methods by which they are achieved are quite different.
Also, the systems as designed use a much coarser breakdown than the Kea design, and it is anticipated that the overhead for supporting metaobject specialisation at the kernel level will be substantial. Finally, these systems only support a limited form of extensibility, as only existing objects can be modified - to add entirely new objects, the systems must be entirely rebuilt.

2.2.6 Synthetix

The Synthetix [Cowan et al. 96] operating system, and its ancestor Synthesis [Pu et al. 88, Massalin & Pu 89], provide enhanced application performance through run-time generation and optimisation of code which interfaces to operating system services. This is transparent to the applications using the system, and results in better performance, even if the application developer is unaware of this possibility. The techniques learnt from these systems have also been applied to a commercial operating system [Pu et al. 95]. In contrast to other extensible systems, Synthetix only makes extensions at the top layers of the operating system services, and does not provide any means by which applications may control their own resources. A further limitation is that the system for generating extensions can only optimise for those cases that the designers foresee - there is no way in which the future requirements of applications can be anticipated. A valuable experiment would be an evaluation of the Synthetix technology in the context of another extensible system, where the best of both explicit and automatic extensions could be applied.

2.2.7 Spring

The Spring system (previously discussed in section 2.1.7) is able to dynamically extend the system by layering new object servers upon existing ones, although this has only been explored in the context of file systems [Khalidi & Nelson 93b]. Kea extends this capability by not only letting services be interposed, but also replaced and migrated.
Additionally, it is possible to perform these actions on an application specific basis, and at any layer in the service hierarchy.

2.2.8 Other Approaches and Systems

Other approaches, such as interpreted code, have been suggested for system extensibility, but have been found to be very costly in terms of performance [Small & Seltzer 96]. Many researchers have investigated various facets of extensibility in restricted domains, such as file systems [Rees et al. 86, Bershad & Pinkerton 88], virtual memory management [Lee et al. 94, McNamee & Armstrong 90, Appel & Li 91, Harty & Cheriton 91], disk buffer caching [Cao et al. 94], scheduling [Anderson et al. 92], I/O subsystems [Fall & Pasquale 94], the user/kernel interface [Jones 93] and networking [Mogul et al. 87, McCanne & Jacobsen 93, Maeda & Bershad 93]. Each of these systems demonstrates the applicability of letting applications perform their own resource management. Kea represents an attempt to develop a system in which these ideas can be applied in a single coherent manner.

2.2.9 Conclusions on Application Specific Extensibility

Of the several other systems built for the explicit support of application specific extensions, some are arguably more functional, and certainly more mature, than Kea. The important differences are that, with Kea, the support of application specific extensibility is only one goal among many, and that the Kea architecture supports several other abilities that are unique. Also, Kea is the only implemented design that uses the explicit remapping of interfaces and an application specific IPC mechanism for the support of application specific extensibility. Further experimentation and evaluation of this architecture is necessary to validate (or repudiate) its suitability for this purpose.

2.3 Summary of Related Work

This chapter has examined many systems that are related to Kea in terms of their structure or function.
Kea has borrowed concepts from these systems where appropriate, and extended them where necessary or desirable, in order to build a configurable system. It has been shown that Kea is the only system that offers completely optimised cross-address-space communication for service co-location. Kea also has a different model of extensibility than other systems. Most importantly, it is unique in its ability to make service reconfiguration a dynamic operation, rather than a static one.

Chapter 3
The Kea Architecture

This chapter describes the low level details of the Kea architecture. These features, particularly service manipulation, are the fundamental basis around which the system's reconfigurability is built, and form the core of the thesis work. Later chapters will build on this knowledge in order to evaluate the system features. In Section 3.1, the chapter discusses the philosophy and organisation of the system. Sections 3.2 through 3.6 then detail the fundamental kernel services. Sections 3.7 through 3.10 cover the specification, creation and manipulation of services, while Sections 3.11 through 3.13 examine some of the problems associated with building a decomposed system, and the specific solutions used in Kea. The chapter concludes with a summary of the implementation effort in Section 3.14 and a discussion of the system's performance in Section 3.15.

3.1 Kernel As Services

As intimated in Chapter 1, the Kea kernel is not viewed as a single monolithic entity, but as a cooperating group of services. Each of these services has a well defined interface, which applications and other services use to access the resources provided by the service. Through a process of composition, more complex high level services are built up from primitive, low level services. The questions immediately raised by this structure relate to the definitions of the primitive services. What should they be?
What level of abstraction should they provide? How many of them should there be? There are several possible answers to these questions. The thesis provides one possible set of answers, but this is by no means the only possible set, and is certainly capable of improvement. The thesis helps define some of the characteristics of this set, and improvements in some areas are suggested in the section on future work in Chapter 7. Many researchers, particularly those involved in microkernel research, espouse the view that the core of any system should be a small nucleus that exports only the base abstractions provided by the underlying hardware [Cheriton & Duda 94, Engler & Kaashoek 95, Hartig et al. 97]. On typical systems, this essentially reduces to four subsystems: a facility for creating address spaces and installing virtual to physical memory mappings, a method for changing the processor context, a means of capturing interrupts and traps, and finally an IPC mechanism, so that services can communicate with each other. While these capabilities are sufficient for the implementation of high level services, we instead chose to implement services that offer more complex functionality. We reason that any meaningful operating system has to provide higher level instantiations of these services (e.g., unless the system is only ever going to execute a single process, there has to be some true scheduling support, using the low level context switching interface), and decided that, for the prototype, we would concentrate on providing core services that would be applicable to, and needed for, any operating system design. In each case, these services do in fact rely on low level interfaces, but we have chosen not to make these available as services to the rest of the system, believing that their functionality is subsumed by the services actually provided.
We have also implemented these services to be policy free, or where this is impossible to achieve, carefully separated the modules implementing policy from those providing mechanisms, allowing easy replacement. The exception to these statements is the virtual memory system. It is certainly possible to provide a substantially different model of virtual memory behaviour, as has been demonstrated by the proponents of single address space systems [Chase et al. 93, Bartoli et al. 93]. However, developing different virtual memory systems would also require fundamentally different sets of higher level services, and as we only wished to build one set of services, it was not deemed feasible to address this with the current research. The majority of the remaining sections contain descriptions of each of the lowest level services currently provided by the Kea system, with special emphasis on the facilities provided for service manipulation (the "service service"). Unless otherwise specified, full details on the interface to each of these services can be found in Appendix A.

3.2 Domains

Domains are the virtual address space abstraction provided by Kea. A domain defines the set of virtual addresses which are valid for threads executing within the domain. The physical memory pages mapped by domains can be arbitrarily shared, mapped as copy-on-write, or copied between domains. The various parts of a domain's memory can have different protection attributes, such as read only or execute. Non-protection related functionality includes the ability to lock parts of the address space in memory. For device driver support, memory can be mapped at a specified physical address. Kea has not innovated in this area, except to capture the virtual memory functionality as an explicit interface. There are two separate interfaces which implement the domain functionality.
The domain interface itself allows for the creation and manipulation of domains, with the exception of the functionality for manipulating the virtual memory of an individual domain - this is done through the "vm" interface. Complete details on these interfaces can be found in Appendices A.1 and A.2 respectively. Domains are the key security abstraction within Kea - all other resources are "owned" by a given domain. This is a consequence of the service-centric viewpoint taken by Kea, where services can execute in their own domains, and the design of the IPC system (discussed in Section 3.6). The implications and management of security are discussed in Section 3.12.

3.3 Events

The events abstraction defines the means by which asynchrony is controlled. Under Kea, a domain can register to receive notification of events that are significant to that domain, or that it has an interest in. Events subsume the interrupt mechanism, as all interrupts are turned into events. This permits the implementation of device drivers as separate services, and frees them from any restrictions on their placement (in particular, they need not be in the kernel). Another use of events is the notification of processor generated traps such as page faults and illegal instruction faults. Certain predefined events are also provided by other parts of the system, such as the domain management service, which signals the death of a domain whenever this occurs. Complete details on the event interface can be found in Appendix A.3.

3.4 Names

For convenience, Kea includes a simple name system. This system lets arbitrary integer identifiers be attached to names. The identifiers can be used to represent other system objects, particularly domains, threads (see Section 3.5) and services (Section 3.7). Thus, for instance, instead of having to know the identifier of a service, clients can look it up under an appropriate name, such as "/system/service/file".
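As a rough illustration of the idea (the actual Kea name interface is described in Appendix A.4, and supports a hierarchy to which other services can attach; the flat table and function names below are invented for the sketch), a name service of this kind reduces to a mapping from path strings to integer identifiers:

```c
#include <string.h>

/* Hypothetical sketch of a flat name table; the real Kea name
 * service is hierarchical and allows other services implementing
 * the name interface to attach at interior points. */
#define MAX_NAMES 64

struct name_entry {
    const char *name;   /* e.g. "/system/service/file" */
    int         id;     /* identifier of a domain, thread or service */
};

static struct name_entry name_table[MAX_NAMES];
static int name_count = 0;

int nameAttach(const char *name, int id)
{
    if (name_count >= MAX_NAMES)
        return -1;
    name_table[name_count].name = name;
    name_table[name_count].id   = id;
    name_count++;
    return 0;
}

/* Returns the identifier bound to a name, or -1 if absent. */
int nameLookup(const char *name)
{
    for (int i = 0; i < name_count; i++)
        if (strcmp(name_table[i].name, name) == 0)
            return name_table[i].id;
    return -1;
}
```

A client would then obtain a service identifier with a call such as `nameLookup("/system/service/file")`, rather than having to know the identifier in advance.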
The name service also allows other services implementing a name interface to attach themselves to any point in the name hierarchy. Naming in operating systems is a very complicated issue [Radia 89], and is made more interesting in Kea by the system's configurability. While it would be interesting to explore naming to a greater extent than has been possible, the simple system provided has proved adequate for the basic needs of system evaluation. The name interface is described in Appendix A.4.

3.5 Threads

Threads in Kea are superficially similar to the concept of threads in other operating systems: from the application's point of view, they are simply a context in which code execution takes place, or a "sequential flow of control" [Birrell 89]. In most operating systems, a thread is more than just a flow of control; it also encompasses the notion of schedulable entity, the "thing" to which CPU time is allocated. Kea separates these notions, dividing a thread into two entities. The first, which we continue to refer to as a thread, is the construct to which the scheduler allocates CPU time. The second, referred to as an activation, contains all the thread state, in particular the current domain, register contents, and user and kernel stacks. One thread can have multiple activations associated with it, organised as a stack, of which only the activation at the top of the stack is active. This thread organisation, and the thread semantics described in the remainder of the thesis, closely follow those of Spring [Hamilton & Kougiouris 93] and the "migrating threads" developed for Mach 3.0 [Ford & Lepreau 94] (from which, in the interests of reducing confusion and the growth of new jargon, the activation term is borrowed). The primary reason for structuring threads in this manner is that it facilitates the design of the Kea IPC mechanism.
This mechanism, including the details of its implementation and relationship to the thread model, is discussed in the following section. A description of the thread interface can be found in Appendix A.5.

3.6 Inter-Domain Calls

Conventionally, decomposed systems use message passing to communicate. As described in the introduction, we believe that this is fundamentally the wrong paradigm, and that direct operating system support for procedure calls between address spaces is superior. We refer to such a procedure invocation as an inter-domain call, or IDC. Other systems, such as LRPC [Bershad et al. 89] and Spring, have claimed to support this paradigm, but still rely on the marshalling of procedural arguments into a single contiguous buffer, usually accomplished within an automatically generated stub procedure. The IDC mechanism instead relies on the natural argument layout, usually on the stack and/or in registers, of the architecture on which it is running. There are several facets to the implementation of IDC. The primary entity representing a potential IDC is the portal, while the IDC itself is accomplished by a portal invocation.

3.6.1 Portals

To perform an IDC, a service client needs to have some means by which to refer to procedures within a service. This is accomplished through the use of portals. A portal appears to the client as an integer identifier. In order to make an IDC, the client uses the system call portalInvoke(), which takes the portal identifier as its first argument. This call is interesting for several reasons. It is the only system call in Kea - every other service, including the low-level kernel provided ones, is accessed through this call.
It has the following prototype1:

    int portalInvoke(int portal, int *ret, void *args);

where ret is a pointer to the return value (if any) from the procedure to be called, and args is a pointer to the location of a buffer containing the call arguments. It is the args variable that is machine dependent in its meaning. On an Intel x86 machine, it will be a pointer to the stack location holding the arguments. On a SPARC based machine, the first six arguments are passed in registers, and args will point to the remainder (if any) on the stack. During a portal invocation, the following actions occur:

1. The portal identifier is used as an index into a system table to obtain the destination domain and entry point.
2. A new activation is created.
3. The new activation is pushed onto the current thread's activation stack.
4. Execution continues in the target domain, at the designated entry point.

The portal invocation process is illustrated in Figure 3.1.

[Figure 3.1: Portal Invocation - Domain A calls Domain B; the new activation (Act B) is pushed above the caller's activation (Act A) on the thread's activation stack, which is maintained below the user/kernel boundary.]

Returning from an invocation is accomplished by placing a special marker value at the top of the called domain's stack. When the called procedure returns, this value causes a page fault at the specified address. This fault is interpreted by the kernel to mean that the current activation has finished, and the internal portal return code is called.

1. It should be noted that, for clarity, some of the prototypes and data structures described have been simplified slightly. By far the most common change is to the type of some variables or parameters, to avoid the need to include the appropriate typedefs. These changes do not affect the semantics at all. An example is the replacement of the Kea vaddr (virtual address) type with "void *". The definitive code is shown in the appendices.
This code "pops" the current activa-tion off the thread's activation stack, copies any return arguments from the called domain to the caller, and returns control to the calling domain. When portals are created, the creator must specify the signature of the procedure that will be invoked by the portal. The signature is the number and type of arguments that the procedure expects. The argument types are limited to simple scalar data (integers, characters and floating SECTION 3.6 INTER-DOMAIN CALLS 43 point numbers), pointers to single structures or scalars, strings (pointers to zero terminated character strings) and pointers to arrays of structures or scalars. In the latter case, the argument after the pointer must be an integer, which the caller must initialise to be the number of ele-ments in the array. Portals are created with the p o r t a l C r e a t e () call, the prototype for which is: i n t p o r t a l C r e a t e ( s t r u c t domain * d o m a i n , v o i d * e n t r y , s t r u c t p o r t a l _ s i g n a t u r e * s i g n a t u r e , u n s i g n e d p s s i z e , u n s i g n e d n a r g s The arguments to this call are the domain in which the entry point exists, the entry point itself, an array of descriptors representing the signature of the underlying procedure, the number of signature entries, and the total number of arguments that the procedure has. 
The portal_signature structure has the following definition:

    struct portal_signature {
        int ps_arg      : 4;   /* argument index */
        int ps_type     : 2;   /* argument type */
        int ps_modifier : 2;   /* modifier flags */
        int ps_length   : 24;  /* number of bytes */
    };

In this structure, ps_type is used for one of the three pointer types described above (pointer to single value/structure, pointer to array, or string), ps_modifier determines whether the argument is to be copied to the callee (in), back to the caller (out) or both (in/out), ps_arg determines which argument in the procedure is referred to, and ps_length describes the size of the element(s) to be copied. The definitions used for these values are:

    /*
     * values of ps_type
     */
    #define PT_PTR_ABS  0x0  /* absolute pointer */
    #define PT_PTR_MULT 0x1  /* array pointer */
    #define PT_STRING   0x2  /* null terminated string */

    /*
     * bits in ps_modifier
     */
    #define PM_IN  0x1  /* copy in */
    #define PM_OUT 0x2  /* copy out */

It should be noted that the signature entries do not include any scalar arguments, but only specify the types of the pointer arguments (if any) for the procedure. Arrays of more than one dimension, and data structures containing pointers (such as lists), cannot be passed by the IDC mechanism. In practice, we have found that these capabilities aren't necessary. We also believe that the addition of such a capability would result in a considerable performance degradation, due to the necessity of parsing a full argument description in the kernel, and might also result in services that were too interdependent.
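To picture how a kernel might consume these descriptors during an invocation, the sketch below computes the number of bytes to copy into the callee for one signature entry. It is illustrative only: the helper name is invented, plain fields stand in for the packed bitfields (which total 4 + 2 + 2 + 24 = 32 bits, one word per entry), and the PT_PTR_MULT case applies the convention that the argument after an array pointer holds the element count:

```c
#include <string.h>

#define PT_PTR_ABS  0x0  /* pointer to a single value/structure */
#define PT_PTR_MULT 0x1  /* pointer to an array */
#define PT_STRING   0x2  /* null-terminated string */

#define PM_IN  0x1
#define PM_OUT 0x2

/* Plain fields for clarity; the thesis version packs these into
 * bitfields occupying a single 32-bit word. */
struct portal_signature {
    unsigned ps_arg;       /* argument index */
    unsigned ps_type;      /* argument type */
    unsigned ps_modifier;  /* modifier flags */
    unsigned ps_length;    /* element size in bytes */
};

/* args[] models the caller's argument words; for PT_PTR_MULT the
 * element count is, by convention, the argument after the pointer. */
unsigned copyInSize(const struct portal_signature *sig,
                    const unsigned long *args)
{
    if (!(sig->ps_modifier & PM_IN))
        return 0;  /* nothing copied toward the callee */
    switch (sig->ps_type) {
    case PT_PTR_ABS:
        return sig->ps_length;
    case PT_PTR_MULT:
        /* element size times the count in the following argument */
        return sig->ps_length * (unsigned)args[sig->ps_arg + 1];
    case PT_STRING:
        return (unsigned)strlen((const char *)args[sig->ps_arg]) + 1;
    }
    return 0;
}
```

The kernel-resident copy loop is what lets Kea marshal directly between the two address spaces, without a user-level stub building an intermediate buffer.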
As an example, consider the following function prototype:

    void foo(int bar, char *str, int *stuff, int nstuff);

where str only needs to be passed to the callee, and stuff is presumed to be a pointer to nstuff integers, containing information that is read and changed by the callee. The signature array describing this would be:

    struct portal_signature fooSignature[] = {
        { 1, PT_STRING,   PM_IN,          0 },
        { 2, PT_PTR_MULT, PM_IN | PM_OUT, sizeof(int) }
    };

The use of signatures thus results in a slightly more restrictive viewpoint of procedure calls. For each portal, the kernel keeps the descriptor containing information on the number and types of arguments used by the procedure, and uses this information in the invocation code, in order to copy arguments between the two address spaces. Thus, the kernel code itself effectively marshals arguments, instead of this being done by a user-level stub, as is the case in almost all other IPC/RPC mechanisms. In many instances, this can reduce the number of memory copies, with a corresponding increase in efficiency.

3.6.2 Portal Remapping

Kea supports reconfiguration and extension through the remapping of portals. Remapping refers to the ability of IDCs using the same portal to be directed to different destinations. There are two types of remapping: simple and domain specific. Simple remapping is almost trivial in its implementation: it merely changes the internal mapping between a portal and its destination. It is used when global reconfigurations are made, i.e. when every client using the portal must now use a different service (for instance, if the original has been replaced). Domain specific remapping is more complex.
It is used for application specific modifications to a service, and results in portal invocations transferring control to one of two (or more) different destinations, based on the domain owning the thread making the call. This is shown in Figure 3.2. In this figure, two applications, A and B, are using a service, S, through portal P1. In turn, this service uses another service, SP, through portal P2, to make some policy decision. If A wants a different policy to be used, it can perform a domain specific remapping on P2, and arrange for all calls that originate in A to be re-routed to a new policy service SA. The two call paths are shown by the arrowed lines - solid lines for application B, dotted lines for application A. By using separate services for mechanism and policy (e.g. maintaining virtual memory mappings and page replacement strategies) the system becomes configurable in a large number of ways.

Domain specific remappings are implemented by giving each portal a table containing alternate portal identifiers. When a portal invocation is made, this table is checked, using the domain of the original caller (i.e. the activation at the root of the call stack) as the key. If an entry is found, then that portal is used instead.

[Figure 3.2: Domain Specific Remapping]

3.7 Services

While portals provide a means through which individual service calls can be made, they do not provide the service structuring required by the Kea design. The kernel provides a service (the "service service") that is used for the description of services. This is accomplished by aggregating portals together: from the kernel's viewpoint, a service is essentially a set of portals. These portals are manipulated as a group, ensuring the consistency of the service - the programmer never has to be concerned about the underlying portal representation.
In fact, portals are not visible to the user-level programmer at all, except for the portalInvoke() call (which is normally "hidden" inside automatically generated stubs). In particular, all of the portal manipulation functions are only available from within the kernel (they are not exported as a service), forcing programmers to think in terms of services.

To create a new service, the service implementor must provide two things. The first is a set of source files that implement the service procedures. The second is an interface description file, which describes these procedures and, in particular, the types and numbers of arguments each one takes. From the interface description file it is relatively simple to generate the client-side stubs needed by other services and applications which access the service. When a service is compiled, the result is not an executable. Instead, all of the object files comprising the service are linked into a single object file, leaving all external references unresolved. This step is taken so that when the service is loaded, it can easily be linked against any other services or libraries already present in the domain.

To load a new service, the serviceLoad() call must be used. This call has the prototype:

    int serviceLoad(
        int domain,
        char *name,
        void *code,
        unsigned codeSize,
        struct function_map *fnMap,
        int servSize
    );

The arguments to serviceLoad() denote the domain in which the service code currently resides, the name to be applied to the service, a pointer to the service code and the code size (loaded into the domain from a service file), a descriptor for each of the procedures in the service, and the number of procedures in the service.
The function_map structure used in serviceLoad() has the definition:

    struct function_map {
        void *fm_entry;
        char fm_name[MAX_FN_LEN];
        unsigned fm_stacksize;
        struct portal_signature *fm_ps;
        unsigned fm_psSize;
        unsigned fm_nargs;
    };

The function_map specifies the entry point, name and argument descriptors for each procedure. The name is used for dynamic linking and co-location purposes. Also included is an indicator of the stack space needed for the procedure to execute. It is intended that this parameter will eventually be removed, and that stacks will be made dynamically growable, but this has not yet been implemented.

1. The stub compiler for the Kea system is not yet functional. This is mostly attributable to the ease of manual stub generation, and a desire to proceed with more interesting research issues.

When a service is loaded, several different actions take place. Firstly, the service symbol table is read and verified. Then the service code and data segments are created, using the code pointer provided. The service code is then linked against any other library code resident in the domain; typically, this code is the result of other services or applications that have already used the domain. A symbol table is maintained for each domain to facilitate this. External symbols available to the service are restricted to those loaded from libraries: where a symbol is not already present, it is located and loaded from the libraries available to the domain, and the new symbol is added to the domain's symbol table. If the symbol cannot be resolved, the service load will fail. In the final stages of the link phase, operations such as text and data relocation take place. From the function_map information provided, the kernel next generates each of the portals that will be used by clients of the service.
Associated with the internal service table, the kernel keeps separate pointers to the code and data segments, the symbol information for the service itself (as opposed to the domain's symbol table) and any associated relocation information. This ensures that the service can be efficiently relocated into another domain if necessary.

The penultimate stage of loading a service is the creation of a new thread that runs the service initialisation routine, allowing the service to set up its internal state. The name of the initialisation routine is formed by appending the string "ServiceInit" to the service name; e.g. the "foo" service would have the function "fooServiceInit" called, if it existed. Once the initialisation function has completed, the final stage in service loading is the insertion of the service structure into the kernel service table, making it available to other users of the system.

1. One of the entry point and name could be eliminated, as, given one, the other can be deduced from the service's symbol table. This is currently not done, as keeping both simplifies some code, and also provides a check against badly specified service definitions.

An important point in the loading of services is that all of the memory used by the service is mapped by the kernel in such a way that it cannot be deallocated, even by the owner of the domain. This applies to the service symbol table, code and data, and prevents other services or applications with which the service may be co-located from interfering with the service. For similar reasons, the service code and symbol table are also made read-only.

3.8 Service Acquisition

Before using a service, a client must first acquire it, but before discussing the details of how this is accomplished, it is necessary to examine the client's view of a service. The only part of the service normally visible to the client developer is a function pointer for every procedure in the service.
Function pointers have the advantage that they can be used in a syntactically identical manner to direct function calls, which enables the use of these pointers to be hidden from the client developer. More importantly, the kernel can change these pointers to refer to a different entry point, which is a necessary prerequisite to supporting service co-location, migration and replacement. Hidden from the developer is an array of integers, used to hold each of the service portal identifiers, and two stubs for each of the service procedures. The function pointers can refer to one of the two stubs, which are automatically generated by the stub generator. One stub (the portal stub) is used when the client is in a separate domain from the service; the second (the co-location stub) is used when it is co-located. The first of these stubs is very short, essentially containing only the minimum overhead necessary for a portal invocation. For example, the portal stub for the foo function described earlier might, for an architecture in which all parameters are passed on the stack, have the following code:

    void fooStubPortal(int bar, ...)
    {
        portalInvoke(portals[FOO_PORTAL], NULL, &bar);
    }

1. For clarity, additional parameters are denoted only by ellipses.

Co-location stubs are slightly more complex. The essential task to be accomplished is a call to the co-located code, but this is complicated by the need to support service migration and replacement. To explain this in the appropriate context, discussion of the structure of this stub is deferred to Section 3.10.1.

Services are acquired by using the serviceAcquire() call. This call has several purposes. It allows the kernel to keep a record of clients, which is important for reconfiguration. It also allows the security service to check whether the domain trying to obtain the service is allowed to access the resources managed by that service.
For the client, it provides a means through which it can obtain the portal identifiers needed to invoke the procedures making up the service. In the current system, calls to serviceAcquire() are contained in code automatically generated with the client-side stubs from the interface description file. The prototype for serviceAcquire() is:

    int serviceAcquire(
        char *name,
        int domain,
        int *portals,
        struct serviceStub *stubs,
        unsigned nstubs,
        struct serviceCount *count
    );

Given a service name, the domain which is acquiring the service, a pointer to portal identifiers and an array of stub descriptors, the serviceAcquire() function determines whether the domain is allowed to acquire the service, and if so, initialises the portals to those needed for access to that service. The serviceStub structure is used to pass information on the function pointer used to access the client stubs, and the addresses of the stubs themselves. It has the definition:

    struct serviceStub {
        void *ss_function;
        unsigned ss_stubPortal;
        unsigned ss_stubColocate;
        void *ss_colocateFn;
    };

The ss_function pointer is used to pass the address of the function pointer used to access the service procedure. It will hold one of the two values passed in ss_stubPortal (the address of the portal stub) or ss_stubColocate (the address of the co-location stub). The ss_colocateFn field is used to support service co-location, migration and replacement, as is the count parameter to serviceAcquire(). The need for, and use of, each of these will be described in Section 3.10. When a service is acquired, the kernel also checks whether the service domain is the same as the domain acquiring the service.
Based on the result of this test, it adjusts the client function pointers to refer to the appropriate stub.

3.9 Internal Service Structure

From the previous sections, it should be clear that the kernel keeps detailed information on each service, the relationships between services, and the domains containing and using services. This information is summarised in Figure 3.3. The internal service table records, for each domain, a table of the globally visible symbols, a list of the client domains for each service (this table also includes the addresses of the function pointers used to access the service), and the portal identifiers for the service. Not shown is some incidental information that is also kept, such as the locations of the code and data segments in the service domain. In the figure, the expansion of this information for service B shows that both service A and the application are clients of the service. The necessity for, and use of, this information is described in the following section.

3.10 Service Manipulation

There are several ways in which services can be manipulated, contributing to the configurability and extensibility of the system. The primary three are service migration, replacement and interposition. The latter two operations can also be performed on an application specific basis. The remainder of this section discusses the implementation of each of these facilities. Experimental evaluations and measurements of various manipulations on a variety of services are reported in chapters 5 and 6.

[Figure 3.3: Service and Domain Relationships - for each of Service A, Service B and the application domain, the kernel keeps a symbol table, a client table and portal identifiers.]

3.10.1 Service Migration

As a system is used, it may be desirable to move services into other domains. The primary reason to do this is performance.
If a service can be migrated into the domain of the most frequent client of that service, then the time needed for each IDC can be reduced by the time taken to change address spaces, which is a potentially expensive operation. Alternatively, IDCs into the kernel are far cheaper than those into another domain, as hardware architectures are optimised for such transfers of control. Thus, for trusted services such as device drivers, file systems and network protocols, it is desirable that these be moved into the kernel once they have been debugged. This arrangement offers several advantages to the service developer, particularly one developing kernel services. The primary advantage in this case is convenience. It is not uncommon for newly developed kernel services to cause a system crash, necessitating a tedious program/crash/reboot development cycle. Within Kea, services can initially be developed within a private domain, which restricts the crash (if any), or other consequences of a programming error, to that domain. Additionally, since the service is running in a separate domain, it can be easily debugged using standard tools.

The migration process is, in many respects, almost identical to the service load process. Memory in the destination domain is allocated for the service code, data and symbol information, which is then copied over. The same linking and initialisation functions described in Section 3.7 are then carried out. Also, each of the service clients is checked to see if it was, or is now, co-located with the service, and the stub pointers are adjusted accordingly. These tasks are all easily accomplished due to the kernel's knowledge of the service structure. The process is, however, complicated by two extra factors not present when a service is first loaded.
The first of these is that the service has already been initialised and running for some time, and will probably have some internal state that should be migrated with the service. The second is the maintenance of service for clients that are executing within the service when it is migrated.

Service State Transfer

A large number of services will maintain some internal state (e.g. active file descriptors in a file system). The simplistic service migration described above will only migrate the text and statically allocated data of the service, and not this state, which is necessary for the continued, and correct, functioning of the service. To ensure the correct migration of services, Kea provides a means through which the service designer can arrange to have this state transferred with the service. When the service is migrated, the kernel checks for the existence of a state packaging function (the name of this function is composed by appending "MigrateState" to the name of the service). If it exists, the kernel makes an upcall to this function. The function should package the service state which needs migration into a single memory buffer, which is returned as the result of the function. This data is then copied to the destination domain with the service code and data, and a pointer to it is passed as an argument to the service initialisation function, allowing the service to recover the state. Where there is no migration function, or when the service is first loaded, the service initialisation function is passed a null pointer, enabling it to detect this. We believe that passive state (i.e. data allocated by the service, as opposed to active state, such as threads created by the service) could be automatically transferred using new techniques developed for process migration [Smith 97], and plan to pursue this possibility in future work.
Migration and Service Clients

The second major concern with service migration is the problem of clients who are currently using the service. These clients cannot have their calls terminated, and must continue to get correct results. To handle calls made after the migration has started, each portal has an associated variable that indicates whether the service backing the portal is currently being migrated. A check of this value is made during portal invocation, and if migration is taking place the calling thread blocks until the migration is complete. Additionally, any co-located clients have their stub pointers remapped back to the portal stub.

The problem of client calls already extant in the service code is more difficult. The state cannot be deemed consistent, or be gathered, until all clients of the service have completed their execution within that service. The detection of current clients of the service is solved in two parts. Firstly, each service has an associated counter, which records the number of portal invocations that have been made for that service. This counter is incremented immediately after the check for migration described in the previous paragraph, and decremented in the portal return path. Thus, the counter maintains an exact count of the clients that have entered, or are about to enter, the service code. Before calling the service's state acquisition function, the kernel first checks whether the service invocation count is zero. If not, it relinquishes the CPU, checking the count again when it is next rescheduled, an action which is repeated until the count reaches zero. This need to wait for all service clients to finish executing within the service is the major drawback of the migration mechanism, but it is the only feasible solution.
In practice, we find that almost all of the kernel services finish execution within a few milliseconds, the exception being the low-level disk drivers, which could potentially take hundreds of milliseconds for long disk queues, although we have not observed this. This delay probably makes migration (and replacement) unsuitable for use in real-time systems, at least for some clients, but general timesharing systems should have few problems.

The second half of the solution concerns those clients that are co-located with the service. As these calls do not go through the portal invocation mechanism, they are not recorded by the service invocation count. The solution to recording co-located clients involves the use of the ss_function field of the serviceStub structure and the serviceCount structure passed as a parameter to serviceAcquire(). This structure has the definition:

    struct serviceCount {
        bool sc_colocated;
        unsigned sc_count;
    };

The stubs used when the service is co-located make use of this structure to record the number of times they have been invoked. Stubs generated in this manner have, in general, the following structure:

    static struct serviceCount sc;
    static void (*fooColocatePtr)();

    void fooStubColocate(...)
    {
        sc.sc_count++;
        if (!sc.sc_colocated) {
            sc.sc_count--;
            fooStubPortal(...);
        } else {
            fooColocatePtr(...);
            sc.sc_count--;
        }
    }

These structures and stubs interact to make co-location and migration possible.
The ss_colocateFn parameter for each stub is set to the address of the internal function pointer for the co-location stub (this should not be confused with the stub function pointer, which refers to either the portal or the co-location stub). This internal pointer is adjusted when services are co-located, in order to point directly to the entry point in the co-located service. When the migration function is first initiated, one of the first steps is to change the stub function pointer to the portal stub, and then to set the co-location boolean to false. Clients that call the co-location stub before the function pointer is changed will always increment the counter. Those that are pre-empted after incrementing the counter, but before checking the co-location variable, will quickly decrement the counter and be blocked in the standard portal invocation path, while the others are guaranteed to eventually enter the service via the internal stub function pointer. As in the portal invocation case, the kernel checks the value of the service counter, and waits until it becomes zero before calling the state acquisition function.

Migration Prototype

The prototype of the service migration function is:

    int serviceMigrate(int service, int domain);

where service is the identifier of the service to be migrated, and domain is the domain the service is to be migrated to.

3.10.2 Service Replacement

There are two types of service replacement, global and local (or application specific). Global service replacement is similar to migration, and is performed for similar reasons. The replacement may offer greater performance, be more reliable, or offer increased functionality. In this case, it is desirable that the system administrator be able to transparently replace the old service, ensuring minimal disruption to the users of the system.
This is particularly useful if the machine is intended to be highly reliable, available or fault tolerant, as it ensures that all services offered by the machine suffer only a very small interruption.

The global replacement function is essentially a combination of service loading and service migration. The old service is suspended, and its state acquired, exactly as for migration. The new service is then loaded normally, except that it is given the state buffer from the old service as an initial argument, and, instead of new portals being created, those from the old service are recycled.

Local Replacement

Local service replacement is conceptually similar to the global case, in that one service appears to replace another. The difference is that the replacement is done in such a way that the change is visible to only one domain - other domains continue to see the original service layout. Part (c) of Figure 3.4 shows a local replacement. In the figure, domain B has done a local replacement of S2 with S5, so that all calls originating in B now use S5, while calls originating in all other domains continue to use S2. Local replacement is used by applications in order to install services that offer increased performance for their particular needs, while not affecting other applications.

Local replacement is implemented relatively simply, by installing a domain-specific mapping on each of the portals of the service being replaced. The only complication is for clients that are co-located with the replaced service; their optimised direct calls cannot be used, as such calls do not go through the portal invocation mechanism. In this case, the clients are forced back to the portal invocation stub.
This creates a potential performance problem, due to the additional overhead of portal invocation as opposed to a direct call (although, if the services are co-located in the kernel, as is the normal case, this overhead is only a fraction of a microsecond).

[Figure 3.4: Service Reconfigurations. This is a copy of Figure 1.3, reproduced here for convenience.]

Replacement Prototype

The service replacement prototype is:

    int serviceReplace(
        int service,
        bool global,
        int domain,
        char *name,
        void *code,
        unsigned codeSize,
        struct function_map *fnMap,
        int servSize
    );

This function behaves exactly like serviceLoad(), with the addition of two extra arguments, denoting the service to be replaced and whether the replacement is to be global (global = true) or local (global = false).

3.10.3 Service Interposition

The final method by which services can be manipulated is interposition. Interposition is illustrated in parts (a) and (b) of Figure 3.4, where a new service (S4) is interposed above another (S1). To perform a service interposition, we assume that the interposing service has already been loaded, and has acquired any lower level services it requires. To perform the interposition, the kernel must examine each of the clients of the service being interposed upon, and remap them appropriately. If the client is the interposing service itself, then nothing needs to be done. If not, then the portals in the client domain are remapped to the interposing service. The kernel also records, in the kernel service structure for the interposed-upon service, that it is interposed upon, and by whom, so that future clients who try to acquire the service can be remapped to the interposer. Service interposition, like replacement, can also be done on a global or local basis.
The prototype for the service interposition function is:

    int serviceInterpose(int interposer, int interposee, bool global);

where the first two arguments denote the service identifiers for the two services concerned, and the third serves the same purpose as that of serviceReplace(). In fact, once the new service has been loaded, serviceReplace() uses the service interposition function internally for local replacements - both perform the same logical operation, namely a domain-specific remapping for each of the service's clients.

1. Service interposition is, strictly speaking, a misnomer for the operation being described. As services are expected to acquire their own lower level services, what we call interposition really only performs one half of the operation, namely the remapping of the original service's clients to the interposer.

3.11 User/Kernel Unification

One of the interesting requirements of the Kea design is that exactly the same programming interface and semantics be provided to all users of the system, regardless of the address space (user or kernel) in which the code is ultimately run. Fully supporting this capability requires that the kernel be able to dynamically link against libraries required by code loaded into the kernel, that code running at both kernel and user level have the same system call semantics, and that kernel stacks be dynamically growable. With the exception of the last point, all of these capabilities are provided by the Kea system. Currently each thread has a fixed-size kernel stack (typically 8 KB). Although techniques do exist for growable kernel stacks [Draves & Cutshall 97], we have found that, in practice, this facility is not needed. It may, however, be advantageous to develop it for future versions, as it would allow for the allocation of large automatic variables and for recursive code.
User/kernel unification offers significant software engineering benefits. It removes many of the problems normally associated with developing kernel-level code: standard debugging and profiling tools can be used, there is no need for any special forms of synchronisation (e.g. Kea makes the same mutex and semaphore operations available to user and kernel threads), and developers only have to be aware of one programming environment (which also has an impact on the size of the technical documentation required). These advantages, if they can be achieved with small overhead, make sense in any operating system.

3.12 Protection & Security

There are several issues of security interest in the Kea design. As well as the standard security issues that any operating system must deal with, special problems are raised by the decomposed system design and by the reconfigurability of the system, both general and application specific.

3.12.1 General Security

For the scope of this thesis, general security refers to those security issues that all operating systems must be concerned with. These can generally be constrained to the protection of resources, whether a process's memory space or a user's files, from access by users who are not privileged to access those resources. As Kea is designed to be a decomposed system with replaceable components, it tries to isolate the security aspects of the system from the other basic services provided. It is impossible to totally separate security policies from other interfaces, as it is fundamental that security checks take place before many of their operations. With this in mind, it was decided to place all of the security related information in a single structure, which is then made part of the per-domain kernel structure.
In the current system, this structure holds only two integer variables, the user identifier (UID) and group identifier (GID), which denote the user and group who own the domain. These values are used to implement simple Unix-like security semantics. Matching the domain structure (which determines what entity is performing an action), each thread, event and service has a corresponding structure which determines which users and groups are allowed to manipulate or acquire it.1 For each aspect of kernel functionality where it was decided that a security check was needed, the security service exports a number of boolean procedures which take domain identifiers as arguments and determine whether the specified operation is allowed. For example, only a user with UID 0, or with the same UID, is allowed to map memory in another domain. The function used to check this has the following code:

    bool secAllowVmMap(int domain1, int domain2)
    {
        int user1 = getSecInfo(domain1)->si_user;
        int user2 = getSecInfo(domain2)->si_user;

        return ((user1 == 0) || (user1 == user2));
    }

1. Each structure holds a 32-bit value for each of the UID and GID, with bits that correspond to allowing or denying various operations on that entity.

By encapsulating the security information in a single structure (specified in its own header file), and keeping all the security functionality in a single source file, Kea modularises, as far as is possible, the security information for the system. Unfortunately, it is impossible to provide in advance all the security structures and checks that may be desired in the future. For instance, if at some time the ability to protect sections of a domain's virtual memory space with different per-user protections were judged necessary, this information would need to be added to the internal domain structure, and the secAllowVmMap() function expanded to take address ranges as arguments.
Thus, while the security encapsulation provided by Kea goes some way towards making the implementation of other security methods easier, it does not totally eliminate the work required.

One complication caused by making domains the holders of security information is that of portal invocation, and the consequent need for threads to change domains. This raises the question of which domain should be seen as performing any action: the one at the root of the activation stack, or the one in which the thread is currently executing? Examining the simple choices implied by the question revealed that neither was feasible. Consider the first option, that of making the original domain responsible. If the thread is executing in another service, this service may wish to perform actions on its own behalf, rather than on its caller's. If every operation is interpreted as being made by the original domain, this becomes very difficult, although not impossible - the service could create separate threads based in its own domain to carry out such operations, but this would be extremely awkward. The obvious alternative, making the domain in which the thread is running responsible, is also infeasible, due to the system's reconfigurability, particularly in the face of service interposition. Consider a service (A) using another service (B), where B maintains a small list of domains which are permitted to perform various operations, and presume that A is on this list. If a service (C), which is unknown to B, is now interposed between A and B, then B will refuse to process any calls from C, even though they originate in A.

To solve these problems, Kea introduces the concept of the effective domain, which is essentially a combination of both the methods described above. Each activation contains a value, the effective domain, which identifies a domain.
The effective domain value for the root activation is set to the domain in which the thread is created, and each successive activation copies this value from the previous one (this is done during portal invocation). Any service can change the effective domain, but only to one of two values: that of the domain in which the service is loaded (using the setEffectiveDomain() call), or the default value, that of the calling domain (using the resetEffectiveDomain() call). Services that wish to check domains can use the getEffectiveDomain() function to obtain the current effective domain, and determine which of the domains in the thread's call stack wishes to be responsible for the current call. This system keeps the default of making the original domain responsible, while allowing individual services to make calls in which they specify that the operation is to be carried out on their behalf, rather than that of the original caller. In several respects, this is similar to the Unix setuid mechanism [Ritchie 79], in which applications with the appropriate file protection bits set can run as if they were executed by the file owner, as opposed to the person who executes the file. Important differences are that the effective domain is changed on a thread, rather than a process, basis, is always temporary (it is reset when the thread returns through the portal), and is based on address space, rather than user identity (although this is only one level of indirection removed from the domain).

3.12.2 Protection in a Decomposed System

In a decomposed system like Kea, it is desirable to have some means by which services can protect themselves from interference by other services and applications. This is provided by letting services run in separate address spaces, which are efficiently supported by the system hardware.
As described earlier in the thesis, the major disadvantage of this means of protection is the need to change address spaces when making inter-domain calls, which is a very expensive operation. As a consequence, Kea is designed with the presumption that there are only a limited number of circumstances in which the system designer will wish to run services in separate domains:

• Service debugging - when a service is being debugged, it is easier to control the service's interactions, and to monitor its operations, when it is running in an isolated domain.
• Service testing/verification - even if not actively debugging a service, it is desirable to restrict it to a single address space when it is newly developed or installed, in order to verify that it behaves correctly, and to limit the damage if it does not.
• Very reliable systems - in systems where reliability and fault tolerance are extremely important, it may be desirable to increase these by sacrificing some performance.
• User provided services - if a user has provided a service, it is unlikely that it should be treated as trusted.

Each of these points is related by the theme of service reliability. For the bulk of systems, we believe that it is both unnecessary and undesirable to have services exist in independent address spaces once they have been debugged. In almost all cases, system administrators will choose to configure their systems in such a way as to give the system clients (applications) the greatest performance, which will only be possible if services are co-located in the kernel¹. Kea has been designed with this in mind, but provides the means through which administrators and developers can choose for themselves where the trade-off between system modularity, safety and performance should be made.
The above discussion does not directly address all of the security issues associated with the last point on the list (user provided services), as this section focuses on protection only (which can be provided by address spaces). A complete discussion of the issues for user services is given in Section 3.12.4 on local reconfiguration.

1. It may be possible to get better performance for some services by migrating them to an application domain which makes heavy use of that service, but this is expected to be the exception, rather than the rule.

3.12.3 Global Reconfiguration and Security

Kea takes a simple viewpoint of global reconfigurations. It is assumed that system developers and administrators are competent and understand the implications of system reconfiguration, that the services shipped with the system are safe to install, and that they are designed to work together. Given these conditions, there are no problems associated with global reconfigurations.

3.12.4 Local Reconfiguration and Security

Local reconfiguration is a much more complicated issue. The two primary problems are:

• Which services are "reconfiguration safe"? That is, of all the services making up the system, which ones can be interposed on or replaced on an application specific basis?
• How can the safety of code be guaranteed? That is, once code has been installed, how can the system administrator be reassured that it will not damage other system components?

The answer to the first of these questions is determined by the design and purpose of the services themselves. It is unlikely that a typical user should be allowed to install a service which interposes on, or replaces, the default disk driver. However, there is no reason why an application should not be able to interpose on any service which can be directly acquired by that application.
In general, the question of which services are reconfiguration safe must be answered by the system architects (for general users) and the system administrators (who may choose to give certain applications greater freedom than others).

The second issue in local reconfiguration, determining the safety of code, is one of the principal research questions attacked by several other projects, notably SPIN [Bershad et al. 95] and Vino [Seltzer et al. 96]. Because these systems are answering these questions, it was decided that it would be more sensible to reuse any applicable methods developed, rather than expend resources on what was judged to be only one of the issues involved in the design of a reconfigurable system. As described in Chapter 2, these systems use a variety of methods to accomplish their goal. Of these, the most promising are software fault isolation [Wahbe et al. 93] and the use of a type-safe and modular language, such as Modula-3 [Nelson 91]. These solutions allow the system to guarantee that externally supplied code will not write to memory outside the service boundaries, and will access all internal memory as the correct type. Although Kea does not implement these solutions, there is no obvious reason why they could not be incorporated when needed.

With a Kea-like system, we believe that the future will bring about two distinct classes of extensions. Firstly, each system will be shipped with a large number of prebuilt services, many of which will be explicitly designed for the purpose of application specific changes to the system. These services will incorporate some method of digital signature [Chaum & van Antwerpen 90, Rivest 92, Microsoft 97] through which the system can guarantee their safety. As new application demands are made known, third-party services will be developed that provide solutions for these applications. These facilities will suffice for the majority of application demands.
The remainder will be those applications that need to make a large number of, or highly sensitive, system changes, and also demand high performance. Examples of such applications might be database or file servers. We believe that these applications will normally be dedicated to specific machines, with minimal interference from other users, and will require enhanced privileges in order to run effectively. As these services will be confined to restricted environments, the issues of application interference will be much diminished. In fact, such systems will represent a blurring of the lines between operating system and application, with the machine dedicated to a single purpose.

While there are other solutions to the problems of service safety, such as transactions (as exemplified by Vino) or proof carrying code [Necula & Lee 96], these are either too costly in performance terms or too immature to be considered for use at this stage.

3.13 Scheduling

Like security, thread scheduling is an issue which is impossible to totally isolate from all other parts of the system. Too many users of the system rely on the knowledge and manipulation of scheduling attributes for a dynamically replaceable scheduler to be a realistic possibility. We have, however, found that it is entirely possible to separate the scheduling behaviour from the rest of the system in such a way that it is trivial to build a kernel with a different scheduler. This can be contrasted with other systems, which incorporate knowledge of the scheduler in many different parts of the system. For example, the FreeBSD Unix code has a large number of references to the scheduling information in many different kernel source files.
The Kea system is unique in that it implements the scheduler entirely as a number of callbacks, made from the kernel whenever an event of interest to the scheduler occurs (such as a thread finishing a sleep, or a thread being created). All the low level code, such as the details of context switches, is kept in the body of the kernel, freeing the implementor of the scheduler to concentrate on the core algorithms. The scheduling information is defined in a single header file, and included into each thread's control block. Typically, the scheduler implementation is also contained in a single file. This system makes it relatively easy to implement schedulers, and several have been developed, including a fixed priority, round-robin scheduler, a real-time scheduler [Finkelstein et al. 95] and a Unix-style scheduler. Complete details of the scheduler interface are given in Appendix C.

3.14 Implementation Summary

Kea has been under development since July of 1994. Initial development was on a Sun SPARC IPC. The primary architecture changed to Intel i486 [Intel 90] and Pentium [Intel 94] based machines in January of 1995. It currently exists as a complete kernel, with all the services described in previous sections completely implemented, and many higher level services (described in the next chapter) providing device access, several file systems and complete networking stacks. All of this development has been the sole product of the author, except for the low level code for the initial i486 port (done by Peter Smith), most of which has subsequently been reimplemented, the Sun 3 port (by Christian Vinther), the sockets interface to the TCP/IP service (by Davor Cubranic) and the PCI support (by Norm Hutchinson). When compiled, the current version of the Intel kernel occupies 93 Kb (78 Kb code and 15 Kb data). The number of lines of code in each part of the kernel is shown in Table 3.1.
Code lines were measured using "wc -l". The components of the kernel measured included a simple C library (libc), miscellaneous code (initialisation, assembler and other unclassified code) and each of the major kernel services discussed in the previous sections.

    Subsystem    Lines
    libc          9454
    misc          4579
    domain/VM     3938
    thread        1592
    event          628
    name           665
    service       3004
    security       181
    scheduler      472
    Total        24513

Table 3.1: Number of Kea source code lines

3.15 Service Performance

The principal performance measurement made in the evaluation of any decomposed system is the time taken for a cross-domain call (although it has been argued that this and other related performance factors are becoming increasingly unimportant [Ousterhout 90, Anderson et al. 91, Bershad 92, Rosenblum et al. 95]). The traditional means of evaluating this factor is to measure the time required for a null procedure call (that is, a procedure call that has no arguments and returns no value) between two user level domains. However, there are several serious weaknesses implicit in evaluating systems by this measure only:

• Null procedure call is a poor benchmark. While it captures the cost of transferring control to another domain, it does not measure the cost of transferring arguments and results. Due to the need to marshall and unmarshall arguments, and to then copy them between address spaces, these costs can be significant. Measuring only the null procedure call encourages implementors to ignore the potential expense of user level software stubs. Also, it is extremely atypical to have a call of this nature in code - there are almost always parameters and results to be managed.
• Relying on a single benchmark, especially one that does not accurately reflect the workload placed on the system, is bad practice. It is better to design a range of tests, and draw conclusions, or make decisions on optimisation, based on the aggregate test results.
• Null procedure call only measures user-user interactions. It is equally important to measure calls between user and kernel space (when services are co-located in the kernel, or when control transfers between a service in the kernel and one in user space) and between services that are co-located (either in the kernel or user space). These measurements are very important, as we believe that service co-location, rather than separation, will be the normal configuration of most systems.

By neglecting these weaknesses, and concentrating on the optimisation of null procedure call, OS developers may be led into poor design decisions. To avoid falling into this trap, we propose a set of tests to measure the complete end-to-end performance of cross-domain calls, and measure these in a variety of different configurations, rather than concentrating solely on the user-user case. In the test suite developed, there are eight distinct types of test, many of which have several variations. The test types can be summarised as:

• null - The traditional null procedure call. This gives a first order measure of the overhead for cross-domain communication.
• fixed argument - null procedure call, with the addition of one or more simple parameters such as integers. The tests presented use from one to four arguments. This test, when compared with the null test, corrects for the cost of simple argument processing.
• return - null procedure call, but returning a value. This test allows an estimate to be made of the cost of returning a result.
• array - a procedure with two arguments, the first being a pointer to an array of integers, the second being the size of the array. This test measures the overhead of copying variable size arrays between domains. Several variations are possible. Firstly, the direction of argument transfer can be either In (from client to service), Out (from service to client) or InOut (copied in both directions).
The size of the array is also varied, from 4 to 1024 entries.
• structure - a procedure with one argument, a pointer to a structure. This test evaluates the copying of relatively small, fixed size blocks of memory. Once again, several variations are possible. The size of the structure can be varied - in the test results shown, the small structure contains 12 bytes, the large structure 80. While the direction of copying can also be varied (as for the array test), doing this reveals no more information than for the array test, so these results are not reported.
• block - moving blocks of anonymous data within the system is important for any application which performs I/O. The block test measures the time taken to copy large, fixed size blocks of data. Variations of the block size (1 Kb, 4 Kb and 8 Kb) were performed.
• string - many C-style procedures manipulate zero-terminated character strings. This test measures the time taken to copy such strings. Variations were done with a small string (5 characters) and a large string (60 characters).
• combination - this test combines various facets of each of the above. The first argument is an integer, the second a small string, the third a small structure, the fourth an array pointer, and the fifth the size of the array (16 integers).

Each of these tests measures different capabilities of any cross-domain call, and allows an evaluation of the total system capabilities to be made, as opposed to just those shown by the null RPC test.

3.15.1 Kea IDC Performance

Each of the tests described above was run for Kea. The experimental machine was a 100 MHz Pentium with a 256 Kb L2 cache and 64 Mb of RAM. Times were measured using the "rdtsc" (read timestamp counter) instruction, which returns the number of clock cycles executed by the processor. On a 100 MHz machine, this enables times to be measured with a resolution of 10 nanoseconds.
Results from these tests are shown in Table 3.2. These results were obtained by performing each test once (to warm caches) and then repeating it 1000 times, measuring the aggregate time taken, and dividing to obtain a result. Repetitions of the tests showed minimal variance (typically on the order of 0.1%). It should be noted that the times for kernel/kernel IDCs are not shown in the table, because they are constant at 0.6 μs (kernel/kernel IDC is very efficient because no traps or address space changes are performed, and the destination function can be called directly). The kernel/kernel times were still measured through the portal invocation mechanism, instead of being co-located, because if application-specific remappings are present, services cannot call each other directly, but must instead use the portal invocation mechanism.

The results show that, as might be expected, simple calls take a relatively small amount of time (approximately 25 μs), which gradually increases with the amount of data to be transferred. When the call is coming from kernel mode to user mode, the times are reduced somewhat by the elimination of one trap and slightly simpler argument processing.

    Test              user->user   user->kernel   kernel->user
                        (μs)          (μs)           (μs)
    null                25.7           3.1           21.4
    fixed1              26.1           3.0           21.4
    fixed2              25.8            -            21.4
    fixed3              25.9            -            21.4
    fixed4              26.0            -            21.4
    return              25.5           2.9           21.2
    arrayIn4            30.1           4.3           26.4
    arrayIn16           31.5           4.6           27.0
    arrayIn128          37.0          10.3           32.2
    arrayIn1024         79.7            -            75.3
    arrayOut4           31.6           4.6           27.3
    arrayOut16          32.6           4.9           28.4
    arrayOut128         37.0           8.7           31.2
    arrayOut1024        67.1          31.7           60.9
    arrayInOut4         33.8            -            31.0
    arrayInOut16        35.2            -            32.2
    arrayInOut128       40.8            -            37.2
    arrayInOut1024      90.3          43.0           83.3
    smallStructIn       31.3           4.2           27.3
    largeStructIn       31.8           4.7           27.8
    blockIn1K           45.7            -            38.7
    blockIn4K           80.1            -            76.0
    blockIn8K          144.5         105.7          137.9
    blockOut1K          40.2           8.1           34.2
    blockOut4K          64.9          31.2           59.1
    blockOut8K         150.0         106.6          138.6
    smallStringIn       31.5            -            27.3
    largeStringIn       34.4            -            30.2
    combination         39.4           7.6           35.5

Table 3.2: IDC Times. (In the original, related tests are shown with similar shading; cells whose values are unreadable in the source are marked with "-".)

There are some small anomalies in the table that may need additional explanation. The time for the "return" test is marginally smaller than that for the "null" test. This is because of a branch misprediction in all tests that do not return a value; adding a return value to the other tests reduces the time for each test by 0.2 μs. Also, the "Out" tests typically execute faster than the "In" tests. This is entirely due to cache misses in the data buffers.

Another observation about the times in Table 3.2 is that they could potentially be much improved. Currently, the portal invocation code is written entirely in C. It is probable that by rewriting and hand-optimising this code in assembler, substantial performance gains could be realised. The current compiler used for the Kea source code is gcc 2.7.2, which only supports optimisation for i486 processors. Using an experimental version of gcc with support for Pentium optimisation, savings of between 5 and 10% on the times reported can be made.
Unfortunately, the compiler is not quite stable enough to use for production purposes at this time. Also, for the Pentium, memory copies could possibly be made much faster by using the floating point registers to copy memory in 64 bit chunks, rather than the 32 bit copies currently used, although this technique would make context switches more expensive due to the need to save floating point state. Finally, only a relatively small amount of effort has gone into profiling and optimising the code at this time. The primary goal of the system was to develop something that was "fast enough" to enable the reconfiguration experiments, and the current code meets this goal. As will be shown in the following section, Kea compares well with other systems in any case.

3.15.2 Comparisons With Other Systems

The results in Table 3.2 should be compared to other systems. Unfortunately, there is no easy way to do this, as the results reported in the literature are typically for user/user null procedure calls only. For this single test, however, a number of results are available and are shown in Table 3.3. This table is based on one in [Liedtke et al. 95] and shows times extracted from [Liedtke et al. 97], [Liedtke et al. 93], [Hildebrand 92], [Ford et al. 96], [Schroeder & Burroughs 89], [Draves et al. 91], [Bershad et al. 95] and [van Renesse et al. 88].

    System    CPU, MHz            Time (μs)   Cycles
    L4        Pentium, 133           3.12        416
    L3        486, 50               10           500
    QNX       486, 33               76          2508
    Kea       Pentium, 100          25.7        2570
    Fluke     Pentium Pro, 200      14.9        2980
    Mach      R2000, 16.7          190          3173
    SRC RPC   CVAX, 12.5           464          5800
    Mach      486, 50              230         11500
    SPIN      Alpha 21064, 133      89         11837
    Amoeba    68020, 15            800         12000
    Mach      Alpha 21064, 133     104         13832

Table 3.3: Null Procedure Call Times

The important result shown by examining the times in this table is how well Kea IDC compares to other systems. Although three systems (L4, L3 and QNX) appear to perform better than Kea, there are several additional factors that must be taken into account.
Firstly, because the times in the table are for null procedure calls only, much of the real cost of handling parameters and data copies is ignored; this is not the case with Kea IDC, which implicitly includes this cost. Secondly, the L4 and L3¹ systems incorporate significant optimisations not made in, or not available for, Kea. Both systems are written exclusively in assembler, and have had many man-months of effort put into the optimisation of the execution path. Kea, on the other hand, is written almost completely in C, and has not had the optimisation effort invested in other systems. L4 also uses Pentium segment switching to accomplish address space switches. This technique is much faster than using conventional page table switching (saving a minimum of 3 μs in IPC times [Liedtke et al. 95]), but is only applicable to small address spaces of less than 4 Mb. All three systems are extremely small, and fit into the L1 cache of the systems on which they are run, which also substantially decreases their IPC times. In particular, for the L4 time reported, fully half the IPCs have all code and data resident in the L1 cache, while the other half have all code and data entirely in the L1 or L2 cache. It is perhaps much more accurate to compare with the times for the Fluke system [Ford et al. 96], a microkernel of similar complexity to Kea.

1. L3 is an ancestor of L4.

An interesting observation from the results shown is the predominance of Intel i486 and Pentium processors. These architectures have significantly higher costs for both address space switching and trap handling than all other modern processors. Their dominance in the table probably reflects more on their availability and the implementation effort invested for these systems than any other factor.
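The cycles column in Table 3.3 normalises the raw times across the very different clock rates involved: a call time in microseconds multiplied by the clock rate in MHz gives a cycle count. A small check of this arithmetic (the helper name is ours; published figures can differ by a cycle or so due to rounding in the original sources):

```c
#include <assert.h>

/* Convert a call time in microseconds to CPU cycles for a given clock
 * rate in MHz, as in the "Cycles" column of Table 3.3: one microsecond
 * at f MHz is f cycles. Rounds to the nearest cycle. */
long usToCycles(double us, double mhz)
{
    return (long)(us * mhz + 0.5);
}
```

For example, Kea's 25.7 μs on a 100 MHz Pentium gives 2570 cycles, and QNX's 76 μs at 33 MHz gives 2508 cycles, matching the table.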
As calls between user and kernel address spaces are analogous to system calls, we compared the "return" test for Kea with the "getpid" system call on FreeBSD 2.2.2-RELEASE on an identical machine. The Kea time of 2.9 μs compares very favourably with the FreeBSD time of 3.3 μs, showing that Kea is faster than a well developed monolithic system on at least one equivalent microbenchmark.

It is difficult to compare upcall and kernel/kernel times, as these results are not often published. Perhaps the best comparison to be made is with other extensible systems, as they, like Kea, must include support for call redirection in the kernel. The Vino authors report times of between 67 and 130 μs for the "null graft" case in several of their tests [Seltzer et al. 96]. While this includes support for the Vino transaction mechanism, it is still more than an order of magnitude greater than the Kea times. The SPIN system requires only 0.13 μs for a kernel to kernel call, but this is for services that have been dynamically linked (the Kea overhead for co-location is similar, about 0.3 μs, including the co-location stub overhead), rather than going through a redirection mechanism as in Kea.

3.16 Architecture Summary

This chapter has presented a detailed study of the Kea architecture, with particular emphasis on the support for services. The design of services allows for the dynamic reconfiguration of the system structure through service migration, replacement and interposition. By using domain specific portal remapping, each of these actions can also be performed on an application specific basis, enhancing the system's flexibility. It has been demonstrated with microbenchmarks that Kea's IDC performance, vital for the support of these features, is comparable with, or better than, the IPC mechanisms of similar systems.
The critical design techniques necessary for building a reconfigurable and extensible system that have been described are:

• Structuring the kernel as a collection of services. Services are the fundamental entities around which reconfiguration and extensions are constructed.
• Low level kernel support for address spaces, threads and asynchrony, which is necessary for the implementation of services. The interfaces for these services are the base ones upon which all other services are built.
• Procedural IPC, implemented using inter-domain calls, which transfer a thread's flow of control directly between address spaces. IDCs provide for efficient cross-domain data copies, simplify stub generation, and make service co-location through direct procedure calls possible (as the system paradigm supports the processor's natural call mechanism).
• Portals, used by the kernel to represent an IDC entry point. By recording portal addresses and imposing a layer of indirection in the portal invocation mechanism, the portal remapping mechanism can be used to arbitrarily redirect portal invocations, on both a global and an application specific basis, independently of the application using the portal.
• The stub structure and function pointers, which make service co-location simple. Both stub and portal invocation mechanisms provide for reference counting on service usage, enabling safe migration.
• Representing services as object files, which lets them be easily linked into any domain.
• User/kernel unification, which provides significant software engineering advantages, as total system complexity (as seen by the programmer) is reduced, and which also enables services to be run transparently in either kernel or user address spaces.

These features operate together synergistically, each reinforcing the others, and provide a system that can be reconfigured and extended in a number of ways.
The following chapters introduce some high level services built on the system, and then show how these, combined with the service features described in this chapter, can be used to provide development, administration and performance advantages.

CHAPTER 4

High Level Services

The previous chapter described the design and implementation of the lower level services and capabilities of the Kea operating system. This chapter examines the higher level services that comprise the bulk of the system's user visible functionality. These services primarily support file systems and networking. Most of the sections deal with the design of the individual services, while the remainder describe how these services are composed into functional units. Unless either the operation of a service is significantly different in some way from the equivalent normally found in a standard operating system, or some facet of its operation is important to the experiments described in subsequent chapters, its design will not be discussed in any detail. The key point of this chapter is that it is possible to develop these services independently and then compose them into a functional whole - there are certainly many other possible combinations and decompositions available, and it is not claimed that the ones presented are the best possible. Some of the interfaces for these services are detailed in Appendix B.

4.1 Disk Driver Service

As the primary platform to which Kea is targeted is the IBM PC compatible architecture, one of the first services developed (other than for simple testing) was a service to allow reading and writing of IDE disk drives. This service is important for two reasons. Firstly, it forms the base layer for various file system services and secondly, it illustrates the development of a Kea device driver.
Device drivers in Kea are especially interesting as they can be developed entirely independently of the kernel - there is no inherent difference between a device driver service and any other form of service. Each can be run in an arbitrary address space, makes use of the same programming interfaces, and can be transparently relocated. This can be contrasted with other systems, in which device drivers require special interfaces, only available to certain privileged processes, or must be developed and executed exclusively in the kernel environment. Full details on the IDE interface can be found in Appendix B.1.

4.2 Buffer Cache Service

The buffer cache ("bcache") service provides buffering of disk data, typically for filesystems. The buffer cache service controls the reading of blocks from physical disks, and indexes each block under a client supplied file identifier, as well as the physical device location. Blocks can be assigned priorities, with separate LRU lists for each priority level. This allows clients of the buffer cache service to control the order of block recycling. By interposing services upon the buffer cache service, applications can also control the caching of their blocks (this is described further in Section 6.2.2). Full details on the bcache interface can be found in Appendix B.2.

4.3 Filesystem Services

The current Kea system includes three filesystems. The first implements an MS-DOS FAT filesystem, the second a BSD fast filesystem (FFS) [McKusick 82] and the third a memory (as opposed to disk) based filesystem. Each of these file systems implements a simple file interface (detailed in Appendix B.3) that is very similar to the POSIX standard [IEEE 90]. The FAT and memory based filesystems are fully functional, while the FFS filesystem does not currently have support for file creation or extension (i.e. writes cannot occur past the end of a file) or for opening directories.
While these restrictions limit the utility of the filesystem, they are not significant for assessing the reconfiguration capabilities of the system as a whole, which was the primary purpose of its development.

4.4 File and Mount Service

Each of the filesystems can be used as a stand-alone system, and when so configured, uses file names relative to the root of that filesystem. To be of use to the majority of applications however, there needs to be some means by which different file systems can be coalesced into a single namespace. This is accomplished by two more services, mount and file.

4.4.1 The Mount Service

The mount service simply associates a set of name prefixes with a service identifier. It provides a procedure that, given a fully specified file name, returns the service identifier of the file system that handles that particular file, together with the length of the prefix. The mount interface is shown in Appendix B.4.

4.4.2 The File Service

The file service effectively groups all of the other file services into a homogeneous whole, by multiplexing each under a single namespace. When a file is opened, it uses the mount service to determine the relationship between files and underlying services. If it has not already acquired the appropriate service, it does so. The mount point prefix is stripped from the file name, which is then passed to the underlying service. Subsequent operations pass directly through the service.

4.5 File Service Composition

The IDE, bcache, mount and file services can be composed to provide a fully functional file service to clients. The typical composition of these services is shown in Figure 4.1.
[Figure 4.1: File Service Composition - applications invoke the global file service (the file and mount services); beneath it sit the filesystems (FAT, FFS, RAM), which share the buffer cache (bcache); the driver services (IDE, SCSI) access the disk drives.]

Because the set of file services are composed together in a reasonably complex hierarchy, they are ideal for experimentation with system reconfiguration. Chapter 5 describes a number of such experiments. Chapter 6 uses them in order to investigate a number of application-specific system extensions.

4.6 Compressed Filesystem Service

One other file system service has been developed, the compressed file system (CFS). This is a service designed to be interposed on another filesystem. As such, it relies on the underlying filesystem for storage. As the name implies, the compressed file system compresses (and uncompresses) file contents. This operation saves disk space, and can decrease the time required for file operations, trading CPU time against disk accesses. The compressed filesystem stores files in one of two ways. Where the file is written consecutively, it manages the file as a sequence of compressed blocks, each block preceded by a descriptor giving its length, both compressed and uncompressed. Whenever a write to anywhere but the end of a file occurs, the filesystem falls back to the standard file viewpoint, and only compresses the file when the file is closed. The compressed file system is used in experiments in Chapters 5 and 6.

4.7 Networking Services

The second major group of services in Kea support networking. A simple ethernet service allows its clients to send and receive ethernet packets. The major client of this service is a port of the x-kernel [Hutchinson & Peterson 88, Peterson et al. 90], which provides a complete set of network protocols. The x-kernel version of Kea contains a complete implementation of Berkeley sockets [McKusick et al. 96], through which user level clients access the network services. As the x-kernel is implemented as a single service, and relies only on the ethernet service, there is limited scope for reconfigurability experimentation, although some experiments are described in Chapter 5. Several proposed projects involve the decomposition of the x-kernel into a number of independent protocol services, which is more in keeping with the Kea philosophy. These proposals are discussed further in Section 7.2.1.

4.8 Other Services

Several other services have been developed that add to the functionality of the system. Because these services perform only minor operations, or were not used in the experiments described in subsequent chapters, they will only be described briefly.

4.8.1 Syslog Service

The syslog service performs approximately the same task as the Unix daemon of the same name. It provides an interface through which services can register themselves, and then send messages of various priority levels to be printed on the console. Almost all of the other services described use the syslog service for reporting various configuration and error messages.

4.8.2 Unique Identifier Service

Originally each filesystem maintained its own set of identifiers: the buffer cache combined the filesystem domain and file identifier in order to look up file blocks, while the global file service remapped each file service identifier into a globally unique identifier for client applications. However, when application-specific filesystem extensions were added, it was realised that having different file identifiers used at different layers of the file system hierarchy made the development of extensions far more difficult. To solve this problem, a service was developed that generates unique identifiers. When opening a file, the low-level filesystems use this service to obtain a file identifier, which is then used by all other services.
This had the pleasant side effect of making both the buffer cache and global file services slightly simpler.

4.8.3 Console Service

A service that is able to write to the console is provided in Kea. It typically maps the physical memory block used by the hardware for the screen, and directly inserts characters into this space. The only clients that use this service directly are the standard I/O routines in the C library.

4.8.4 Keyboard Service

Kea includes a simple service that enables its clients to receive keyboard events. It registers itself for the keyboard interrupt, processing each interrupt to generate characters for its clients. Like the console service, the only clients that use this service directly are the standard I/O routines in the C library.

4.9 High Level Service Summary

A summary of the size of the most important services is shown in Table 4.1. For each service, the table shows the number of lines in the service (measured with "wc -l"), and the size of each of the service's compiled text and data segments in Kb. All the services, with the exceptions of the ethernet and x-kernel, include migration support.

Service          Line count  Text Size  Data Size  Total Size
IDE                    1440        5.7        0.1         5.8
bcache                  583        2.4        2.2         4.6
FAT filesystem         3533       14.9        0.5        15.4
FFS filesystem         1196        4.3        0.0         4.3
mount                   210        1.0        0.0         1.0
file                    280        1.2        0.1         1.3
compressed fs          2761        6.5      193.1       199.6
ethernet               1115        4.8        0.0         4.8
x-kernel              26739      178.1       48.8       226.9
Total                 37857      218.9      244.8       463.7

Table 4.1: Service Size Summary (sizes in Kb)

This chapter has briefly described each of the high level services currently developed for Kea. For the purposes of demonstrating the thesis statement, it is not important how these services are built, only that they can be built and composed together to make a complete system. Once the system is composed of a collection of services, various experiments into their efficiency, reconfigurability and extensibility can be undertaken.
Such experiments and their results are shown in Chapters 5 and 6.

CHAPTER 5: Performance and Reconfigurability

This chapter describes the results of several experiments which were performed to evaluate the performance of the system. These experiments were designed to measure the performance of the system as a whole (i.e. the aggregate system throughput, using the high level services described in Chapter 4) as well as that of the reconfiguration primitives, particularly global service migration, replacement and interposition (application-specific reconfiguration is investigated in Chapter 6). In particular, the experimental results demonstrate the following points:

• That services can effectively be run in either user or kernel address spaces.
• That the performance of the system is comparable to other, standard operating systems.
• That services can be dynamically and transparently migrated between address spaces.
• That service migration is efficient.
• That services can be efficiently interposed.

5.1 Experimental Overview

The majority of the experiments described use the filesystem hierarchy described in Chapter 4 and illustrated in Figure 4.1. These services (ide, bcache, FFS/FAT, file and mount) provide a sufficient base to evaluate the performance of structuring a system as a collection of such services, and more importantly, allow an evaluation of how the system performance can be changed by system reconfiguration. In addition, the service stack allows for experimentation with service interposition, for which the compressed file system service is used. The experiments can be roughly categorised into four areas:

• System performance with various service configurations. By configuring services into different address spaces, the performance/protection trade-off can be evaluated. These experiments also demonstrate that services can be run in arbitrary address spaces.
• Performance comparisons, Kea vs. FreeBSD. This allows a comparison of a system configured as a set of co-operating services with that of a monolithic system.
• Performance of service migration. Measuring the overhead of service migration provides a measure of its utility.
• Service interposition. By interposing services, and measuring either (or both of) the changes in performance or functionality provided, an assessment of the utility of system extensibility can be determined.

5.1.1 Experimental Hardware

Each of the experiments was performed on the same machine, a 100 MHz Pentium, with 64 Mb RAM, 256 Kb L2 cache and a Western Digital Caviar 2850 EIDE disk drive (the properties of this drive are shown in Table 5.1). Where applicable, Kea's performance was compared with that of FreeBSD (version 2.2-RELEASE), running on the same machine.

cache size       64 Kb
spindle speed    4,500 RPM
average seek     11 ms

Table 5.1: WD2850 Parameters

5.2 FFS Performance

As stated, each of the Kea services can be developed and loaded into a user address space. These services can also be loaded into the kernel without changing any part of the service, either at the source level or in the final compiled object. Reconfiguring the system by running these services (that are normally trusted parts of the kernel in other systems) within the kernel should result in performance benefits from shorter IDC times (for user/kernel, as opposed to user/user control transfers), and also from the optimisation of some IDCs to procedure calls due to co-location. Performance should be further enhanced by far fewer cache and TLB misses, due to a reduced number of address space crossings. The measured performance must also be comparable to other systems in order to demonstrate that there is no significant performance impact attributable to the decomposed nature of the design.
5.2.1 Reconfiguration and Performance

To verify that reconfiguring the system to run services in the kernel increases performance, we measured the time (in microseconds) required for basic file operations in the FFS filesystem hierarchy with a number of permutations in the location of its services. The results are shown in Table 5.2. The table shows the time required for each of the file open, read, write and close operations. The read and write operations used a block size of 8 Kb. Each of the experiments was performed with the system in one of two states - "cold", when the system had just been booted, and "warm", immediately after the "cold" measurements were recorded. Making the measurements in each state enables some estimate to be made of the effects of warm caches and cached buffer blocks. The operations were timed with the system in a variety of different configurations. The left-most column shows the times with each service in a separate domain, while successive columns show the impact of co-locating increasing numbers of the services into the kernel, with the penultimate column showing all services in the kernel [1]. The final column shows the times with all the services co-located into a single domain, roughly equivalent to the "single-server" model used with many microkernels. The "cold" times shown were heavily variable, due to differences in disk rotational position - depending on the disk head position when the trial began, successive I/O operations can vary by over 13 ms (the rotational latency of the disk). This was especially obvious for the open times, where three disk operations are required (the root inode, a directory block and the file inode). To remove this influence, the times shown were determined by measuring each of the operations five times, discarding the lowest and highest times, and taking the average of the remaining three. Also, for each operation, the actual I/O time [2] was measured for each trial.
This was then averaged over all the trials for the same operation, and the times normalised by using this average instead of the measured I/O time in each case. This provides an accurate indication of operation times, independent of variations in disk seek time and rotational delay.

         Operation   Separate  IDE in  +bcache    +FFS in  All in  Single
                     domains   kernel  in kernel  kernel   kernel  domain
"cold"   Open          31650    31573     31110    30641    30103   30561
         Read 8K       20960    19882     18136    17693    17288   17256
         Write 8K       1745     1562      1369      929      471     852
         Close           279      273       267      197       63     120
"warm"   Open            617      609       361      252      124     176
         Read 8K         791      782       734      572      362     424
         Write 8K        781      756       723      534      352     394
         Close           115      112       112       83       53      70

Table 5.2: FFS Filesystem Operation Times (µs)

[1] The mount and file services are moved together into the kernel. The mount service is only used on open, and was deemed unimportant enough that doing this does not affect the points that the results illustrate. The mount and file services are loaded in independent domains in the other result columns.
[2] This is the time from the initiation of the disk controller to the reception of the final disk interrupt.

There are several important observations that can be made from the results shown in Table 5.2. Firstly, it can be observed that moving services into the kernel can have significant performance advantages (e.g. the decreases in CPU time for the "warm" read/write cases are over 50%, and more than five-fold for open, which requires a larger number of operations utilising all services). It is also important to note, however, that even in the case where each service is run in a separate address space, the performance is still acceptable, which implies that it is practical to consider developing systems as groups of separate services, and then combining these at a later time.
This combination will almost certainly occur, due to a desire for enhanced performance, despite the increased protection implied by running services in separate domains. What is important is that developers and administrators be given the choice, and the tools with which to make this choice.

5.2.2 FreeBSD Comparison

To prove the thesis statement, the system's performance must meet that of a traditional system, at least when the services are co-located in the kernel. Table 5.3 shows the comparison between FreeBSD and Kea for the FFS file operations. In order to make the results comparable, the FreeBSD operations have had their disk I/O times normalised relative to those of Kea. The table shows that, except for open, the times are almost equivalent. In the "warm" case, Kea reads take more time, but this is partially compensated for by faster write times. The "cold" open and write times appear to be major anomalies. The former is actually a result of FreeBSD having cached the root inode and directory blocks when the filesystem is mounted. Where Kea must do three disk operations to open the file, FreeBSD does none, resulting in a far faster time. The "warm" open times are therefore a better indication of the systems' performance, and show that the systems are more equally matched, although FreeBSD is still more efficient. This may be partially due to the caching of name/inode translations in the FreeBSD filesystem, an operation which the Kea version does not yet support. The "cold" write time is due to the Kea buffer cache service optimising away disk I/O when the block to be written is the same size as that requested by the filesystem. In general, the results are very promising, showing that the Kea times, with a relatively simple implementation of the filesystem hierarchy, and no tuning or optimisations made to any part of that hierarchy, can come close to those of FreeBSD, which has been tuned over a number of years.
         Operation   FreeBSD     Kea
"cold"   Open            202   30103
         Read 8K        17388   17288
         Write 8K        7801     471
         Close             55      63
"warm"   Open              90     124
         Read 8K          327     362
         Write 8K         354     352
         Close             54      53

Table 5.3: FreeBSD vs. Kea FFS Filesystem Operation Times (µs)

A second comparison that can be made between FreeBSD and Kea is that of aggregate system throughput. This is determined by measuring the times required to read or write 1 Mb of data, using both 1 Kb and 8 Kb data blocks. The results are shown in Table 5.4. There are a number of interesting observations that can be made based on these results. For the "cold" numbers, the read times are very close. For writes however, FreeBSD is much slower with a 1 Kb block size, and much faster with an 8 Kb block size. There are two factors contributing to this result. Firstly, the Kea filesystem always performs disk I/O based on the natural disk block size used by the file (8 Kb for the file in question), rather than on the block size used in the read or write operation. It appears that the FreeBSD filesystem does the opposite, resulting in many more disk operations for the smaller block size, and consequently a much slower overall time. The second factor is that for writes, the Kea filesystem always reads the block of disk first, even when it is going to be totally overwritten by the write operation. The FreeBSD filesystem foregoes the read when it realises the block will be totally overwritten, resulting in far faster times for an 8 Kb block size.

         Operation,    Time (s)
         block size   FreeBSD     Kea
"cold"   Read, 1K       0.497   0.447
         Write, 1K      1.099   0.388
         Read, 8K       0.459   0.446
         Write, 8K      0.110   0.388
"warm"   Read, 1K       0.038   0.050
         Write, 1K      0.110   0.045
         Read, 8K       0.029   0.043
         Write, 8K      0.039   0.036

Table 5.4: FFS Aggregate Read/Write Performance

The "warm" times show that, in general, FreeBSD is slightly faster than Kea. In the "Write, 1K" category however, Kea is over twice as fast.
This is again attributable to the handling of smaller blocks in the FreeBSD filesystem not being as efficient as Kea's. Overall, the FreeBSD/Kea comparisons show that the design of the components, and the strategies they use (e.g. block size choices), is probably more important than how they are structured within the final system. While the Kea approach may involve slightly more overhead due to an increased number of cross-component calls (approximately 0.2 µs per call/layer in the architecture under test), this is trivial compared to the cost of a single I/O operation. Even when the two systems are performing identical tasks, the individual variances in design decisions make it very difficult to decide how much of any observed difference is due to the systems' structure, as opposed to the individual service designs. The single most important observation is that it is possible to build a system such as Kea that performs favourably when compared to a monolithic kernel.

5.3 FAT Performance

To confirm the results discussed in the previous section, the experiments were repeated for the FAT filesystem. The results for the system reconfiguration are shown in Table 5.5. The numbers produced are comparable to the FFS results and show exactly the same general trends.

         Operation   Separate  IDE in  +bcache    +FAT in  All in  Single
                     domains   kernel  in kernel  kernel   kernel  domain
"cold"   Open           1294     1297      1292     1053     305     630
         Read 8K       26546    25671     25082    25007   24750   25291
         Write 8K       4164     4081      3514     3228    2751    3077
         Close           425      426       321      187     173     190
"warm"   Open            320      309       292      244     143     166
         Read 8K         727      727       708      644     368     426
         Write 8K        728      728       623      577     344     403
         Close           179      174       150       96      73      99

Table 5.5: FAT Filesystem Operation Times (µs)

Comparing the FAT filesystem's performance to FreeBSD (Table 5.6) is more interesting. When "cold", FreeBSD takes substantially longer for the open and read operations, and less for the write.
The open time is a reversal of the FFS case, as the Kea filesystem reads and verifies the root directory of the filesystem when it initialises itself, as opposed to the FreeBSD filesystem, which only verifies the superblock. The FreeBSD FAT filesystem also performs several disk reads in order to translate the file name, resulting in a longer open time. The read/write times are explained by examining the FreeBSD I/O pattern. For the initial read, it reads ahead on the disk, getting more blocks into the cache (even though they may be overwritten in a subsequent write), slowing the initial read time slightly, at the cost of greater performance for future operations. The "warm" times are generally similar, with FreeBSD having a slight edge in most operations, except for write, where it appears to be substantially slower.

         Operation   FreeBSD     Kea
"cold"   Open          27733     305
         Read 8K       28124   24750
         Write 8K        518    2751
         Close           331     173
"warm"   Open            132     143
         Read 8K         317     368
         Write 8K        483     344
         Close            55      73

Table 5.6: FreeBSD vs. Kea FAT Filesystem Operation Times (µs)

These results reinforce the earlier observation that the design decisions made when building filesystems are of more importance to the overall system throughput than the architecture used to combine system services. One puzzling result not shown in the table concerns the FreeBSD "warm" close time. With the system in single user mode, this operation would take either 55 µs (as shown) or about 600 µs, with the latter being the most frequent by a factor of approximately five. In multi-user mode, the time was consistently 55 or 56 µs. Unfortunately, without detailed traces of system behaviour, it was impossible to determine exactly where the single-user delay was coming from. The aggregate throughput (reading and writing 1 Mb of file data) for both FAT filesystems is shown in Table 5.7. Once again, times are generally comparable, with two exceptions.
For small blocks, with cold caches, the FreeBSD filesystem is over five times slower - it does disk operations in the smallest possible unit (1 Kb), rather than the natural filesystem block size (8 Kb) used by the Kea system, resulting in a huge I/O overhead (this also results in slower times in the "warm" case, although not to the same marked degree). The second exception is the "warm" read time, which is twice as fast on FreeBSD. This is due to the FreeBSD filesystem detecting consecutive reads to the same disk location. In this case the filesystem performs far larger disk reads. The savings observed are apparently due to the need for less buffer management. This does not affect the time for the "cold" reads overly much, as the same number of "real" I/O operations (i.e. those that go to the disk) still need to be performed. It would be entirely possible to implement this same optimisation in the Kea filesystem, although this has not been done to date.

         Operation,    Time (s)
         block size   FreeBSD     Kea
"cold"   Read, 1K       0.565   0.583
         Write, 1K      3.457   0.625
         Read, 8K       0.578   0.570
         Write, 8K      0.625   0.625
"warm"   Read, 1K       0.037   0.045
         Write, 1K      0.051   0.038
         Read, 8K       0.022   0.044
         Write, 8K      0.036   0.037

Table 5.7: FAT Aggregate Read/Write Performance

5.4 Service Migration and Replacement Costs

The next set of experiments was performed to demonstrate that services can be efficiently migrated between user and kernel spaces.
The time taken to migrate a service depends on several factors - the amount of executable code comprising the service, the time needed to link this code into the new domain, the amount of service state to be transferred, the time taken by the service's initialisation function to execute, the amount of portal remapping that has to be done due to the change in address spaces, and the cost of modifying any clients of that service in the destination domain, in order to use direct procedure calls rather than portal invocations (or vice versa, in the case of clients in the original domain). To assess the times attributable to each component of the migration process, several measurements were made on each of the principal filesystem services. The results for the ide and bcache services are shown in Table 5.8. These services are shown first, as the state to be transferred does not depend on the number of open files (other services had more extensive experiments performed, with correspondingly more complex results). The first column shows the service name, the second the size of the service text, the third and fourth the copy and link times for that text, the fifth and sixth the time to save and restore the state, while the final column shows the total time taken. In the case of the IDE service, the "restore state" column has two numbers. The first is the actual time for the restore operation, while the second is the latency introduced by a disk I/O, done to read the partition table off the disk. The major result shown is that the link time is the dominant component of the migration cost. This is generally true for all services, and making linking more efficient is a goal for future research. In the case of the IDE service, it is possible to remove the disk I/O, at the cost of increasing the complexity of the save/restore state functions.
If the migration time for this service proved to be a problem - which it has not been to date - then this could be decreased through this modification.

service   size (Kb)  copy time  link time  save state  restore state  total time
ide          10.6       0.2       13.2        1.2        1.2+31.4        47.7
bcache        6.0       0.2        9.3        1.4        1.2             12.4

Table 5.8: ide and bcache Migration Times (ms)

One important point to note about the bcache service is that the state transferred does not include all of the disk blocks in the cache - only those that are currently referenced by a filesystem are copied over. This is typically an extremely small number, as a filesystem only holds a reference for the short time needed to copy data from the block to a user buffer. Copying all the cached blocks is not practical, as temporarily at least, twice the amount of memory in the cache would be needed, since the state buffer must be allocated before the cache blocks could be copied to it. The second set of migration results - those for the filesystem services - are shown in Table 5.9. This table shows the same general results as for the ide and bcache services, with the addition of information describing the overhead required for state transfer with varying numbers of open files. The table shows that while the cost of state acquisition and transfer does rise proportionally to the number of open files, it does not, compared to the other times, impose much of an overhead. As for the other services, the major overhead is the link time, and any disk I/O required (the FAT filesystem reads root directories).
                           FFS       FAT        file
size (Kb)                  4.3      23.5        2.9
copy time                  0.1       0.4        0.1
link time                 20.1      48.8       10.3
0 open:    save state      1.3       1.4        1.2
           restore         1.4       1.4+71.2   1.2
           total          23.2     125.7       13.2
1 open:    save state      1.3       1.4        1.3
           restore         1.4       1.4+71.2   1.4
           total          23.2     125.7       13.5
10 open:   save state      1.5       1.6        1.2
           restore         1.7       1.5+71.2   1.5
           total          23.8     126.0       13.6
100 open:  save state      3.5       3.2        1.4
           restore         3.6       3.3+71.2   2.0
           total          27.7     129.3       14.3

Table 5.9: File Service Migration Times (ms)

Both Tables 5.8 and 5.9 show that the various components of the time required to migrate a service can vary substantially between services. While the copy time is small for each (and linear in the size of the service, as would be expected), the relative times required to link the service can be highly variable, depending on the number of external library modules needing to be loaded and the amount of text and data relocation required. The most important conclusion that can be made is that migrating services can be done quickly enough that there should be little or no effect on the time perceived by a user for a service operation, although more data on larger services would be desirable. It is important to point out that the service is only unavailable to clients while the state transfer and initialisation are carried out - the copy and link phases are done before service access is blocked. This means that the perceived migration time for service clients is much less than the total migration time: e.g. for the FAT filesystem, the perceived migration time is between 73 and 77 ms, which is effectively only the cost of a small number of disk operations. The time taken to replace a service is also the same as the time needed to migrate that service, as exactly the same set of actions have to be carried out for this operation.
The only possible variations are in the size of the code (which, if the service is being replaced for the purpose of bug fixes, should be small), and possibly in the creation of a new domain for the service. In the first case, the size of the code has very little influence on the cost of replacement, only affecting the copy time. There may be some effect on the link time, although for most services this should be negligible. In the case where a domain has to be created, this also has no effect on the time the service is unavailable, as this operation is performed before portal invocations are blocked from entering the service. It can therefore be concluded that service replacement is also a viable operation.

5.5 Service Interposition

To demonstrate and test global service interposition, some of the measurements described in Section 5.2 (FFS filesystem performance) were repeated using the compressed file system. The results are shown in Table 5.10. The four columns of this table show the operation performed, the cases where the compressed file system is interposed above and below the global file service, and, for comparison, the original results (from Table 5.2). The results shown are for all services co-located in the kernel - other service configurations have been measured, but they show the same pattern as the results shown. The results from the compressed file system experiments show that opening a file for the first time is very expensive, while the subsequent operation (read) is much cheaper.
This is because the compressed file system preprocesses parts of the file when it is opened, requiring expensive disk I/O, while the subsequent read finds that the information required is already in the buffer cache [1].

         Operation   above file  below file  original
"cold"   Open            35674       35654     30103
         Read 8K          1409        1411     17288
         Write 8K         7865        7870       471
         Close             126         126        63
"warm"   Open              142         141       124
         Read 8K          1258        1259       362
         Write 8K         2963        2967       352
         Close              94          92        53

Table 5.10: Compressed File System Operation Times (µs)

The "warm" times show that each of the open, read and write operations is much more expensive, due to the extra processing and CPU time involved in compression, but the write times should be offset by the (eventual) reduced disk I/O times when the blocks are written to disk. This is a common feature of compressed file systems, where the CPU time for compression must be offset against the frequency of I/O operations that need to go to disk, and the requirement for less disk space usage. This is important, as for some file systems or applications (particularly text-based), compression may be beneficial, while for others it might result in lower performance. By enabling the interposition of the compression service in different configurations, the system administrator or application can decide on the most effective placement. The trade-off depends entirely on the balance and type of disk activity experienced by the file system. To demonstrate this, the test where 1 Mb of data is written to a file in 8 Kb blocks [2] was repeated using the compressed file system. The results are shown in Table 5.11, together with the original results (from Table 5.4). The results show that, for a cold cache, reading is faster, due to the reduced number of I/O operations required, while every other operation is more expensive, due to the extra CPU overhead of compression. The algorithm used is also far more efficient at decompression than compression, resulting in elevated times for write operations, even for a cold buffer cache.

[1] The open time shown should be higher (closer to the sum of the open and read times for the original case). The reason it is not is because a different test file is used, with different disk block allocations (and hence seek times). This is one more reason why such comparisons should be examined carefully.
[2] The first 8 Kb of text from this chapter were used for all tests in this section. This compressed to 4.7 Kb, equivalent to a 41% compression rate. Blocks are compressed independently.

         Operation     cfs   Original
"cold"   Read        0.367      0.446
         Write       0.563      0.388
"warm"   Read        0.162      0.043
         Write       0.377      0.036

Table 5.11: Aggregate cfs Read/Write Performance (s)

5.6 Conclusions on Reconfiguration Performance

This chapter has presented the results from a number of experiments designed to measure the performance of the system in a number of areas related to system reconfiguration. Using a filesystem service hierarchy, it has been shown that co-location of services through service migration can result in greatly improved performance, and that Kea's performance is comparable to another, mature, operating system. The absolute overhead for service migration has been shown to be small, and generally dominated by the time required for service initialisation, particularly if those services have to perform time consuming operations such as disk I/O. The final experimental section showed that service interposition can be used at different points in the service hierarchy, and that this can be done transparently to the other services involved.
In general, the results given in this chapter have demonstrated that global system reconfiguration using service migration, replacement and interposition is not only possible, but able to be accomplished with reasonable performance. It has been shown that a reconfigurable system can be implemented that does not impose an undue performance penalty. Possibly the most important conclusion is that the system architecture seems to play less of a role in determining the ultimate system performance than the design of the individual components making up the system.

CHAPTER 6

Application Specificity Evaluation

The previous chapter evaluated the performance of global reconfiguration in Kea. This chapter describes some studies made to evaluate the usefulness of the system's application-specific features. As in the previous chapter, the primary focus is on the filesystem hierarchy, but with an emphasis on techniques for increasing the performance of individual applications.

6.1 In-Kernel Applications

Because Kea has a unified kernel/user programming environment, with facilities for migrating code between these environments, it is ideal for the development of in-kernel applications. An in-kernel application is defined as a service that performs actions that are normally viewed as the province of an "ordinary" application, but that, when executed within the kernel, can realise a substantial performance gain. The domain of such applications is normally restricted to applications that carry out trusted tasks, and for which performance is very important. The best examples of such applications are those that are substantially restricted by the I/O bandwidth of the machine, especially those that offer file services. Particular examples might include Web servers, multimedia file servers and "network appliance" type machines.
Each of these has specialised demands, in that they must respond to client requests quickly (often within a fixed time limit), and may need to specialise the caching behaviour of the system. Currently, only very few systems (notably Rialto and SPIN) offer this functionality. Rialto has been used for the implementation of a commercial video file server [Bolosky et al. 96], while SPIN has a small in-kernel http server [Bershad et al. 95]. Both of these applications have demonstrated that moving applications into the kernel is both possible and desirable. Rialto, like Kea, also has the added advantage of being able to first implement the application at user level, easing the development process. Although there are not yet any large examples of in-kernel applications for Kea, several micro-benchmarks have been implemented which demonstrate the system's utility in this area. One of these is shown in Table 6.1. The test was to write a server that, given a file name, can open, read, and close that file. This simulates a typical task performed by most file servers. In particular, the file used in the test was a relatively short text file[1] (2702 bytes), as would be typical for many of the files read by an http server. The test was run with the service in-kernel (where it directly accesses the FFS service, instead of the file service) and out of kernel, for both the warm and cold buffer cache cases[2]. The results show the times with the disk I/O factored out.

          Location      Time (µs)
"cold"    In-Kernel        344
          Out-Kernel       537
"warm"    In-Kernel        252
          Out-Kernel       419

Table 6.1: In-Kernel vs. Out-Kernel File Times

[1] The BSD /etc/inetd.conf file.
[2] The "cold" times were measured with the cache "warmed" with the directory blocks. This was done to ensure that only the time relevant to file access was measured.

The results clearly show that moving an application into the kernel can have significant performance benefits.
The "cold" time shows a 36% reduction, while that for the "warm" time is 40%. If these results scale to other aspects of system behaviour, such as network access, then it will be possible to realise substantially greater throughput than conventional servers running in user mode.

6.2 Application-Specific Extensions

Kea also allows applications to interpose services for their own use. This allows applications to specialise the system in order to increase either their own performance or the total system throughput. One example of a service that can be interposed on an application-specific basis is the compressed file service discussed in Chapter 5. By interposing this on the standard file service, any application can arrange to have the files used by that application transparently compressed. While this service does illustrate one example of application-specific remapping, it is primarily a service that is used to reduce the storage required by an application, as opposed to increasing its performance. To experiment with this facet of application specificity, several other services were developed. These services were designed to be used at various levels of the file system hierarchy, and illustrate some of the trade-offs involved in supporting application-specific service interposition. These services are discussed in the following sections.

6.2.1 The Read-Ahead Service

A common file access pattern is that of sequential reads. Many filesystems (including the FreeBSD filesystems used for comparisons in Chapter 5) detect this access pattern, and arrange to read ahead in the file. This saves time when the next read request is received, as either the disk I/O will be at least partially complete, or the requested block will already be in the buffer cache.
For applications that perform sequential file accesses, with some processing time required after each read, it might therefore be possible for them to gain increased performance from the Kea filesystem by allowing them to read ahead. To accomplish this, a service that can be interposed on any other file service was written. For every disk read operation, this service issues another read immediately following the requested one. This read is carried out in a thread specific to the read-ahead service, and completes asynchronously with respect to its client. To measure the effectiveness of the read-ahead service, an application that sequentially reads 1 Mb of data in 8 Kb blocks was developed. The time required for each read was measured, and combined to get an aggregate read time. The application was also configured with a CPU-bound loop after each read in order to simulate the per-block processing overhead present in any real application. Just running the application by itself is not sufficient to demonstrate the applicability of the read-ahead service, as the underlying filesystem (FFS) lays out files on contiguous disk blocks. In the absence of other I/O operations, this results in very small seek times, and total transfer times per block of under 3 ms. To mitigate this factor, and ensure that blocks were driven out of the disk controller's cache, a second application was introduced. This application continuously made 8 Kb reads from random locations within the disk partition, introducing a constant background "noise" into the disk (as would be expected in a typical timesharing system). The test (reading) application was then run[1], with different loop delays, from 0 to 100 ms. The aggregate I/O time for the application, as plotted against the delay, is shown in Figure 6.1. The curve shown in the figure has several interesting properties and requires some careful analysis.
The most important aspect is that for the first 17 ms of CPU delay, the time required to complete the I/Os decreases as a linear, 1:1 function - that is, for every millisecond of delay, the average I/O completes a millisecond faster, or 128 milliseconds in aggregate over all the I/Os. This is the key feature of the curve, particularly as 17 ms is far longer than would normally be expected for the processing of a single 8 Kb data block. Past 17 ms, the curve exhibits several distinct peaks and troughs, before finally converging to a fraction of a second (where all reads are satisfied by the buffer cache). The oscillations in the curve are caused by a complex set of interactions between the three active threads in the system (reader thread, read-ahead thread and the background thread), the average times they take for an I/O operation (about 17 ms for the read-ahead thread, and 20 ms for the background thread, due to its greater range of seek head movement) and the scheduler quantum (40 ms). Detailing the exact relationships between these factors that result in the curve shown is beyond the scope of the thesis - the important point is that read-ahead can result in an apparent decrease in read latency (as seen by the application installing the service).

[1] In the test described, the read-ahead service was configured as an application (user) level service, with all other parts of the filesystem hierarchy configured into the kernel.

Figure 6.1: Aggregate I/O Time vs.
Delay

6.2.2 Buffer Cache Management

Although the read-ahead service allows an application with specific behaviour properties to specialise the system in order to obtain better performance, it is still a service that is interposed at a relatively high level in the filesystem hierarchy, and does not truly demonstrate that it is possible to efficiently manipulate lower-level services in an application-specific manner. To demonstrate this feature of the Kea design, the buffer cache ("bcache") service was used. It has been shown that giving an application control over the replacement of blocks for files it owns in the buffer cache allows that application to increase its performance on several tasks [Lee et al. 94]. Instead of repeating one of these previously developed tests, a new type of service was developed to illustrate a novel viewpoint - that of not increasing a specific application's performance, but allowing that application to specialise the system in order to minimise its resource requirements, and ensure greater performance for other users of the system. This is accomplished by the "throw-away" service.

The "throw-away" Service

The majority of applications in any computer system read or write files sequentially. A reasonable proportion of these files are only used once, and are then not used again for a long (in computer terms) period of time. Examples of such applications are tar, cpio and compress, each of which is often used to manipulate large archive files. When these applications are used, a typical system will read (or write) the appropriate file, with the result that blocks from that file will be entered into the buffer cache. Unfortunately, the file is usually not manipulated again for some time, and these blocks force other, potentially more useful, blocks out of the cache.
The "throw-away" service is designed to prevent this. It is interposed on the buffer cache service, and modifies buffer cache requests in order to arrange that the blocks requested be thrown out of the cache before other blocks that are already resident. This is accomplished very simply by modifying the "hint" parameter to each of the buffer cache[1] request calls. The "hint" parameter determines how likely the given block is to be reused. For each hint value, the buffer cache maintains a list in LRU order. The filesystems developed so far use either HINT_KEEP (the block is likely to be reused) or HINT_DELETE (the block is to be deleted from the cache). The "throw-away" service merely changes every HINT_KEEP value to HINT_DISCARD (remove the block from the cache sooner than any other block). Note that this does not result in the block being immediately removed from the cache; it is instead placed at the end of the LRU list that is first considered for replacement when the cache has used up the available physical memory reserves.

[1] The buffer cache interface is described in Section 4.2 and Appendix B.2.

To test the efficacy of the throw-away service, the memory available to the buffer cache was first artificially limited to one megabyte. Two applications were then run. The first application did 1000 random 8 Kb reads (that is, a read, followed by a seek to a random file location) within a 1 Mb file, while the second sequentially read a 10 Mb file, also using 8 Kb blocks. The observed (wall clock) time required for each application to complete, and the total number of disk operations performed, were then recorded in two configurations. The first configuration was the standard, unmodified system, while in the second the sequential reader interposed the "throw-away" service above the buffer cache service.
The results are shown in Table 6.2.

Test                     without throw-away    with throw-away
Time (s)     random           13.43                 4.08
             sequential       15.99                 7.53
Disk I/O     random             420                  128
count        sequential        1280                 1280

Table 6.2: Throw-away Service Results

Table 6.2 shows that the total number of disk operations and the perceived read latency for both applications drop substantially when the throw-away service is used, demonstrating that a very simple application-specific modification can have a significant effect on the performance of the system as a whole.

6.3 Conclusions on Application Specificity

This chapter has presented some experiments in which Kea's ability to install application-specific extensions has been tested. While Kea has yet to be applied to the development of fully fledged applications that take advantage of these features, we believe that the test results shown clearly demonstrate Kea's potential in this area. However, there are several areas in which further investigation is needed before the Kea approach to application extensibility can be validated. In particular, there are many circumstances in which it will not be appropriate. The major issues to consider are those of performance, security, and the necessity for an equitable global policy.

6.3.1 Performance Issues

There is one vital performance issue associated with Kea's style of application specificity, which appears when services are co-located. Normally, such services call each other through the co-location stub, which has only marginally more overhead than an ordinary procedure call. However, since portal invocation is the point at which application-specific remapping takes place on an interface, any services which are remapped in this fashion must instead use the portal invocation stub[1]. In the case where the services are kernel co-located, this adds only 0.2 µs to the invocation time for each call, a trivial amount.
In the case where the services are not kernel co-located, the overhead may be several tens of microseconds. While this overhead is relatively small, it is unrealistic to allow application-specific remappings that will not, at least potentially, recoup this loss in performance. This is the case with the examples discussed in this chapter, for two separate reasons. The read-ahead service is designed to be used directly above the existing file service. In this circumstance, the application-specific remapping is on the file service, which resides at the top of the service hierarchy. Because of this location, any clients of the file service will already be using the portal invocation path, essentially negating any overhead due to the remapping. The throw-away service also meets the performance requirement, as its use has the potential to dramatically reduce the number of disk operations performed, each of which is many orders of magnitude more expensive than the remapping overhead. Also, as explained in Section 3.12.4, such services will probably be provided by systems vendors or other trusted entities, and be able to be kernel co-located, which reduces the overhead to a trivial amount.

[1] This process is described in Section 3.10.2.

A second performance impact is that of naive or malicious applications using application-specific services which are not suited to their requirements. Consider the use of the throw-away service by an application that instead does cyclical reads on a single file. In this case, the application could well cause the buffer cache service to continually re-read the disk blocks comprising the file, reducing the system throughput. Unfortunately, there is little that can be done about such behaviour.
However, it should be noted that this is no worse than any other "standard" operating system, as an application can read large amounts of information from various files in the system, with similar effects, as this forces more legitimate blocks out of the cache. Also, the behaviour described only affects the system when the buffer cache has a need to recycle or release memory blocks, a behaviour which is somewhat independent of any individual application.

6.3.2 Security Issues

As discussed in Section 3.12.4, there are a number of security issues associated with application-specific extensions. The most important of these are whether to allow the co-location of interposed services, and determining which services can be interposed on, and in which manner. The answer to the first of these questions depends entirely upon the implementor of the service - it is likely that services shipped by the operating system vendor or a major software manufacturer can be trusted, while those produced by ordinary system users cannot be. Decisions in this area can only be made by the system administrator. The second issue, that of which services are appropriate for interposition, and how, is more complicated, as the answers depend entirely on the tasks undertaken by the service, the privileges it requires to accomplish the task, and the implications of any functionality changes on its clients. The multifaceted nature of this problem requires careful analysis, and is, for the most part, beyond the immediate scope of the thesis. The suggested solution involves the ability to tag every service with a set of attributes, which determine how trustworthy the service is (in particular, which address spaces it can be trusted to be co-located within) and which services it can be interposed on, with the possibility of the latter being further specialised by categories of users.
It is likely that recently developed technologies for code signing will be appropriate for this use. Deciding on and evaluating an appropriate set of these attributes is an important component of the future development of Kea and similar systems.

6.3.3 Policy Issues

Another important consideration when designing services is the policies they implement, and how these may, or should, be affected by application-specific extensions. Many kernel services implement a global policy that ensures fairness for all applications using the system. An example is the buffer cache, which ensures that the most recently accessed file blocks are kept, and does not prefer the blocks of any one application over those of others. Generally speaking, it is important that users be prevented from installing or replacing services that will cause such policies to favour their applications. What may be acceptable is giving applications control over local policy, i.e. those decisions that only affect the application. An example might be letting an application specify an alternate cache block to be replaced, instead of one selected (for that application) by the global policy. The separation of services into global and local policies, where appropriate, is another open research issue. Current extensible systems address the mechanics of installing extensions, rather than the control, and appropriateness, of any policies implemented by the extension. It is likely that the answers to this issue will depend heavily on the structure and functionality of the system.

6.3.4 Application Specific Summary

The primary promise of Kea is in its support for in-kernel applications. This facility enables those selected applications that absolutely require high performance and/or direct access to low-level parts of the system to be easily developed.
Often, these applications will be the only significant application running on the machine (examples might be dedicated file, WWW or database servers), for which the security and policy concerns discussed in the previous sections will not apply. Secondly, it has been demonstrated that for other, more general purpose applications, the judicious use of simple services can dramatically increase the system's performance. We believe that future configurable systems will allow both of these styles of extension, with the latter being restricted to specialised services either shipped with the system or made available by the system administrator. The utility of other application-specific extensions, and the degree to which they can be used, remains an open research issue.

CHAPTER 7

Conclusions and Future Work

7.1 Conclusions

The Kea operating system has shown that it is possible to design and implement a fully dynamically configurable and extensible operating system, and that this can be done without sacrificing performance. Several concepts were vital to the accomplishment of this task:

• System as a collection of services. A prerequisite to being configurable is having something that can be configured. The Kea system is viewed as being composed from a collection of services, each of which implements one part of the complete operating system. Each of these services can then be configured into various address spaces, composed in different fashions, or replaced by equivalent services. The service concept also includes that of isolating the interface of each service from its implementation.

• Procedure call oriented IPC. Making communications between services (IDCs, or Inter-Domain Calls, in Kea parlance) appear as procedure calls instead of message passing considerably simplifies both the engineering of service composition and the generation of service stubs.
While an undesirable side effect may be extra cost for simple procedure calls, this is offset by increased performance for more complex calls, particularly those that pass memory buffers.

• Unification of kernel and user programming environments. Kea offers developers an identical programming environment at both user and kernel levels. This ensures that services can be easily located in any address space, making reconfiguration possible. It also ensures that system extensions in the form of in-kernel applications are able to be easily and transparently developed.

• Full co-location of services. As performance is one of the most important considerations for operating system acceptance, Kea optimises IDCs to (almost) direct procedure calls when services are co-located within an address space. This is managed by the kernel, and is totally transparent to the services. This facility is only made simple by a combination of the previous points.

• Portal remapping. The kernel-level view of an IDC entry point is the portal. The kernel service manipulation routines modify portal identifiers and service stubs in order to transparently reconfigure the inter-service relationships. Portals also permit application-specific remappings, which allow portal invocations to be re-routed to other services, dependent upon the application at the root of the call chain, making various application-specific extensions possible.

Through a combination of these points, Kea accomplishes the thesis goals of a configurable and extensible operating system. By careful design, features such as service migration and interposition allow the system to be configured in many different ways. Kea offers many advantages, particularly to system developers and administrators. Services can be developed and debugged in a user address space, and then transparently relocated into the kernel for performance purposes.
Upgrading the system becomes a simple matter of replacing a service, which can be done dynamically, without the need to recompile, shutdown and reboot the machine on which the system is currently running. In summary, by giving increased control of service location and configuration to the system developers and administrators, and letting services be co-located when safe and appropriate, the Kea system can give the best of all worlds - safety, modularity and performance, depending upon the system requirements. Reviewing the thesis, Chapter 3 provides a detailed design of the base Kea abstractions, concentrating on the algorithms and data structures developed for service manipulation, and briefly covers the low-level services inherent to Kea. Chapter 4 describes the higher-level services that make up the current Kea system, primarily those required for filesystem support. Chapter 5 then uses these services in several experiments designed to measure the overhead of system reconfigurability. The primary conclusion of that chapter is that the Kea design may suffer some minor performance degradation when compared to a standard monolithic system, but that overall, the design of individual components has a far greater impact. Chapter 6 concentrates on application-specific extensions to the system, with results showing that some simple services can dramatically increase performance for selected application behaviours. Overall, the conclusion is drawn that configurable and extensible systems such as Kea have much to offer, both in terms of convenience and performance.

7.2 Future Work

During the design and implementation of Kea, many interesting issues arose, several of which there was not time to pursue. The following sections examine some of the work that remains to be done in order to make Kea fully functional, and describe some of the research directions arising from the thesis work.
7.2.1 Further Development

Although Kea offers many of the services of a traditional operating system, several of these are still in a raw or undeveloped stage. In particular, the networking services of the x-kernel need to be converted to real services. Simply using the x-kernel as a single enormous service would be a necessary first step, but splitting it into separately configurable services would be required in order to fully evaluate the necessity for, and performance of, such a system. Although the x-kernel is already statically configurable, and includes a relatively clean inter-protocol messaging mechanism, substantial work would be required to convert many of the components into services, due to various assumptions made by the system about the sharing of memory buffers and synchronisation. There are several other obvious areas of functionality where Kea could be improved. Currently only a small number of devices are supported, and increasing this number would be an important step in making the system more functional. Also, many of the services that are provided are very simple. The only terminal (keyboard and screen) handling provided is very raw, and is an obvious area where a stack of services could be used. On the application side, it would be interesting to provide libraries that offer POSIX [IEEE 90] compatibility[1], to more easily enable the porting of other programs to Kea. Another obvious area of improvement is the portal invocation code. Although it was shown in Section 3.15 that portal invocation is relatively fast, it is still not as efficient as is theoretically possible. In order to achieve the best performance, it will probably be necessary to rewrite this part of the kernel in customised assembly language, and re-evaluate the trap-handling and argument processing functionality.
This will make it possible to take advantage of various machine-specific optimisations that are only available to assembly programs. As is true of any system of Kea's functionality and size, there are also still bugs and errors in some parts of the system. Although these are fixed as they are found, it would be useful to design a set of tests for each part of the system, in order to identify and remove as many of these as possible.

[1] Some progress in this area has been achieved in the area of filesystems - it is currently possible to run GNU diff on the system.

7.2.2 Finer Grain Decomposition

One of the premises underlying reconfigurability is the presence of services that can be reconfigured. The granularity of these services, i.e. how finely the functionality of the system is split, is a major factor in determining the number of possible useful service configurations. Although Kea demonstrates that filesystem reconfiguration is possible, each of the filesystems is a large monolithic system, with fixed policies on behaviour such as file layout and access characteristics. While some services have been presented that can modify some of these policies, it would enhance the reconfigurability of this aspect of system behaviour if the filesystem could be composed from smaller services. In recent research on this topic, the Hurricane file system [Krieger & Stumm 97] has demonstrated that it is possible to construct a highly flexible and customisable filesystem. Hurricane composes filesystems from primitive building blocks in a manner that is very similar to service composition in Kea. Similar research has also been performed with network protocols [Hutchinson et al. 89, Bhatti & Schlichting 95], demonstrating that the fine-grain decomposition of services is possible in more than one domain.
It would be a valuable exercise to use this technology in conjunction with Kea's reconfigurable design to further demonstrate the utility and applicability of reconfigurable systems.

7.2.3 Improved State Migration

When migrated, services must package any state they wish to retain between domains into a contiguous memory region, which is then passed to the new service instantiation as an argument for unpackaging. Generally speaking, a service has two "types" of state that need to be transferred. The first, "static" state, is the contents of various local variables and allocated memory. The second, "active" state, is information about any threads that have been created by the service (such as the interrupt thread created by device drivers, or the read-ahead threads in the read-ahead service), namely their existence and current CPU state. It should be possible to combine the technologies developed for process migration in other operating systems [Nuttall 94] with new techniques for heterogeneous process migration [Smith 97] and remove all, or at least a substantial part of, the requirement for service developers to write the code for service state migration. This would considerably ease the development of future services, while also increasing Kea's functionality.

7.2.4 Dynamic Code Generation

Systems such as Synthesis [Pu et al. 88] and Synthetix [Cowan et al. 96], as well as recent techniques for dynamic code generation [Auslander et al. 96], have demonstrated that it is possible to efficiently generate code to specialise system operations. An interesting application of this technology would be to the portal invocation path. Currently, the kernel executes a general code path that must be capable of handling all possible procedure signatures, copying any parameters and data between domains as appropriate. Instead, it should be possible to generate specialised code for each portal.
This would result in major efficiencies within the portal invocation code, and presumably enhance the performance of the system considerably.

7.2.5 I/O Buffer Manipulation

One of the major performance problems with configuring services in different address spaces is the need to copy procedure parameters. While this can be accomplished relatively quickly for small arguments, copying larger areas of memory (typically those involved in I/O, such as file data) is more time consuming, and also results in double buffering of data, as multiple blocks of memory are allocated to hold essentially identical information. In order to reduce this overhead, it would be desirable to implement a system in which special areas of memory could be set aside as I/O buffers, and arrange for the portal invocation code to map this memory into each of the domains in the invocation path. This would eliminate copying, as each domain would be able to directly access the memory provided. Two systems, fbufs [Druschel & Peterson 93] and container shipping [Pasquale et al. 94], make use of similar ideas for arranging copy-free I/O paths in standard operating systems, while the Scout operating system [Montz et al. 95] uses similar technology to support "paths" between producers and consumers of information. Implementing this extension would require the development of data structures describing the domains for each I/O buffer, and an extension to the procedure signature types (I/O buffers would be another argument type, extending the current pointer and string types), together with the appropriate application support to make effective use of these features.

7.2.6 Service Format

Currently, service files are represented on disk in the native object file format of the destination architecture, and internally by data structures describing the text and data segments, relocations inside those segments, and a symbol table.
Currently, one of the major costs in migrating a service between address spaces is the relinking stage. It would be interesting to investigate alternative file formats and data structures in order to increase the speed of the relinking process.

7.2.7 Application Specific Extensions

One of the more interesting applications of Kea's configurability is in the area of application specific system extensions. While the experiments described in Chapter 6 showed that it is possible to construct application specific services that can result in increased performance and system throughput, it would be valuable to construct more of these services, and evaluate their effect with real application loads. Particularly interesting is the development of "network appliance" type systems, in which the operating system is configured to support only one major application, providing services to other machines on the network. With its high configurability and ability to easily run applications within the kernel, Kea should be ideal for the development of this style of system.

Related to the network appliance style of system, it may be interesting to redesign the low level services currently provided by the kernel. These were designed in their current form in order to more easily support the development of other, higher-level, services. It should be possible to redesign the low level services closer to the hardware, and provide facilities such as virtual memory as high-level services. This would allow the configuration of systems with fundamentally different modes of operation. An example might be embedded systems, which seldom require full virtual memory support, but are designed to run in a fixed address space, or systems that require address spaces, but not paging.

Another related area is the determination of security guidelines for application specific extensions.
As described in Chapter 6, there is a need for a means to specify which services are trustworthy, and where and how they can be installed. What these properties should be, and how they are determined, is one of the most challenging research issues remaining for extensible systems.

Bibliography

Anderson et al. 91 Thomas E. Anderson, Henry M. Levy, Brian N. Bershad and Edward D. Lazowska. The Interaction of Architecture and Operating System Design. In Proceedings of the 1991 International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IV), April 1991, pp. 108-120

Anderson et al. 92 Thomas E. Anderson, Brian N. Bershad, Edward D. Lazowska and Henry M. Levy. Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism. ACM Transactions on Computer Systems, 10(1), February 1992, pp. 53-79

Appel & Li 91 Andrew W. Appel and Kai Li. Virtual Memory Primitives for User Programs. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, April 1991, pp. 96-107

Auslander et al. 96 J. Auslander, M. Philipose, C. Chambers, S.J. Eggers and B.N. Bershad. Fast, Effective Dynamic Compilation. In Proceedings of the Conference on Programming Language Design and Implementation, May 1996

Banatre et al. 91 Michel Banatre, Pack Heng, Gilles Muller and Bruno Rochat. How to Design Reliable Servers Using Fault Tolerant Micro-Kernel Mechanisms. In Proceedings of the USENIX Mach Symposium, November 1991, pp. 223-232

Banerji et al. 97 Arindam Banerji, John M. Tracey and David L. Cohn. Protected Shared Libraries - A New Approach to Modularity and Sharing. In Proceedings of the USENIX 1997 Annual Technical Conference, January 1997, pp. 59-75

Bartoli et al. 93 Alberto Bartoli, Sape J. Mullender and Martijn van der Valk. Wide-Address Spaces - Exploring the Design Space. ACM Operating Systems Review, 27(1), January 1993, pp.
11-17

Bershad & Pinkerton 88 Brian N. Bershad and C. Brian Pinkerton. Watchdogs - Extending the UNIX File System. Computing Systems, 1(2), Spring 1988, pp. 169-188

Bershad et al. 89 Brian N. Bershad, Thomas E. Anderson, Edward D. Lazowska and Henry M. Levy. Lightweight Remote Procedure Call. In Proceedings of the Twelfth ACM Symposium on Operating System Principles, December 1989, pp. 102-113

Bershad 92 Brian N. Bershad. The Increasing Irrelevance of IPC Performance for Microkernel-Based Operating Systems. In Proceedings of the USENIX Workshop on Microkernels and Other Kernel Architectures, April 1992, pp. 205-212

Bershad et al. 95 Brian N. Bershad, Stefan Savage, Przemyslaw Pardyak, Emin Gun Sirer, Marc E. Fiuczynski, David Becker, Craig Chambers and Susan Eggers. Extensibility, Safety and Performance in the SPIN Operating System. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, December 1995, pp. 267-284

Bhatti & Schlichting 95 Nina T. Bhatti and Richard D. Schlichting. A System for Constructing Configurable High-Level Protocols. In Proceedings of SIGCOMM '95, August 1995, pp. 138-150

Birrell 89 Andrew D. Birrell. An Introduction to Programming With Threads. Technical Report SRC-35, DEC Systems Research Center, January 1989

Black et al. 92 David L. Black, David B. Golub, Daniel P. Julin, Richard F. Rashid, Richard P. Draves, Randall W. Dean, Alessandro Forin, Joseph Barrera, Hideyuki Tokuda, Gerald Malan and David Bohman. Microkernel Operating System Architecture and Mach. In USENIX Workshop on Microkernels and Other Kernel Architectures, April 1992, pp. 11-30

Bolosky et al. 96 W.J. Bolosky, J.S. Barrera III, Richard P. Draves, R.P. Fitzgerald, G.A. Gibson, Michael B. Jones, S.P. Levi, N.P. Myhrvold and Richard F. Rashid. The Tiger Video Fileserver. In Proceedings of the Sixth International Workshop on Network and Operating System Support for Digital Audio and Video, April 1996

Cao et al.
94 Pei Cao, Edward W. Felten and Kai Li. Implementation and Performance of Application-Controlled File Caching. In Proceedings of the First Symposium on Operating Systems Design and Implementation, November 1994, pp. 165-177

Carter et al. 94 N.P. Carter, S.W. Keckler and W.J. Dally. Hardware Support for Fast Capability-Based Addressing. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VI), October 1994, pp. 319-327

Chambers et al. 96 Craig C. Chambers, Susan J. Eggers, J. Auslander, M. Philipose, M. Mock and Przemyslaw Pardyak. Automatic Dynamic Compilation Support for Event Dispatching in Extensible Systems. In the Workshop on Compiler Support for System Software, February 1996

Chase et al. 93 Jeffrey S. Chase, Valerie Issarny and Henry M. Levy. Distribution in a Single Address Space Operating System. ACM Operating Systems Review, 27(2), April 1993, pp. 61-65

Chaum & van Antwerpen 90 D. Chaum and H. van Antwerpen. Undeniable Signatures. Advances in Cryptology - CRYPTO '89 Proceedings, Springer-Verlag, 1990, pp. 212-216

Cheriton & Duda 94 David R. Cheriton and Kenneth J. Duda. A Caching Model of Operating System Kernel Functionality. In Proceedings of the First Symposium on Operating Systems Design and Implementation, November 1994, pp. 179-193

Clark 85 David D. Clark. The Structuring of Systems Using Upcalls. In Proceedings of the Tenth ACM Symposium on Operating System Principles, December 1985, pp. 171-180

Condict et al. 94 Michael Condict, Don Bolinger, Dave Mitchell and Eamonn McManus. Microkernel Modularity with Integrated Kernel Performance. Presented at the Mach-Chorus Workshop at the First Symposium on Operating Systems Design and Implementation, November 1994. Available at http://www.cs.utah.edu/~lepreau/osdi94/condict/abstract.html

Corbato et al. 72 F.J. Corbato, J.H. Saltzer and C.T. Clingen.
Multics - The First Seven Years. In Proceedings of the American Federation of Information Processing Societies Spring Joint Computer Conference, 1972, pp. 571-583. Reprinted in P. Freeman, Software Systems Principles, Science Research Associates, 1975

Cowan et al. 96 Crispin Cowan, Tito Autrey, Charles Krasic, Calton Pu and Jonathon Walpole. Fast Concurrent Dynamic Linking for an Adaptive Operating System. In Proceedings of the International Conference on Configurable Distributed Systems, May 1996

Draves et al. 91 Richard P. Draves, Brian N. Bershad, Richard F. Rashid and Randall W. Dean. Using Continuations to Implement Thread Management and Communication in Operating Systems. In Proceedings of the Thirteenth ACM Symposium on Operating System Principles, December 1991, pp. 122-136

Draves & Cutshall 97 Richard P. Draves and Scott M. Cutshall. Unifying the User and Kernel Environments. Microsoft Research Technical Report MSR-TR-97-10

Druschel et al. 91 Peter Druschel, Larry L. Peterson and Norman C. Hutchinson. Service Composition in Lipto. In Proceedings of the International Workshop on Object Orientation in Operating Systems, October 1991, pp. 108-111

Druschel et al. 92a Peter Druschel, Larry L. Peterson and Norman C. Hutchinson. Beyond Microkernel Design: Decoupling Modularity and Protection in Lipto. In Proceedings of the Twelfth International Conference on Distributed Computing Systems, June 1992, pp. 512-520

Druschel et al. 92b Peter Druschel, Larry L. Peterson and Norman C. Hutchinson. Modularity and Protection Should be Decoupled. In Proceedings of the Third Workshop on Workstation Operating Systems, April 1992, pp. 95-97

Druschel 93 Peter Druschel. Efficient Support for Incremental Customization of OS Services. In Proceedings of the Third International Workshop on Object Orientation in Operating Systems, December 1993, pp. 168-190

Druschel & Peterson 93 Peter Druschel and Larry L. Peterson.
Fbufs: A High-Bandwidth Cross-Domain Transfer Facility. In Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles, December 1993, pp. 189-202

Engler et al. 95 Dawson R. Engler, M. Frans Kaashoek and James O'Toole Jr. Exokernel: An Operating System Architecture for Application-Level Resource Management. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, December 1995, pp. 251-266

Engler & Kaashoek 95 Dawson R. Engler and M. Frans Kaashoek. Exterminate All Operating System Abstractions. In Proceedings of the Fifth Workshop on Hot Topics in Operating Systems, May 1995, pp. 78-83

Fall & Pasquale 94 K. Fall and J. Pasquale. Improving Continuous-Media Playback Performance with In-Kernel Data Paths. In Proceedings of the First IEEE International Conference on Multimedia Computing and Systems, May 1994, pp. 100-109

Finkelstein et al. 95 David Finkelstein, Norman C. Hutchinson, Dwight J. Makaroff, Roland Mechler and Gerald W. Neufeld. Real Time Threads Interface. Computer Science, University of British Columbia Technical Report TR-95-07, March 1995

Ford & Lepreau 94 Bryan Ford and Jay Lepreau. Evolving Mach 3.0 to a Migrating Thread Model. In Proceedings of the 1994 Winter USENIX Conference, January 1994, pp. 97-114

Ford et al. 96 Bryan Ford, Mike Hibler, Jay Lepreau, Patrick Tullman, Godmar Back and Stephen Clawson. Microkernels Meet Recursive Virtual Machines. In Proceedings of the Second USENIX Symposium on Operating Systems Design and Implementation, October 1996, pp. 137-152

Ford et al. 97 Bryan Ford, Godmar Back, Greg Benson, Jay Lepreau, Albert Lin and Olin Shivers. The Flux OSKit: A Substrate for Kernel and Language Research. In Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles, October 1997, pp. 38-51

Gingell 89 Robert A. Gingell. Shared Libraries. Unix Review, 7(8), August 1989, pp. 56-66

Golub et al. 90 David Golub, Randall W. Dean, Alessandro Forin and Richard F.
Rashid. Unix as an Application Program. In Proceedings of the 1990 Summer USENIX Conference, June 1990, pp. 87-95

Guillemont et al. 91 Marc Guillemont, Jim Lipkis, Douglas Orr and Marc Rozier. A Second Generation Micro-Kernel Based Unix: Lessons in Performance and Compatibility. In Proceedings of the Winter 1991 USENIX Conference, 1991, pp. 13-22

Hamilton & Kougiouris 93 Graham Hamilton and Panos Kougiouris. The Spring Nucleus: A Microkernel for Objects. In Proceedings of the 1993 Summer USENIX Conference, June 1993, pp. 147-159

Hartig et al. 97 Hermann Hartig, Michael Hohmuth, Jochen Liedtke, Sebastian Schonberg and Jean Wolter. The Performance of µ-Kernel-Based Systems. In Proceedings of the Sixteenth ACM Symposium on Operating System Principles, October 1997, pp. 66-77

Harty & Cheriton 91 K. Harty and D.R. Cheriton. Application-Controlled Physical Memory using External Page-Cache Management. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IV), April 1991, pp. 187-197

Heiser et al. 93 Gernot Heiser, Kevin Elphinstone, Stephen Russell and Jerry Vochteloo. Mungi: A Microkernel for Objects. In Proceedings of the 1993 Summer USENIX Conference, June 1993, pp. 147-159

Hildebrand 92 Dan Hildebrand. An Architectural Overview of QNX. In Proceedings of the USENIX Workshop on Micro-Kernels and Other Kernel Architectures, April 1992

Hsieh et al. 96 Wilson Hsieh, Marc Fiuczynski, Charles Garrett, Stefan Savage, David Becker and Brian Bershad. Language Support for Extensible Operating Systems. In the Workshop on Compiler Support for System Software, February 1996

Hutchinson & Peterson 88 Norman C. Hutchinson and Larry L. Peterson. Design of the x-Kernel. In Proceedings of the SIGCOMM 1988 Symposium, August 1988, pp. 65-75

Hutchinson et al. 89 Norman C. Hutchinson, Larry L. Peterson, Mark B. Abbott and Sean O'Malley. RPC in the x-Kernel: Evaluating New Design Techniques.
In Proceedings of the Twelfth ACM Symposium on Operating Systems Principles, December 1989, pp. 91-101

IEEE 90 IEEE Std. 1003.1-1990, Standard for Information Technology - Portable Operating System Interface (POSIX) - Part 1: System Application Programming Interface (API) [C Language]. IEEE, Piscataway, NJ

Intel 90 Intel Corporation. i486 Microprocessor Programmer's Reference Manual. Osborne McGraw-Hill, 1990

Intel 94 Intel Corporation. Pentium Microprocessor Programmer's Reference Manual. Osborne McGraw-Hill, 1993

Jones 93 Michael B. Jones. Interposition Agents: Transparently Interposing User Code at the System Interface. In Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles, December 1993, pp. 80-93

Kaashoek et al. 97 M. Frans Kaashoek, Dawson R. Engler, Gregory R. Ganger, Hector Briceno, Russell Hunt, David Mazieres, Thomas Pinkney, Robert Grimm, John Jannotti and Kenneth Mackenzie. Application Performance and Flexibility on Exokernel Systems. In Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles, October 1997, pp. 52-65

Khalidi & Nelson 93a Yousef A. Khalidi and Michael N. Nelson. An Implementation of UNIX on an Object-oriented Operating System. In Proceedings of the 1993 Winter USENIX Conference, January 1993, pp. 469-479

Khalidi & Nelson 93b Yousef A. Khalidi and Michael N. Nelson. Extensible File Systems in Spring. Operating Systems Review, 27(5), December 1993, pp. 1-13

Kiczales et al. 91 Gregor Kiczales, Jim des Rivieres and Daniel G. Bobrow. The Art of the Metaobject Protocol. MIT Press, 1991

Kiczales et al. 92 Gregor Kiczales, M. Theimer and Brent Welch. A New Model of Abstraction for Operating System Design. In Proceedings of the International Workshop on Object-Oriented Operating Systems, 1992, pp. 346-350

Krieger & Stumm 97 Orran Krieger and Michael Stumm. HFS: A Performance-Oriented Flexible File System Based on Building-Block Compositions.
ACM Transactions on Computer Systems, 15(3), August 1997, pp. 286-321

Kulkarni 93 Dinesh C. Kulkarni. Pi: A New Approach to Operating System Structuring for Flexibility. Technical Report 93-4, Department of Computer Science and Engineering, University of Notre Dame, April 1993

Lee et al. 94 Chao-Hsien Lee, Meng Chang Chen and Ruei-Chuan Chang. HiPEC: High Performance External Virtual Memory Caching. In Proceedings of the First Symposium on Operating Systems Design and Implementation, November 1994, pp. 153-164

Lepreau et al. 93 Jay Lepreau, Mike Hibler, Bryan Ford, Jeffrey Law and Douglas Orr. In-Kernel Servers in Mach 3.0: Implementation and Performance. In Proceedings of the USENIX Mach III Symposium, April 1993, pp. 39-55

Levin et al. 75 R. Levin, E. Cohen, W. Corwin, F. Pollack and W. Wulf. Policy/Mechanism Separation in Hydra. In Proceedings of the Fifth ACM Symposium on Operating Systems Principles, November 1975, pp. 132-140

Liedtke et al. 93 Jochen Liedtke. Improving IPC by Kernel Design. In Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles, December 1993, pp. 175-188

Liedtke et al. 95 Jochen Liedtke. On µ-Kernel Construction. In Proceedings of the Fifteenth ACM Symposium on Operating System Principles, December 1995, pp. 237-250

Liedtke et al. 97 Jochen Liedtke, Kevin Elphinstone, Sebastian Schonberg, Hermann Hartig, Gernot Heiser, Nayeem Islam and Trent Jaeger. Achieved IPC Performance (Still the Foundation for Extensibility). In Proceedings of the Sixth Workshop on Hot Topics in Operating Systems, May 1997, pp. 28-31

Maeda & Bershad 93 C. Maeda and Brian N. Bershad. Protocol Service Decomposition for High-Performance Networking. In Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles, December 1993, pp. 244-255

Maes 87 P. Maes. Concepts and Experiments in Computational Reflection. In Proceedings of the 1987 Conference on Object-Oriented Programming Systems, Languages and Applications, 1987, pp.
147-155

Malan et al. 91 Gerald Malan, Richard Rashid, David Golub and Robert Baron. DOS as a Mach 3.0 Application. In Proceedings of the USENIX Mach Symposium, November 1991, pp. 27-40

Massalin & Pu 89 Henry Massalin and Calton Pu. Threads and Input/Output in the Synthesis Kernel. In Proceedings of the Twelfth ACM Symposium on Operating Systems Principles, December 1989, pp. 191-201

McCanne & Jacobsen 93 S. McCanne and Van Jacobsen. The BSD Packet Filter: A New Architecture for User-level Packet Capture. In Proceedings of the 1993 Winter USENIX Conference, January 1993

McKusick 82 Kirk McKusick. 4.2BSD File System. In Proceedings of the USENIX Tools Users Group Joint Conference, July 1982, pp. 31-45

McKusick et al. 96 Marshall K. McKusick, Keith Bostic, Michael J. Karels and John S. Quaterman. The Design and Implementation of the 4.4BSD Operating System. Addison Wesley, 1996

McNamee & Armstrong 90 Dylan McNamee and Katherine Armstrong. Extending the Mach External Pager Interface to Accommodate User-Level Page Replacement Policies. In Proceedings of the USENIX Mach Workshop, October 1990, pp. 17-29

Microsoft 97 Microsoft Corporation. How Software Publishers Can Use Authenticode Technology. http://www.microsoft.com/intdev/signcode

Mogul et al. 87 Jeff Mogul, Richard Rashid and M.J. Accetta. The Packet Filter: An Efficient Mechanism for User-level Network Code. In Proceedings of the Eleventh ACM Symposium on Operating Systems Principles, November 1987, pp. 39-52

Montz et al. 95 Allen B. Montz, David Mosberger, Sean W. O'Malley, Larry L. Peterson and Todd A. Proebsting. Scout: A Communications-Oriented Operating System. In Proceedings of the 4th Workshop on Hot Topics in Operating Systems (HotOS-IV), May 1995

Mukherjee & Schwan 93 Bodhisattwa Mukherjee and Karsten Schwan. Experimentation with a Reconfigurable Micro-Kernel.
In Proceedings of the USENIX Symposium on Microkernels and Other Kernel Architectures, September 1993, pp. 45-60

Necula & Lee 96 George C. Necula and Peter Lee. Safe Kernel Extensions Without Run-Time Checking. In Proceedings of the Second Symposium on Operating System Design and Implementation, October 1996, pp. 229-243

Nelson 91 G. Nelson (editor). System Programming in Modula-3. Prentice-Hall, 1991

Nuttall 94 Mark Nuttall. A Brief Survey of Systems Providing Process or Object Migration Facilities. Operating Systems Review, October 1994

Organick 72 Elliot I. Organick. The Multics System: An Examination of Its Structure. MIT Press, 1972

Orr et al. 93 Douglas B. Orr, John Bonn, Jay Lepreau and Robert Mecklenburg. Fast and Flexible Shared Libraries. In Proceedings of the 1993 Summer USENIX Conference, June 1993, pp. 237-251

Ousterhout 90 John K. Ousterhout. Why Aren't Operating Systems Getting Faster as Fast as Hardware? In Proceedings of the USENIX Summer Conference, June 1990, pp. 247-256

Pardyak & Bershad 96 Przemyslaw Pardyak and Brian Bershad. Dynamic Binding for an Extensible System. In Proceedings of the Second USENIX Symposium on Operating Systems Design and Implementation, October 1996, pp. 201-212

Pasquale et al. 94 Joseph Pasquale, Eric Anderson and P. Keith Muller. Container Shipping: Operating System Support for I/O-Intensive Applications. Computer, 27(3), March 1994, pp. 84-93

Peterson et al. 90 Larry L. Peterson, Norman C. Hutchinson, Sean W. O'Malley and Herman C. Rao. The x-kernel: A Platform for Accessing Internet Resources. Computer, 23(5), May 1990, pp. 23-33

Pu et al. 88 Calton Pu, Henry Massalin and J. Ioannidis. The Synthesis Kernel. Computing Systems, 1(1), Winter 1988, pp. 11-32

Pu et al. 95 Calton Pu, Tito Autrey, Andrew Black, Charles Consel, Crispin Cowan, Jon Inouye, Lakshmi Kethana, Jonathon Walpole and Ke Zhang. Optimistic Incremental Specialization: Streamlining a Commercial Operating System.
In Proceedings of the Fifteenth ACM Symposium on Operating System Principles, December 1995, pp. 314-324

Radia 89 Sanjay Radia. Names, Contexts and Closure Mechanisms in Distributed Computing Environments. Ph.D. Thesis, University of Waterloo, Department of Computer Science, 1989

Rees et al. 86 Jim Rees, Paul H. Levine, Nathaniel Mishkin and Paul J. Leach. An Extensible I/O System. In Proceedings of the 1986 Summer USENIX Conference, June 1986, pp. 114-125

Ritchie 79 Dennis M. Ritchie. Protection of Data File Contents. United States Patent 4,135,240, United States Patent Office (January 16, 1979). Assignee: Bell Telephone Laboratories, Inc., Murray Hill, NJ. Appl. No: 377,591, Filed: July 9, 1973

Rivest 92 R. Rivest. The MD5 Message-Digest Algorithm. Network Working Group RFC 1321, April 1992

Rosenblum et al. 95 Mendel Rosenblum, Edouard Bugnion, Stephen Alan Herrod, Emmett Witchel and Anoop Gupta. The Impact of Architectural Trends on Operating System Performance. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, December 1995

Rozier et al. 92 Marc Rozier, Vadim Abrossimov, Francois Armand, I. Boule, Michel Gien, Marc Guillemont, Frederic Herrmann, C. Kaiser, S. Langlois, P. Leonard and W. Neuhauser. Overview of the Chorus Distributed Operating System. In Proceedings of the USENIX Workshop on Micro-Kernels and Other Kernel Architectures, April 1992, pp. 39-70

Sabatella 90 Marc Sabatella. Issues in Shared Library Design. In Proceedings of the Summer 1990 USENIX Conference, June 1990, pp. 11-24

Schroeder & Burroughs 89 M.D. Schroeder and M. Burroughs. Performance of the Firefly RPC. In Proceedings of the Twelfth ACM Symposium on Operating Systems Principles, December 1989, pp. 83-90

Schulman et al. 92 A. Schulman, D. Maxey and M. Pietrek. Undocumented Windows. Addison-Wesley, 1992

Scott et al. 88 Michael L. Scott, Thomas J. LeBlanc and Brian D. Marsh.
Design Rationale for Psyche, a General-Purpose Multiprocessor Operating System. In Proceedings of the 1988 International Conference on Parallel Processing, 1988, pp. 255-262

Seltzer et al. 96 Margo I. Seltzer, Yasuhiro Endo, Christopher Small and Keith A. Smith. Dealing With Disaster: Surviving Misbehaved Kernel Extensions. In Proceedings of the Second USENIX Symposium on Operating Systems Design and Implementation, October 1996, pp. 213-227

Sirer et al. 96 Emin Gun Sirer, Marc Fiuczynski, Przemyslaw Pardyak and Brian Bershad. Safe Dynamic Linking in an Extensible Operating System. In Proceedings of the Workshop on Compiler Support for System Software, February 1996

Small & Seltzer 96 Christopher Small and Margo Seltzer. A Comparison of OS Extension Technologies. In Proceedings of the USENIX 1996 Annual Technical Conference, January 1996, pp. 41-54

Smith 97 Peter W. Smith. The Possibilities and Limitations of Heterogeneous Process Migration. Ph.D. Thesis, Computer Science, University of British Columbia, October 1997

Temple 94 Philip Temple. The Feisty Parrot. New Zealand Geographic Magazine, October 1994. Also available at http://

van Renesse et al. 88 R. van Renesse, H. van Staveren and Andrew S. Tanenbaum. Performance of the World's Fastest Distributed Operating System. Operating Systems Review, 22(4), October 1988, pp. 25-34

Veitch 96 Alistair C. Veitch and Norman C. Hutchinson. Kea - A Dynamically Extensible and Configurable Operating System Kernel. In Proceedings of the Third International Conference on Configurable Distributed Systems, May 1996, pp. 236-242

Veitch 98 Alistair C. Veitch and Norman C. Hutchinson. Dynamic Service Reconfiguration and Migration in the Kea Kernel. In Proceedings of the Fourth International Conference on Configurable Distributed Systems, May 1998, pp. 156-163

Wahbe et al. 93 Robert Wahbe, Steven Lucco, Thomas E. Anderson and Susan L. Graham. Efficient Software-Based Fault Isolation.
In Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles, December 1993, pp. 175-188

Walpole et al. 92 Jonathon Walpole, Jon Inouye and Ravindranath Konoru. Modularity and Interfaces in Micro-Kernel Design and Implementation: A Case Study of Chorus on the HP PA-RISC. In Proceedings of the USENIX Workshop on Micro-Kernels and Other Kernel Architectures, April 1992, pp. 71-82

Wulf et al. 74 William Wulf, E. Cohen, W. Corwin, A. Jones, Roy Levin, C. Pierson and F. Pollack. HYDRA: The Kernel of a Multiprocessor Operating System. Communications of the ACM, 17(6), June 1974, pp. 337-345

Wulf et al. 81 William Wulf, Roy Levin and Samuel P. Harbison. Hydra/C.mmp: An Experimental Computer System. McGraw-Hill, 1981

Yokote 92 Yasuhiko Yokote. The Apertos Reflective Operating System: The Concept and its Implementation. In Proceedings of the Seventh Annual Conference on Object-Oriented Programming Systems, Languages and Applications, October 1992, pp. 414-434

APPENDIX A Kernel Service Interfaces

This appendix contains details of the interface to each of the services provided by the Kea kernel, with the exception of the service interface, which is discussed in detail in Chapter 3.

A.1 Domain Interface

Domains are virtual address spaces. A new domain is created with the domainCreate call:

int domainCreate();

This function creates a new domain, and returns its identifier.
Domains are destroyed with the d o m a i n T e r m i n a t e call, which takes the identifier of the domain to be destroyed as its argu-ment: i n t d o m a i n T e r m i n a t e ( i n t i d ) ; 130 S E C T I O N A .2 V M I N T E R F A C E 131 All the threads in a domain can be suspended and resumed using the domainSuspend and domainResume calls: i n t domainSuspend( int i d ) ; i n t domainResume(int i d ) ; The user identifier of the domain owner can be set and retrieved with the domainSetOwner and domainGetOwner calls: i n t domainSetOwner( int i d , uns igned owner) ; i n t domainGetOwner( int i d ) ; Finally, the identity of the current domain can be found by calling the domainID function: i n t domainID ().; A.2 V M Interface The "vm" interface allows for the manipulation of a domain's virtual memory. New memory is allocated with the v m A l l o c a t e procedure: i n t v m A l l o c a t e ( i n t domain, vaddr * a d d r e s s , u n s i g n e d s i z e , vaddr e x a c t ) ; In this procedure, domain is the domain in which memory is to be allocated, s i z e is the size of the desired memory region in bytes (this will be rounded up to the nearest number of whole pages), and address is used to return the address of the allocated memory. If exact is not zero, then the vm service will attempt to allocate the memory at the specified address. Virtual memory can also be allocated, and backed by a specified physical address range. This functionality is used by device drivers that need to access physical I/O locations mapped into the processors physical address space. The v m P h y s A l l o c a t e function is used for this pur-pose: SECTION A . 
int vmPhysAllocate(int domain, paddr paddress, vaddr *vaddress, unsigned npg);

In this function, domain specifies the target domain for the allocation, paddress is the physical address, vaddress the (returned) virtual address to which the physical address is mapped and npg the number of pages to be allocated. Virtual memory is freed with the vmFree function:

int vmFree(int domain, vaddr address, unsigned size);

where domain specifies the domain, address the virtual address within the domain (this address should be within a range returned from vmAllocate or vmPhysAllocate, and will be rounded down to the nearest page boundary) and size the size (number of bytes) of memory to be freed. A region of memory can be made read-only or read-write (the default) with the vmProtect procedure:

int vmProtect(int domain, vaddr address, unsigned size, bool read);

The domain and memory region is specified as in the other procedures, while the read parameter determines the protection of the memory region: true for read-only, false for read-write. Memory can be "pinned" or made unpageable with the vmLock procedure, and unpinned with the vmUnlock procedure¹:

vmLock(int domain, vaddr address, unsigned size);
vmUnlock(int domain, vaddr address, unsigned size);

1. Kea does not yet have paging support, so these procedures remain largely unimplemented.

The final vm function is vmMap, used to map various memory regions from one domain to another. The prototype for this procedure is:

int vmMap(int srcDomain, vaddr srcAddress, unsigned size, int dstDomain, vaddr *dstAddress, bool exact, bool dstProt, int mapping);

The first three parameters determine the domain and virtual region from which the memory is to be mapped.
The next three parameters determine the destination domain and region, in a manner similar to that of vmAllocate. The dstProt parameter acts to determine the protection of this memory (as for vmProtect). The final parameter, mapping, determines the type of mapping to be made. Currently, three different values are supported:

• VM_MAP_COPY - The memory is copied to the target domain. Any references in either domain do not affect the other. For efficiency, the copy is lazily evaluated using copy-on-write.
• VM_MAP_SHARE - The underlying physical pages are shared between domains. Any update done by one will also be seen by the other.
• VM_MAP_MOVE - The memory range is moved into the target domain. This is equivalent to a vmMap with VM_MAP_SHARE, followed by a vmFree in the source domain.

A.3 Event Interface

Events provide support for the asynchronous delivery of significant system events, such as page faults, interrupts and domain termination. A domain registers itself for an event by calling eventRegister:

int eventRegister(int domain, unsigned event, void *handler);

eventRegister specifies the domain to receive the event, an event identifier, and the address of a function that will be called when the event occurs. The eventDeregister function is used to remove notifications of an event:

int eventDeregister(int domain, unsigned event);

Finally, the eventWait call is used by a specific thread to wait for an event. The function takes a pointer to an argument that will be filled in by the kernel on event reception (all event handlers also get an argument as a parameter when invoked).

int eventWait(unsigned event, int *arg);

A.4 Name Interface

The name interface provides a very simple hierarchical naming structure for the system.
New names are registered with the nameRegister call:

int nameRegister(const char *name, bool dir, int id);

nameRegister takes as parameters a string (the name to be created), a flag describing whether the name to be created is a directory (a container for other names) or a plain name, and an identifier to be associated with the name. The identifier can represent entities such as domains, threads or services, depending upon the application's purposes. Names are removed from the system with nameDeregister:

int nameDeregister(const char *name);

Applications can determine whether a name exists with the nameExists call:

int nameExists(const char *name);

Sometimes, applications may need to wait until a name is defined before they can continue. The nameWait call fulfills this need:

int nameWait(const char *name);

The identifier associated with a name can be determined by the nameGetID call:

int nameGetID(const char *name);

For directory names, the count of names in the directory can be found using nameDirentries, while the name with a given index can be found using nameGetDirentry:

int nameDirentries(const char *dir);
int nameGetDirentry(const char *dir, unsigned index, char *name);

Finally, a new naming service can be "attached" to an existing name by using the nameAttach call. This will result in all function calls on names that begin with the name prefix (i.e. the first argument to nameAttach) being passed to the identified service for completion.

int nameAttach(const char *name, int service);

A.5 Thread Interface

The thread service supports threads, the Kea context of execution.
New threads are created using the threadCreate call:

int threadCreate(int domain, vaddr entry, unsigned maxstack, void *argp, unsigned nargs);

This function creates a new thread in the specified domain, at the entry point entry. The stack size allocated to the thread is determined by the maxstack argument. Finally, argp is a pointer to a memory buffer containing arguments to be passed to the thread, while nargs is a count of the number of arguments. Threads are created in a suspended state, and do not run until threadResume is called. Threads can be terminated by the threadTerminate call:

int threadTerminate(int id);

An application can check for the existence of a thread with the threadExists call:

bool threadExists(int id);

Any thread can find out what its own identifier is by calling threadID:

int threadID();

Threads can be indefinitely suspended until resumed at a later time by the following calls:

int threadSuspend(int id);
int threadResume(int id);

A thread can sleep for a variable time period by using the threadSleep call:

void threadSleep(struct time *);

The scheduling attributes for a thread can be set and retrieved by the following two functions. The definition of the sched_info structure is dependent upon the scheduler being used.

int threadSetSched(int id, struct sched_info *);
int threadGetSched(int id, struct sched_info *);

Finally, the following three calls are used to set, reset and retrieve the thread's effective domain. The use of these functions is described in Section 3.12.1.
void threadSetEdomain();
void threadResetEdomain();
int threadGetEdomain();

APPENDIX B

High-Level Service Interfaces

This appendix contains interface details for some of the high-level services discussed in chapter 4.

B.1 IDE Interface

The interface presented by the IDE service is very simple. There are only three procedures making up the interface:

int ideRead(int partitionID, unsigned offset, void *data, unsigned sectors);
int ideWrite(int partitionID, unsigned offset, void *data, unsigned sectors);
int ideSize(unsigned partitionID);

The first two procedures deal with reading and writing data. They each take a partition identifier, an offset (in 512 byte sectors) into the partition, a pointer to the data to be transferred, and the number of sectors to be transferred. The third function returns the number of sectors in a specified partition. Clients of the IDE service obtain partition identifiers from the name service, where they are associated with names such as "/device/ide0/bsd0"¹ by the IDE service when it is initialised.

B.2 Bcache Interface

The bcache service provides buffering of disk blocks for file services. New blocks are read from and written to the disk with the bcacheRead and bcacheWrite procedures. Each of these has identical parameters.
The prototype of bcacheRead is shown below:

int bcacheRead(int fid, unsigned physOffset, unsigned size, bcacheHint hint, unsigned bufOffset, void *buffer, unsigned bufSize);

Each of the read and write procedures has seven arguments: a unique file identifier, the physical offset of the block, the size of the block, a hint (used to control how quickly blocks are recycled from the cache), and three arguments describing a memory buffer into (or from) which the cache is to copy the data. The last two of these point to the buffer and its size (i.e. how many bytes are to be copied) while the first determines an offset into the physical block. This allows filesystems to control how much of the block they currently wish to access. On successful completion, these procedures return the number of bytes copied, or a negative value if a failure occurs. Note that there is no device parameter to this function. This is because the current version of the Kea system only contains a single IDE disk. It is planned that future versions will have a parameter identifying the device. The hint parameter is used to control how quickly the block is removed from the cache. Currently four values are supported:

• HINT_KEEP - the block will probably be used again.
• HINT_MAYBE - the block may possibly be used again.
• HINT_DISCARD - the block is not expected to be used again.
• HINT_DELETE - the block is invalid: there is no need to write it back, even if dirty.

1. This can be interpreted as the first partition with a BSD filesystem on the first IDE device.

Three other procedures are provided that allow clients to force the writing of dirty blocks to disk.
These procedures allow cache flushing on either a file, block or global (all blocks in the cache) basis:

void bcacheSyncFile(int fid);
void bcacheSyncBlock(int cacheID);
void bcacheSyncAll();

B.3 File Interface

All of the Kea filesystems conform to a single interface, which is very similar to the POSIX file procedures:

int fileOpen(char *name, int oflags);
int fileClose(int handle);
int fileRead(int handle, void *buffer, unsigned bytes);
int fileWrite(int handle, void *buffer, unsigned bytes);
int fileCreate(int handle, char *name, struct fileStatus *buf);
int fileRemove(int handle, const char *name);
int fileStat(int handle, struct fileStatus *buf);
int fileWstat(int handle, const struct fileStatus *buf);
int fileSeek(int handle, int offset, seekType whence);

The fileOpen procedure is used to open a file, returning a positive file handle on success. Each of the other procedures uses this handle for successive operations. The fileStatus structure is analogous to the Unix "stat" structure; it is used to store information about the file, such as its size, owner and last access time, and is read and written by the fileStat and fileWstat calls respectively. The fileCreate and fileRemove procedures require a handle to the directory in which the file is to be created or removed, together with the name of the file (which must not contain any pathname separators, such as "/").

B.4 Mount Interface

The mount service provides a mapping between name prefixes and service identifiers.
A new mapping is added with the mount call, and removed with the dismount call:

int mount(char *path, int servID);
int dismount(char *path);

Given a file name, the mountQuery procedure searches for the longest path which matches the file prefix, and returns the service identifier of the backing service:

int mountQuery(const char *name, unsigned *plen);

The length of the prefix matched is returned using the plen parameter.

APPENDIX C

Scheduler Interface

This appendix contains details of the callback scheduler interface provided by Kea. The unique properties of this interface are described in Section 3.13. This interface provides a full separation of scheduler policy (when and which threads are run) from the operating system mechanisms required to support those operations (clock ticks, context switching, etc.). Following is a list of function prototypes in the scheduler interface ("schedInt") and their purposes. There is one special global variable also associated with the scheduler, doPreempt. This variable is of type bool (boolean) and can be set to true by any of the scheduler functions. When set, the system guarantees to preempt the current thread, and call schedIntSelect, immediately upon return.

void schedIntInit()

This function is called once at system start-up.

void schedIntAdd(struct thread *t)

This function is called when a new thread is created.

void schedIntRemove(struct thread *t)

This function is called when a thread is destroyed.

void schedIntSetInfo(struct thread *t, struct sched_info *s)

Called when thread scheduling properties are changed.

void schedIntRunnable(struct thread *t)

Called when a thread is made runnable.
void schedIntNonRunnable(struct thread *t)

Called when a thread is made non-runnable (i.e. the thread is going to sleep, or has been suspended).

struct thread *schedIntSelect()

Called when a new thread should be selected to be run. Returns the thread selected, or NULL if no thread is currently runnable.

void schedIntTick()

Called on every clock tick.

void schedIntEnqueue(struct thread *t, mutex *m)

Called when a thread is going to sleep on a mutex/semaphore.

void schedIntDequeue(mutex *m)

Called when a thread acquires a mutex/semaphore. These last two functions enable a scheduler to implement custom handling of semaphore queues.

