Open Collections

UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

The raven kernel: a microkernel for shared memory multiprocessors Ritchie, Duncan S. 1993

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


831-ubc_1993_spring_ritchie_duncan.pdf [ 5.72MB ]
JSON: 831-1.0051302.json
JSON-LD: 831-1.0051302-ld.json
RDF/XML (Pretty): 831-1.0051302-rdf.xml
RDF/JSON: 831-1.0051302-rdf.json
Turtle: 831-1.0051302-turtle.txt
N-Triples: 831-1.0051302-rdf-ntriples.txt
Original Record: 831-1.0051302-source.json
Full Text

Full Text

THE RAVEN KERNEL: A MICROKERNEL FOR SHARED MEMORYMULTIPROCESSORSByDuncan Stuart RitchieB. Sc. (Computer Science) University of British Columbia, 1990A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFMASTER OF SCIENCEinTHE FACULTY OF GRADUATE STUDIESDEPARTMENT OF COMPUTER SCIENCEWe accept this thesis as conformingto the required standardTHE UNIVERSITY OF BRITISH COLUMBIAApril 1993© Duncan Stuart Ritchie, 1993In presenting this thesis in partial fulfilment of the requirements for an advanced degree at theUniversity of British Columbia, I agree that the Library shall make it freely available for refer-ence and study. I further agree that permission for extensive copying of this thesis for scholarlypurposes may be granted by the head of my department or by his or her representatives. Itis understood that copying or publication of this thesis for financial gain shall not be allowedwithout my written permission.Department of Computer ScienceThe University of British Columbia2075 Wesbrook PlaceVancouver, CanadaV6T 1Z1Date:4pri:I 31)1 19‘13AbstractThe Raven kernel is a small, lightweight operating system for shared memory multiproces-sors. Raven is characterized by its movement of several traditional kernel abstractions into userspace. The kernel itself implements tasks, virtual memory management, and low level excep-tion dispatching. All thread management, device drivers, and message passing functions areimplemented completely in user space. This movement of typical kernel-level abstractions intouser space can drastically reduce the overall number of user/kernel interactions for fine-grainedparallel applications.iiTable of ContentsAbstract^ iiList of Tables^ viList of Figures^ viiAcknowledgement^ ix1 Introduction 11.1 Motivation ^ 21.2 Raven kernel overview ^ 32 Run-time environment overview 52.1 Hardware Overview ^ 52.2 Software environment 143 Kernel level implementation 183.1 The interrupt model ^ 193.2 Low level mutual exclusion 203.3 List and Queue Management ^ 233.4 Low level console input/output 293.5 Memory management ^ 313.6 Physical memory management ^ 333.7 Memory mapping and cache management ^ 353.8 Virtual memory management ^ 433.9 Task management ^ 49iii43.10 Kernel management for global semaphores ^3.11 Interrupt management ^3.12 User/Kernel Shared memory regions ^User level kernel implementation737584874.1 Upcall handling ^ 884.2 User level spin-locks 924.3 Thread management ^ 934.4 Semaphore management 1014.5 Interprocess communication ^ 1074.6 The global nameserver 1194.7 User/User Shared memory regions ^ 1215 Performance Evaluation 1235.1 Benchmark tools ^ 1235.2 Function calls vs. System calls ^ 1245.3 Thread management performance 1245.4 Interrupt handling performance ^ 1275.5 Task signalling performance 1295.6 Interprocess communication performance^ 1305.7 Memory management performance 1355.8 Ethernet driver performance^ 1366 Related Work 1376.1 Low-level mutual exclusion ^ 1376.2 Threads ^ 1406.3 Interprocess communication ^ 1437 Conclusion 145iv7.1 Summary ^  1457.2 Future Work  146Appendices^ 148A Kernel system call interface^ 148A.1 System calls provided to the user level kernel ^  148A.2 System calls provided for application programs  149B User kernel library call interface^ 151B.1 Thread management ^  151B.2 Synchronization primitives  151B.3 Asynchronous Send/Receive port IPC ^  151B.4 Synchronous Send/Receive/Reply port IPC  152B.5 Nameserver ^  152B.6 User level memory management ^  152B.7 Interrupt and exception management  152C Unix version^ 154Bibliography^ 156vList of Tables2.1 GNU C computer and kernel register usage^  145.2 4 parameter user level function call vs. kernel system call^ 1255.3 Thread management performance. ^  1255.4 Interrupt service routine invocation latencies. ^  1285.5 Task signalling invocation latencies. ^  1305.6 Performance of asynchronous and synchronous local ports, 4 byte data message. 1325.7 IPC performance for asynchronous and synchronous global ports, 4 byte datamessage^  1335.8 IPC primitive breakdown^  1345.9 Virtual memory operation execution times. ^  1355.10 Time to transfer 10MB of data over Ethernet  136viList of Figures1.1 High level system organization. ^  42.2 88200 CMMU address translation table format^  72.3 Hypermodule physical address map, showing DRAM, VMEbus, and device utilityspace. ^  83.4 Kernel source code organization. ^  183.5 kprint module system call summary.  293.6 Memory management system decomposition. ^  323.7 Virtual memory system call summary.  433.8 Typical user level memory space^  443.9 Initialized supervisor memory space.  493.10 task. c module system call summary.^  503.11 Destroying a remote task. ^  653.12 Task state transitions.  673.13 Task ready queue structure^  683.14 Kernel level global semaphore support routines^  733.15 Interrupt handler registration system calls.  753.16 Register r28 usage^  763.17 Kernel interrupt dispatching process^  813.18 User/Kernel shared memory regions  854.19 User level kernel source code organization^  874.20 User level upcall dispatching routines.  884.21 Summary of upcall events and the user level upcall handlers. ^ 90vii4.22 Thread context save area and stack buffer. ^  95viiiAcknowledgementFirst, I would like to thank Gerald Neufeld for giving me the opportunity, support, and guidanceto lead me through this project. I would also like to thank the National Science and EngineeringResearch Council. The importance of scholarship funding cannot be understated. Specialthanks go to Norm Hutchinson for answering an endless supply of questions and for taking thetime and patience as a proofreader.To the digital side of my life, I would like to thank bowen. Finally, extra special thanks goesto Wendy, who helped make the time between long hours of thesis work more enjoyable.ixChapter 1IntroductionOne of the goals of a multitasking operating system is to present the illusion of parallelism tousers. To the naked eye, programs running in such an environment appear to run concurrently.At a conceptual level, the system can use the notion of threads and processes to present a parallelmodel for constructing programs. On uniprocessor hardware, the parallelism seen in these twocases is indeed illusion. It is the relative speed of microprocessors and operating system designtechniques such as time-slicing and cooperative scheduling that make this illusion possible.The recent proliferation of low-cost multiprocessor hardware has made it possible to changethis illusion into reality. The parallelism that a user or application developer sees is no longeran illusion — it is real. With multiple processors, separate threads of execution can executeconcurrently. While the amount of true parallelism is bounded by the number of processors inthe system, the illusion of a general purpose parallel environment can be enhanced by usinguniprocessor techniques such as time-slicing and cooperative scheduling.However, while scheduling techniques can be borrowed from uniprocessors and applied tomultiprocessors, their implementation cannot. Traditional uniprocessor operating systems oftenlack desireable features that are possible with parallel hardware. Also, they can be difficultto adapt to a multiprocessor environment, leading to inefficient operation [JAG86]. Thereare two main factors which contribute to inefficient operation: the increased requirement forconcurrency control within the kernel; and new scheduling possibilities.This thesis presents a new operating system kernel which addresses the above factors fora shared memory multiprocessing environment. This design is geared specifically towards uni-formly shared memory architectures, and not non-uniform architectures (NUMA) or distributed1Chapter 1. Introduction^ 2memory architectures.This thesis is organized as follows. The remainder of this introduction chapter introducesthe reader to motivation of this work, and describes the overall design of the kernel. Chapter2 describes the hardware and software run-time environments. Chapter 3 and 4 discuss theimplemention of the supervisor kernel and user level kernel, respecitively. Chapter 5 presentsperformance figures for the various kernel services. Chapter 6 provides a look at previous workthat influenced the design of the system. Finally, chapter 7 concludes the thesis with a summaryof future work.1.1 MotivationThe availability of low cost multiprocessor hardware has opened up new avenues for providinghigher performance programming environments for applications. However, the availability ofgeneral purpose operating system software to take advantage of new techniques is slow toappear.Traditional microkernel architectures, such as Mach 3.0 and Chorus, are designed withthe view that the kernel should only provide a minimum set of primitive abstractions (tasks,threads, virtual memory, device management, etc). These primitives are then used by userspace modules to provide additional services to the operating system environment, such asnetworking protocols and file servers.When used as general purpose computing environments, this architecture is sufficient. Ro-bustness is the most important requirement, with performance being second on the list. Thesystem is secure against malicious or errant user programs, and performance is adequate.Recently, however, the speed of network and other input/output devices has increased byan order of magnitude or more. The traditional kernel mediated operating system now exposesits performance problems more than ever. For many applications running under such an envi-ronment, the cost of crossing user/kernel boundaries for each primitive abstraction becomes aconcern. Typical system call overhead is likely 10 times greater than a procedure call [ALBL91].Chapter 1. Introduction^ 3By moving several of the high-use kernel services into user space, less time is spent invokingoperations. The general motivation is to reduce the overall number of user/kernel interactions.Several techniques can be employed to do this:• User level thread scheduling. Rather scheduling threads in the kernel, move the schedulingcode into the user space.• User level interrupt handling. Allow interrupt handlers to upcall directly into the userspace. Device drivers can be implemented completely in user space, eliminating the costsof moving data between the user and kernel.• User level interprocess communication. By making extensive use of shared memory be-tween client and server address spaces, data copying through the kernel is eliminated.• Low level synchronization primitives. Provide a simple mechanism to allow an event to bepassed from one address space to another. With appropriate hardware, remote processorinterrupts can be implemented completely at the user level.1.2 Raven kernel overviewThe Raven kernel is a small, lightweight microkernel operating system for shared memory mul-tiprocessors. The kernel executable compiles into less than 32KB of code. The most intriguingpart of the Raven kernel design is the dislocation of many traditional kernel services into userspace.Two main abstractions are provided by the kernel: tasks and virtual memory. All otherservices are provided by the user level: threads, semaphore synchronization, interprocess com-munication, and device management'. Extensive use of shared memory allows disjoint addressspaces to efficiently communicate scheduling information and interprocess communication data.Figure 1.1 demonstrates the supervisor kernel in relation with the user level.'The supervisor kernel initially handles interrupts, but they can be dispatched to the user level.The system is implemented for a four processor, Motorola 88100 shared memory multipro-cessor [Gro90]. A special gdb-based kernel level debugger, known as g88 [Bed90], allows thedevelopment environment to be hosted under a friendly Unix account, and provides downloadingfeatures to run programs.Several sample applications have been developed to demonstrate the kernel operation. Auser level serial port device driver implements a file server and terminal tty server to a Unixhost. An ethernet device driver demonstrates network data throughput through the kernel IPCservices.Chapter 2Run-time environment overviewThe main goal for an operating system is to provide a reliable and convenient work environment.Hidden beneath the operating system is the raw physical hardware, an environment far tooexacting for higher level users to deal with. Just as users are given a run-time environmentto work in, the operating system has its own run-time environment: main memory, exceptionsand interrupts, and the various devices that make up a general purpose computer. As well, theoperating system uses tools such as compilers and debuggers. This chapter provides an overallview of the hardware and software environment available to us. The first part of this chapterprovides an overview of the current hardware platform. The second part then describes the lowlevel software environment and development tools.In the spirit of efficiency and simplicity, the Raven kernel makes assumptions about thehardware architecture model: the kernel is designed for shared memory multiprocessor plat-forms, where each processor has the same view of memory and devices. Non-uniform memoryaccess (NUMA) machines and distributed memory machines require different considerations interms of memory management, scheduling, and input/output. Some systems, such as Mach[ABB+86], provide hardware compatibility layers to isolate porting details, and this can alle-viate porting to different classes of hardware architectures. But this generality has its costs interms of additional code size and complexity.2.1 Hardware OverviewThe hardware platform used for the implementation of the kernel is known as the MotorolaMVME188 Hypermodule [Gro90]. The Hypermodule is a general purpose shared memory5Chapter 2. Run-time environment overview^ 6multiprocessor, based on Motorola's 88100 RISC architecture [Mot88a]. The Hypermodulecontains four 88100 processors, each with a uniform view of 32MB of shared memory. Inaddition, the Hypermodule contains many devices such as timers, interrupt management, aVMEbus controller, and serial ports to make up a general purpose computer. This section willexamine many of the relevant aspects to this hardware, as it pertains to the kernel.2.1.1 The MC88100 RISC MicroprocessorEach of the four 88100 processors runs at a clock speed of 25MHz. Most instructions executein a single clock cycle. Using four processors, the Hypermodule can theoretically achieve aperformance rating of about 60 MIPS'.The instruction set is typical of many 32-bit RISC microprocessors: a simple set of singlecycle instructions used to build higher level constructs. All instructions are represented inmemory as 32-bit words, simplifying the decoding phase. A delayed branching feature can beused to alleviate pipeline stalls when executing branch instructions.There are 32 general purpose user level registers, r0 — r31, each 32-bits wide. Registerr0 is read-only and always holds a value of zero. In supervisor mode, the 88100 contains 21additional control registers: cr0 to cr20. These registers reflect the state of the processor mode,data unit pipeline, integer unit pipeline, and general purpose scratch registers. The floatingpoint execution unit contains 11 more registers: fcr0 to fcr8, and user registers fper62 andfpc63.2.1.2 The MC88200 Cache/Memory Management UnitThe 88200 CMMU augments the 88100 processor by providing instruction and data caches,as well as adding memory management. Each processor on the MVME188 uses two 88200's,giving each CPU 16KB of code and 16KB of data cache. A memory bus snooping protocolallows cache coherency between all caches in the system.1 VAX MIPSChapter 2. Run-time environment overviewThe memory management section implements a two level page table scheme with a pagegranularity of 4096 bytes. Figure 2.2 shows the layout of this page table scheme. Up to 4GBcan be mapped to a single address space. Each page in that address space can have variouscombinations of the following attributes: no translation, cache-disable, writethrough, writeinhibit (for read-only pages), and snoop enable. The detailed settings for these attribute bitscan be found in the 88200 technical documentation.Access to the 88200 registers for programming is done through the Hypermodule utilityspace (discussed below). Each of the 88200's in the system can be accessed separately, allowingcache flushes and page table management to be performed by the host processors.2.1.3 System ControllerIn addition to the processor, CMMU's and memory, the Hypermodule has a system controllerboard that contains all of the additional functionality and glue that make up a general purposecomputer. Components such as timers, serial ports, and the VMEbus controller are found onthis system controller board.Chapter 2. Run-time environment overviewThe system controller board maps all of its devices and memory into the processor's physicaladdress space. To access a device on the controller board, the processor executes load/storeoperations to the memory-mapped device registers. Other devices and resources can be addedto the physical address space through the VMEbus interface.Figure 2.3 shows the default physical address space that the Hypermodule resources residein. The remainder of this section will be devoted to examining each portion of this addressspace, and noting the features pertinent to kernel operation. For additional detail, consult thehardware manuals [Gro90] and [Mot884Figure 2.3: Hypermodule physical address map, showing DRAM, VMEbus, and device utilityspace.Chapter 2. Run-time environment overview^ 9Utility address spaceAll onboard devices and resources are located in an upper region of memory called the utilityspace. The utility space address map spans a 4MB region from Oxffc00000 to Oxffffffff.Each resource within the utility space are given their own section of address space to reside in,padded by null memory space.Access to the utility space by the operating system kernel is a neccessity. While the defaultbase address of the utility space can be modified by switching jumpers, it cannot be made togo away completely. In fact, the address translation hardware of the 88200's contain hardwiredentries to always make the utility space available in the supervisor memory map. This is doneso that the operating system can find the utility space in the event of translation table errorsor software malfunction.The following devices are made available inside the utility address space. Some of thesedevices are shared by user level memory maps. For instance, in order to implement a user levelserial port driver, the DUART registers must be mapped in user space.MC68681 DUART/TimerThe MC68681 provides the system software with two RS-232C compatible serial ports and aprogrammable interval timer. The serial port speeds can be programmed to support varyingdata rates, but also support 18 preprogrammed rates from 50bps to 38.4kbps. The kernel usesone serial port in polled mode for low level debugging support, and another interrupt drivenport to connect with attached terminals or Unix hosts.The kernel also uses this chip's interval timer. Using the 3.6864MHz clock on the DUART,an interrupt can be generated with a period anywhere from 540 nsec to 563 msec. The kernelallows this period to be configured at boot time, in terms of ticks per second. This value isthen used as the system clock tick.Chapter 2. Run-time environment overview^ 10Z8536 Counter/TimerIn addition to the interval timer provided by the MC68681, the Z8536 Counter/Timer deviceserves as an enhanced timer service. The Z8536 contains three individual 16-bit timers, eachusing a clock rate of 4MHz. The timers can be programmed seperately, or cascaded together toprovide higher resolutions. The kernel cascades two of these timers, allowing intervals from 1usec to 4295 sec (71 minutes) to be timed with 1 usec accuracy. This resolution is good enoughto benchmark relatively small sections of code with reasonable accuracy.The third timer on the Z8536 can be enabled as a watchdog timer for the Hypermoduleboard. When programmed to do so, the timer will trigger a reset sequence which reboots themachine. This feature allows the system to self-recover from fatal system crashes.MK48T02 Time-of-Day ClockAs if two timers were not enough, the Hypermodule also contains a time-of-day clock. Thisclock can be used to maintain the correct wall clock time (it is battery backed and will correctlymaintain the time when power is off). Currently, the kernel doesn't use the notion of wall clocktime anywhere, so this device is not used.MVME6000 VMEbus ControllerAccess to VMEbus peripherals is provided by the MVME6000 "VMEchip" controller [Mot88d].This device manages the interface between the Hypermodule memory bus (master) and theVMEbus, and the attached devices (slaves). It allows regions of the VMEbus address spacesto be mapped into the Hypermodule's local physical address space. For example, the Ethernetdevice driver uses the VMEchip to map in the Ethernet board's device registers at locationOx10000000.In addition to providing access to slave devices, the VMEchip contains features that facilitateoperation in a multiprocessor environment. If more than one Hypermodule board is installedon the VMEbus, the VMEchip on each board can be used for coordination and synchronizationChapter 2. Run-time environment overview^ 11services.128KB SRAMThe Hypermodule board contains 128KB of non-volatile (battery backed) SRAM, which canbe used to retain data and program instructions while the power is turned off.Currently, the kernel uses this section of memory to store the g88 debug monitor. Thedebug monitor code rarely changes, so once it is downloaded into the SRAM, it is there forgood. The next time the Hypermodule is started, the debugger checksums the monitor area tosee if it is complete. If the monitor checksum passes, there is no need to download a fresh copy(which can take a while at serial-port speed).512KB EPROM and 188BUG DebuggerAn EPROM chip module contains a standalone, onboard monitor/debugger known as 188Bug[Mot88c}. This is a low level interactive debugger which can be accessed via the serial port usinga dumb terminal. The debugger is non-symbolic but full featured, with the ability to performraw operations on disk, tape and serial port devices. This allows code such as operating systemsto be easily bootstrapped from disk, tape, or serial port.In addition to the software debugging features, a full suite of diagnostics are provided totest and exercise the hardware.Since 188BUG is non-symbolic, it can be difficult to debug programs written in high levellanguages like C. Instead, the kernel uses a gdb-based debugger called g88 to run and test thesystem. g88 allows kernel code to be downloaded and interactively debugged using the familiargdb environment. Breakpoints can be set, data examined, just like debugging a regular userlevel process under Unix.Chapter 2. Run-time environment overview^ 12DRAM address spaceThe Hypermodule can be configured to hold up to 64MB of 6Ons DRAM, in 16MB increments.All of this physical memory is symetrically visible and available to all processors in the system.Peak memory bandwidth is 44.4MB/sec for reads, and 66.7MB/sec for writes.The onboard debugger, 188Bug, reserves the lowest 64KB region for its own use (execeptiontable, local variables, code, and stack). This region must be preserved for the 188Big to operateproperly. Sometimes 188Bug is useful, so the kernel doesn't intrude on its territory. Useablememory begins at Ox10000 and grows upwards.VMEbus address spaceThe Hypermodule VMEbus chipset supports 32-, 24-, and 16-bit VMEbus addressing modes.The 16-bit VMEbus SHORTIO space is hardwired to live in the upper 64KB region of memory,so that master and slave devices always know how to locate this region. All other addresseswithin the physical address space, between the end of DRAM and the beginning of utility space,can be used for mapping in portions of the 32-bit and 24-bit VMEbus address spaces.As a convention, the kernel uses 32-bit VMEbus address mapping for external devices.Currently only the Ethernet board is mapped in this fashion, so this convention is certain tochange for other kinds of devices.2.1.4 Interrupt managementIn many earlier multiprocessor systems, one special processor was designated as the I/O proces-sor. Doing so would help simplify an already complex memory and interrupt bus. This specialprocessor is specifically wired to accept all interrupts from external devices. While this doesleave other processors to do useful work for non-I/O bound operations, a processor that re-quires I/O must always go through the special I/O processor. This can introduce a throughputbottleneck in the system, and can complicate I/O and device driver code.The Hypermodule interrupt management scheme is fully symmetric. Any processor in theChapter 2. Run-time environment overview^ 13system can respond to any particular interrupt by setting appropriate bits in the processor'sinterrupt enable register (TEN). In a system with four processors, there are 4 interrupt enableregisters named IENO to IEN3.The hardware supports up to 32 different interrupt sources. Each bit in the IEN registerscorresponds to one of the possible interrupt sources. Some of the common interrupt sources are:DUART timer, DUART serial port, VMEbus IRQO — IRQ7, and software interrupts (SWI).Individual interrupt sources can be set to occur on any combination of the processors inthe system. It is possible to configure the IEN registers such that multiple processors receivethe same interrupt. In this case, all interrupted processors will halt execution and branch tothe interrupt handler when the particular interrupt occurs. This could generate a great dealof contention if spin-locks are used within the interrupt handler. Therefore, certain interruptsshould be enabled on only a single processor at a time. The kernel interrupt architecturemanages the setting of the interrupt enable bits to minimize latency to interrupt handlers.For example, the Ethernet device driver is a user level task that requires to be executedevery time an interrupt is generated on the Ethernet board. Switching in the task can involvecache flushes and address space changes and is therefore an expensive operation. This taskswitch is avoided if by setting Ethernet interrupt enable bit properly on the processor whichis currently executing the Ethernet task. Doing so localizes the interrupt to a single processorthat has the correct address space activated.If there is more than one processor allocated to an interrupt driver task, only one processorat a time has the device interrupt enable bit set. Which processor it is, is determined by thekernel. If the interruptable processor is relinquished, another processor allocated to the devicetask is assigned the interrupt bit.When a processor receives an interrupt exception, one of the first things it needs to do is findout exactly which device caused the exception. The interrupt status register (IST) allows this tobe easily accomplished. The IST is bit-for-bit similar to the IEN registers, except that it reflectsthe status of all interrupts in the system. If a bit is set in the IST, then the correspondingChapter 2. Run-time environment overview^ 14Register Compiler and system usager0 Read-only 0 constantrl Function call return addressr2 — r9 Function call parametersr10 — r13 Function temporary registersr14 — r25 Function preserved registersr26 Temporary scratch registerr27 Temporary scratch registerr28 Locking and upcall status bitsr29 User thread context save pointerr30 Stack frame pointerr31 Stack pointerTable 2.1: GNU C computer and kernel register usage.device is requesting an interrupt.Note that there is only one IST, and not one for each processor. The IST gives the globalstatus of all interrupting devices, and the IEN registers give the enable mask for each processor.So to check whether a particular processor received a particular interrupt, the processor mustAND the IST its IEN register: status = ien_reg [cpu] & istreg.2.2 Software environmentIn the hardware overview section, various features of the hardware runtime environment werediscussed. In this section, the software development environment is examined. This environ-ment includes the g88 debugger and simulator, compiler tools, and software conventions suchas register usage. Other factors which influence the runtime environment are the processorsupervisor/user state settings, and supervisor register conventions.2.2.1 The gcc compiler and toolsThe C compiler used is the ANSI-compliant GNU gcc version 1.37.29. All other tools used forcompiling and linking executable code, such as the linker and assembler, are also GNU tools.Chapter 2. Run-time environment overview^ 15The C compiler uses a standard format for processor register allocation. Table 2.1 showsthis register convention. Functions can always rely on registers r10 — r13 to be available fortemporary scratch purposes. Registers r14 — r25 can also be used for general purpose storage,but they must be preserved by the called function if they are to be used. The C stack framecontains the frame pointer, saved registers of the previous function, function return address, andlocal variable storage for the function. This information is useful when interfacing C programswith handcoded assembler.2.2.2 The g88 kernel debuggerThe g88 cross-debugger/simulator [Bed90] is a GNU gdb [Sta89] based kernel debugger. Im-plemented as an extension to gdb 3.2, g88 allows Hypermodule users to download code andinteractively debug their programs from within a standard gdb terminal session. g88 can setbreakpoints, step through code, examine variables, etc. — just about everything one expectsfrom a Unix version of gdb.g88 runs on a Unix workstation and connects to the Hypermodule hardware via two serialports. One serial port is used for the gdb command protocol, console input/output and down-loading code. The other is used as a software controlled interrupt and reset line. This line isconnected to the Hypermodule reset and interrupt logic, and is used to reboot and interruptthe target. For example, when control-C is pressed within g88, the interrupt logic is toggled,generating an ABRT interrupt on the Hypermodule. The onboard g88 monitor program recog-nizes this interrupt, suspends kernel execution, and returns control back to the g88. This allowsusers to interactively debug their code running on the Hypermodule, with the same control asif it were a regular Unix process.When starting a session, g88 downloads a monitor program to the Hypermodule memory.This monitor, known as g88mon, handles the g88 serial port communication protocol betweenthe Unix host and Hypermodule. It manages all the gdb features such as breakpoints, singleChapter 2. Run-time environment overview^ 16stepping and memory examination. The monitor is designed and implemented to be as unob-trusive as possible. It is downloaded into a portion of the Hypermodule non-volatile SRAM,and resides there until it is erased. The program being debugged does not have any knowledgeof g88mon existance in the system.2.2.3 The g88 Hypermodule simulatorg88 also contains a complete instruction level simulator of the MVME188 Hypermodule. Thesimulator provides a virtual environment that is a clone of the Hypermodule: four 88100 pro-cessors, eight 88200 CMMUs, and all the system controller devices. An environment variablecontrols the size of the simulated physical memory size. Programs running in the simulator arevirtually oblivous to the fact that they are running in a simulated environment. g88 providesthe customary gdb interface to the simulator, so programs can be downloaded, executed, anddebugged.When the simulator is used to execute something, no physical serial links are necessarybecause g88 simulates the hardware directly in its address space on the Unix host. The overheadassociated with sending commands across the serial link is eliminated. Thus the downloadingof code and overall communication with the "physical" hardware is much faster. This propertymakes the simulator ideal for debugging and development, where a fast edit/compile/test cycleis desirable. It takes about 60 seconds to download the Raven kernel to the real hardware,compared to a fraction of a second for the simulated hardware. Similarly, overall debuggingcommands on the simulator are much faster.Raw execution speed of the simulator is of course much slower than the real hardware (about100 times slower on a SPARC 1+). However this does not impact the useability of the Ravenkernel on the simulator for debugging purposes. The kernel itself and most user level programsat this point are not compute bound, so most executions are fast enough.The current version of g88 has one major limitation: it can only be used to debug supervisorcode. This means that g88 cannot debug user level programs. This limitation stems from theChapter 2. Run-time environment overview^ 17fact that gdb 3.2 is only able to manage one execution context (ie, one program) at a time.However, g88 is still under development by its author, in cooperation with Horizon Research,for Mach 3.0 work. An upgrade to g88 which will allow user level debugging and performanceimprovements will soon be available.Chapter 3Kernel level implementationThe Raven kernel is split into two distinct entities: the supervisor kernel, and the user levelkernel. Each of these entities is implemented as separate executable programs. This chapterdiscusses the design and implementation of the supervisor kernel part.Figure 3.4 shows the modular breakdown of the system, and the source code files involved.The first section in this chapter discusses the overall programming model of the kernel, and thefollowing sections describe the code modules in detail.Several of the kernel modules export a system call interface to the user level. These systemChapter 3. Kernel level implementation^ 19calls are broken into two categories: calls available for general purpose user applications, andcalls available for user level kernels only. For modules which export such a system call interface,each description in this chapter will contain a table summarizing the interface provided by thatmodule.System calls, and internal kernel functions, often return a success or failure condition tothe caller. Throughout the implementation of the kernel, the following convention is usedfor function return values. Functions that complete successfully always return the value OK.Functions that fail in some manner during their invocation return the value FAILED.3.1 The interrupt modelThe implementation of the kernel is based on the interrupt model, as opposed to the processmodel. In the process model, a kernel is composed of several cooperating processes, each ofwhich has their own stack and local state variables. Interrupts and preemption are normallyallowed during execution. The processes must be scheduled by the kernel, and special cases forpreemption locking and concurrency locking must be explicity coded.In the interrupt model, the kernel can be viewed as one big interrupt handler. All kernelinvocations, including system calls, device interrupts, and exceptions, enter into the kernelthough the exception vector table, excp_vects.s. From this point, the low level exceptionhandler routines allocate a stack for the processor and dispatches the event.Any number of processors can be executing within the kernel at the same time. While aprocessor is executing within the kernel, interrupts and preemption is disabled for that proces-sor. This simplifies the implementation quite substantially, because there is no need for specialpurpose preemption locking and protection. All calls to the kernel are non-blocking: except forthe processor relinquishment call, all kernel calls return immediately to the user after executing.When control returns to user space, interrupts and preemption is re-enabled.Each processor in the system uses its own dedicated kernel stack. In a four processorsystem, there are four dedicated kernel stacks. The kernel stacks are located at well-known fixedChapter 3. Kernel level implementation^ 20locations in the supervisor memory space, and are allocated at very beginning of initializationtime. Each time the kernel is invoked, the processor stack pointer is set to the top of its kernelstack. The system can assume that there is only one execution context per processor whilerunning inside the kernel. This substantially simplifies implementation issues.3.2 Low level mutual exclusionIn a shared memory parallel environment like the Hypermodule, many algorithms require thatsections of their code have atomicity. These critical sections of code must be executed atomicallyin mutual exclusion with their neighbours, or risk leaving their state inconsistent. In the kernel,these algorithms range from simple enqueue/dequeue operations on a shared queue, to ensuringsequential access to device registers.Several techniques exist to ensure mutual exclusion. One technique is to avoid criticalsections altogether by implementing data structures as lock-free objects [Ber91] and optimisticsynchronization [MP89]. In this case, a compare-and-swap operation allows data structures tobe concurrently accessed with consistency. However, the lack of proper hardware support forcompare-and-swap on the Hypermodule hardware does not make this algorithm practical.Other techniques range from the simple spin-lock to the higher level semaphore. The lattertechnique relies on the operating system to "schedule around" critical hot spots by relinquishingthe processor to another thread when such a spot is reached. This technique requires somecooperation with the operating system, commonly in the form of semaphore data structures,which in themselves require mutual exclusion. Hence the requirement for a more primitivemutual exclusion technique.3.2.1 Spin locksThe spin lock is a brute force method of providing mutual exclusion around sections of code.The algorithm tests a shared lock variable to see if the lock is free or used. If the lock is free, thelock is claimed by setting its state to "locked", and execution continues. If the lock is not free,Chapter 3. Kernel level implementation^ 21continually check until it is free. A lock is freed by storing a "free" value to the lock variable.While waiting for a lock to become free, the processor cannot do anything else. In a systemwith many processors, this property can become a bottleneck when frequently accessing a sharedresource.3.2.2 lock_wait() implementationOn many processors, a test-and-set instruction is used to ensure atomicity of the lock variabletest and set stage. The test stage is broken into two parts: one instruction loads the value of thelock variable, another instruction tests its value. If the value of the lock variable is altered afterthe load instruction but before the test, the algorithm will fail. To prevent this problem, the88100 instruction set includes the xmem instruction. The xmem instruction atomically exchangesthe contents of a register with the contents of a memory location. The load and store accessesof the xmem instruction are indivisible: the instruction cannot be interrupted part-way throughits execution.Using the atomic exchange instruction, the lock acquire routine can be safely implementedin the following manner:_lock_wait:^ ; r2 <- lock variable addror^r10, 1.0, 1^; set a lock valuelw: xmem^r10, r2, r0 ; atomic exchangebcnd^ne0, r10, lw^; try lock again if not^r1^ ; return to callerlock_wait() begins by putting a "locked" value into register r10. This register is thenexchanged with the lock variable stored in memory at the address r2. As a result of theexhange, r10 is loaded with the previous lock value. If this value is non-zero, then the bcndinstruction branches to the top of the routine, where the test starts again. This sequence isrepeated until a "free" value is found in the lock value. The memory bus transactions generatedby the repeated accesses to the lock variable can quickly saturate the memory bus, hinderingother processors in the system from doing useful work.Chapter 3. Kernel level implementation^ 22An optimization can be achieved by relying on the data cache to maintain a coherent copyof the lock variable. In this case, initially the processor spins on a cached copy of the lock,generating negligible memory bus accesses. If the lock is freed, the processor performs an xmemto grab the lock. If this xmem fails, then return to spinning on the cached copy. This algorithmis implemented as follows, and can be found in lock. s:void lock_wait(int *lock_addr);_lock_wait: ; r2 <- lock variable addrld r10,^r2, 0 ; read the lock valuebcnd ne0,^r10, _lock_wait ; loop if lock is busy.or r10,^rO, 1 ; set a lock value.xmem r10,^r2, r0 ; atomic exchangebcndjmpne0,^r10,ri_lock_wait ;;try lock again if nonzeroreturn to caller3.2.3 lock_free() implementationFreeing a lock is trivial. All that needs to be done is to store a "free" value to the lock variable.This can be accomplished in a single st instruction on the 88100.void lock_free(int *lock_addr);_lock_free:^ ; r2 <- lock variable addrjmp.n r1 ; return to callerst^rO, r2, 0^; store "free" valueIn this routine, the jmp .n rl instruction demonstrates the use of delayed branching on the88100. The instruction cycle immediately after a control transfer instruction, such as jump,is known as the delay slot. This is where the processor figures out where execution shouldcontinue. While it does this, another instruction can be executed. In this case, the st rO, r2,0 instruction is executed in the jmp rl delay slot.3.2.4 Lock initializationBefore using a spin lock, a lock variable must be allocated and initialized. The lock variable isof type int and can be allocated in any appropriate fashion, such as statically at compile time.Chapter 3. Kernel level implementation^ 23Initializing a lock variable is as simple as assigning zero to it:int my_lock = 0;lock_wait(&my_lock);lock_free(&my_lock);3.2.5 SummaryThe simple spin lock can become surprisingly complex, as shown by [And89] and [KLMO91].These more complex techniques arise from differences in hardware characteristics, such as thenumber of processors, memory bus architecture, and instruction sets.In a shared memory system such as the Hypermodule with only four processors, spinningon a cached copy of the lock is sufficient to attain good performance. This algorithm is im-plemented by the routines lock_wait 0 and lock_free(). Modules throughout the kernel usethese routines to provided mutual exclusion for shared data structures and hardware devices.3.3 List and Queue ManagementList and queue data structures are primitive and fundamental building blocks for operatingsystems. Much of the information stored within the kernel, such as task control blocks andmemory regions, are kept track of using linked lists. The operations involved are insertions,removals, and traversals. This section describes these operations implemented in the kernel andthe data structures involved.All of the queue management routines are implemented as C macros, so they are easilyinlined to avoid procedure call overhead. The routines could be coded in assembler, but inliningability and portability would be sacrificed.The queue macros are designed to be type-flexible. This allows a small set of macro routinesto be general enough to handle any queue node structure. To do this, the macros require thatthe caller specify the node type. When the macros are expanded at compile time, the properChapter 3. Kernel level implementation^ 24type casting is performed using the supplied type. This technique is similar to the queue macrosseen in the Mach kernel.There are some algorithms where an ordered list of items is a basic requirement. Forexample, the system clock timer maintains an ordered list of task control blocks. The list issorted by the wake_time key. This allows the clock timer to examine only the head node onthe list to check for a timer expiry.The queue macro package does not provide any macros for ordered list or priority queuemanagement. It is felt that such a structure can be efficiently constructed using the supplied,more primitive macros.In a multiprocessor environment, concurrency control is required to protect against multipleaccesses to common data structures. Many of the system queues are shared amongst all theprocessors. The queue management library does not implement locking, leaving it to be doneexplicitly by the user. This option can reduce overhead when locking is not required.The queue and linked list structures are simple enough to allow the use of the spin locklibrary seen in the previous section. The use of spin locks to manage queue structure access isdemonstrated in the code fragments below.3.3.1 Single linked listsMost of the data that is kept on the linked lists are arrays of statically allocated control blocks.Some control block data structures are very simple and don't even need the flexibility providedwith a double-linked list. For example, the rawpage_table 0, which keeps track of physicalpages in the system, uses a single-linked list to remember all the free pages. To get a new page,rawpage_alloc() dequeues the head pointer from the free list. To free a page back to the pool, enqueues the page onto the free list head. The rawpage routines don't needto be able to remove elements from the middle or end of the list, so a backward link is notrequired.Simple linked lists as in the above example are used throughout the system whenever aChapter 3. Kernel level implementation^ 25simple allocate/free operation of some data object is required. Since the lists are so simple,the queue management library doesn't provide any help — it's up to the kernel programmerto provide the head and next pointers for the list structure. This allows the list types to betailored for use within a code module.3.3.2 QueuesThe most common type of list used in the kernel is the double-linked list. These lists areused extensively as FIFO queues for task, thread, and semaphore management, to name a few.Since the queue operations are so basic to the operation of the kernel, a set of macros and datastructures are provided to simplify the job and make the code look a bit cleaner.All queues have a master handle of type QUEUE which maintain the head and tail pointerfor the linked list. The other data structure that is used is the QUEUE_LINK structure. TheQUEUE_LINK structure contains the next and previous pointers, and is used as the "link" foreach node in the list. Each node in a queue must have at least one of these structures. Usingmore than one link per node allows nodes to be linked to several different queues at once.The task control block table, task_table 0 , provides a good example of how these structuresare used. Each TASK structure contains several fields of data, two of which are the QUEUE_LINKstructures:/* next/prey link for queue structures */QUEUE_LINK sched_link; /* scheduler queue */QUEUE_LINK timer_link; /* timed event queue */These links are used to connect the task control block to the scheduler queue and timerqueue s .Enqueue operationsThere are three possible enqueue operations: enqueue at head of list, enqueue at the tail of list,and enqueue before a given node. For queues that need to pay attention to FIFO ordering, the1 The timer queue is used by the system clock timer to notify tasks of timed events.Chapter 3. Kernel level implementation^ 26enqueue/dequeue operation always occur on opposite ends of the queue. The kernel follows thestandard convention that nodes are always enqueued at the tail and dequeued from the head,as seen in the following example from task_create 0:/* queue the task on the suspended queue */lock_wait(&task_susp_q_lock);ENQUEUE_TAIL(task_susp_q, task, TASK, sched_link);lock_free(&task_susp_q_lock);The following macros can be used for enqueue operations:• ENQUEUE_HEAD(queue, node, node_type, node_link)• ENQUEUE_TAIL(queue, node, node_type, node_link)• ENQUEUE_ITEM(node, next_node, node_type, node_link)This macro inserts a node directly before the supplied next_node. This macro can not beused when next_node is either the head or tail of the list — in those cases, the other twoenqueue routines should be used.Dequeue operationsThe following two dequeue operations are provided.• DEQUEUE_HEAD(queue, node, node_type, link)• DEQUEUE_ITEM(queue, node, node_type, link)Note that these macros do not check for empty queues. It's an error to try and dequeue anode from an empty list. To protect against this error, the QUEUE EMPTY() macro can be usedto check for the empty/non-empty condition. For example, task_create 0 uses the followingcode to get a new task control block:lock_wait(&task_free_q_lock);Chapter 3. Kernel level implementation^ 27if ( QUEUE_EMPTY(task_free_q) ){lock_free(&task_free_q_lock);kprint("task_create: no free tasks!\n");*task_id = -1;return(FAILED);}/* get a free task descriptor */DEQUEUE_HEAD(task_free_q, task, TASK, sched_link);lock_free(&task_free_q_lock);Miscellaneous queue operationsAs the previous example illustrates, the following queue operations can sometimes be useful:• QUEUE_EMPTY(queue)This macro evaluates to non-zero if the specified queue is empty.• QUEUE_HEAD(queue)This macro returns the first node of the queue.• QUEUE_NEXT(queue, node, link)This macro returns the next node on the link after the specified node.Allocating and initializing a queueBefore a queue list can be created, a master queue handle must be allocated and initialized.The kernel uses the policy of static allocation wherever possible, so the handles are normallyallocated at compile time by declaring a variable of the QUEUE type. The following code fromtask. c demonstrates this:QUEUE task_free_q; /* queue of free tasks */QUEUE task_susp_q; /* queue of suspended tasks */int task_free_q_lock; /* spin locks for the above queues */int task_susp_q_lock;Chapter 3. Kernel level implementation^ 28Once the master queue handle is allocated, it must be initialized to contain the value ofan empty list. The QUEUE_INIT 0 macro performs this task, as demonstrated in the followinginitialization code from task_init ():/* create task system queues. */QUEUE_INIT(task_free_q); task_free_q_lock = 0;QUEUE_INIT(task_susp_q) ; task_susp_q_lock = 0;3.3.3 Other linked list schemesThe Xinu operating system [Com84] uses an interesting technique to implement double-linkedlists for its process queues. In this scheme, the node links are stored in an array of links,completely separate from the process descriptors. There is a one-to-one mapping between linksin the array and process descriptors. Both the links array and process table are indexed by aninteger. So, if you know which link you are, you automatically know which process descriptoryou belong to. Next/previous pointers in the links allow for double-links, and an extra key fieldallows ordered lists to be constructed.This way of structuring a queue can make the enqueue/dequeue operations very efficient.However, it does have the limitation that a node can only reside on one queue at a time. TheRaven kernel task scheduler and system clock timer require that a task be queued on two queuesat once, so this scheme cannot be used.3.3.4 SummaryQueues provide a means of ordering and organizing data. They are a basic component inoperating system kernels. Both single linked and doubly linked lists are used throughout thekernel to organize resources such as task descriptors and memory pools. The routines usedin the kernel to manage these queues are implemented as C macros which are portable andefficient.Chapter 3. Kernel level implementation^ 293.4 Low level console input/outputDuring the development stages of an operating system, one of the greatest aids is a low levelconsole output routine. Sometimes, the only way to debug at the kernel level is to outputdebugging strings to the console. The output routine should be simple enough that it can beexecuted from anywhere, independently of the rest of the system. This allows information to beprinted out whether or not the kernel is functioning properly, or from within interrupt handlersand other delicate routines.Using polled output in the presence of interrupt drivers can have negative consequences.So rather than always sending console output through the serial port, a dedicated portion ofmemory is set aside for console messages. This section of memory is known as the kmsg buffer.Synpopsis Interface availability Descriptionkprint () user application Prints out a string to the console (polled output).kgetstr () user application Waits for a string from the console (polled input).Figure 3.5: kprint module system call summary.The kprint . c module implements the console polled input/output driver. The table inFigure 3.5 summarizes the system call interface exported to the user level by this module.3.4.1 Console outputvoid kprint (char *str);This routine is the lowest level output routine. It takes a null-terminated string of charactersas its argument, and uses polled output to send the string to the console.void kprintf(char *fmt, ...);This is the kprint() routine for formatted printing. It passes the format string and ar-guments to sprintf 0 for formatting, and then calls kprint() to output the resulted string.kprintf 0 understands the following format sequences: %c, %s, %d, and %x.Chapter 3. Kernel level implementation^ 30Special care should be taken when calling kprintf 0: the output string should not begreater than 200 bytes, or the stack will be damaged. kprintf () allocates a temporary 200byte output buffer on the caller's stack.On the real hardware, the g88mon monitor controls communication across the serial portto provide access to the g88 console under Unix. In the simulated environment, the simulatorcontains a character console device at Oxffff0000. Access to both interfaces uses strictly polledI/O.void kprint_mode(int mode);This routine controls the supression of console output strings to the serial port device. Insome cases, the blocking nature of polled I/O has negative consequences. Outputting a stringacross the serial port can take a long time in relation to other devices. For the Ethernet device,the time taken to output a debug message may result in lost packets.kprint_mode() can help prevent this problem by allowing console output to be shut off underprogram control. This is useful when debugging code where interrupt activity is necessary, suchdebugging an Ethernet protocol stack. Passing a non-zero value for the mode argument allowsoutput to the serial port. Passing a zero value for the mode argument supresses serial portoutput. In either case, the kmsg buffer logs all output strings, which can be viewed at a latertime.3.4.2 Console output bufferThe console output buffer, or kmsg buffer, is a large buffer in the kernel memory space whichlogs all strings that have been outputted to the console using kprint(). Even strings that aresent with the mode disabled are placed in the kmsg buffer. This allows all console output to beviewed at a later time using the debugger.The kmsg_base variable points to the beginning of the kmsg buffer. The g88 debugger canbe used to print out console strings starting at the kmsg_base address. For example:Chapter 3. Kernel level implementation^ 31[0] (gdb) x/4s kmsg_baseReading in symbols for kprint.c...d one.^Oxle80000:^(char *) Oxle80000 "CPU 0 started\n"Ox1e8000f:^(char *) Ox1e8000f "Executing on real hardware.\n\n"Ox1e8002d:^(char *) Ox1e8002d "total system memory: 8192 pages\n"0x1e80065:^(char *) 0x1e80065 "avail user memory:^7751 pages\n"[0] (gdb)The size of the kmsg buffer is controlled at compile time by the KMSG_BUF_SEGS constant.This buffer is allocated in segment sizes of 512KB, to facilitate memory mapping.3.4.3 Console inputint kgetstr(char *buf, int buf_size);In addition to providing a means to output information, the kernel also has a primitiveway to input information. This can be useful to ask confirmation questions at boot time, forinstance. The caller supplies the preallocated buffer to place the inputted string in buf and themaximum length of the string in buf_size. The size of the inputted string is returned.3.4.4 Console I/O initializationBefore doing any low-level console I/O, the structures must be initialized. This is done at boottime by the kernel initialization kprint_init();This routine sets the console output mode to 1, enabling output to the serial port device,and clears the kmsg buffer.3.5 Memory managementDuring the normal operation of an operating system, memory allocation and deallocation arecommon tasks. Beneath the operating system lies a contiguous area of physical memory, piecesChapter 3. Kernel level implementationof which are parcelled off for various uses, and later returned. User level tasks do not workdirectly with these raw regions, however. The memory management system provides a protectedlinear address space for user level programs to work in.The memory management system divides its work into three distinct modules: virtual mem-ory management, memory mapping and cache management, and physical memory management.Figure 3.6 shows the layered relationship of each module. This section describes each modulein detail.The virtual memory module provides the user level system call interface to the memory man-agement system. Routines are provided to allocate, deallocate, and share regions of memory.A region is a contiguous, page aligned portion of memory in a virtual address space.The virtual memory module relies on the memory mapping module to manage the 88200CMMU page tables. Regions of memory can allocated at specific addresses and with variouspage protection statuses. The processor instruction and data caches must also be maintainedthroughout memory allocation/deallocation operations. The mapping module also providesuser level address space switching functions.The physical memory module provides a simple and efficient interface to the virtual memorymodule for allocating and deallocating physical pages of memory. Pages are allocated anddeallocated on a page granularity.Chapter 3. Kernel level implementation^ 333.6 Physical memory managementThe physical memory management module is very simple. Its job is to allocate physical pagesfrom the raw memory pool, and return physical pages when the virtual memory layer is finishedwith them. While the virtual memory layer manages contiguous regions of an address space, thephysical memory layer works on a simple page-by-page basis. The allocation and deallocationroutines always work with single page units.The rawpage_table 0 keeps track of each physical memory page in the system. The tableis statically allocated at compile time to contain one entry for each page in the system. Whenmore physical memory is added to the system, more elements need to be allocated in thistable. The static allocation strategy is used mainly for simplicity. However, in the future, thisstatic table could easily be replaced with a table that is allocated at runtime, after probing thehardware for the true physical memory size.The following data structures are used to manage the physical memory pages:typedef struct raw_page_s{int count;int lock;struct raw_page_s *next;} RAW_PAGE;/* page reference count *//* spin-lock to protect access *//* next free page */RAW_PAGE r awpage_t able [RAWPAGE_TABLE_SIZE] ;RAW_PAGE *pages_free_head;^/* linked list of free pages */int pages_free_lock;^/* spin-lock to protect list access */The RAW_PAGE structure provides a handle for each physical memory page in the system.The count field implements a reference count for the page: when a page is allocated or shared,its reference count is incremented. When a page is freed, its reference count is decremented.This allows the system to keep track of which pages are used and which pages are unused.There are three routines provided by the raw page module. All of these routines are availablewithin the kernel only, they are not exported to the user level.Chapter 3. Kernel level implementation^ 343.6.1 Allocating a physical pageTo allocate a free physical page, the following routine is used.void *rawpage_alloc();No parameters are required. The base address of the physical memory page is returned. Ifthere are no free pages, the system panics (something more elegant could be done in future).Finding a free page is very simple. First, the free list lock is acquired. Then the first entryon the list is dequeued, and the lock is freed. The reference count is initialized. The address ofthe physical page is computed from the position of the page entry relative to the head of theraw page table. This address is returned to the caller.3.6.2 Freeing a physical pageFreeing a physical page is equally as simple. Using the page address, the raw page table entryis computed. The entry lock is acquired and the reference count is decremented. If the newreference count is 0, the page entry is queued to the beginning of the free page rawpage_free(void *page);The rawpage_free 0 call decrements the reference count for the given page, and returns itto the raw page pool if the count reaches Sharing a physical pageA reference count allows pages to be shared by different processes, keeping them from beingfreed and reused while they are being used elsewhere. If two processes are sharing a page, andone process frees the page from its address space, the page is not returned to the free memorypool. The page is only freed when the last process discards the rawpage_reference(void *page);The rawpage_reference() call increments the reference count for the specified physicalpage.Chapter 3. Kernel level implementation^ 353.6.4 Raw page initializationAt initialization time, the rawpage_table is initialized and the free pages linked list is created.Even at boot time, many pages are already allocated for the kernel data and code areas. Thesepages are removed from the free list and their reference count is marked for use. Access to thelist of free pages is protected using the pageslree_lock spin-lock. The allocate/free routinesmust acquire the lock before pages can be removed or added from the free rawpage_init () ;The rawpage_init 0 function is called by the boot processor to initialize the physical mem-ory pool. It returns the total number of free physical pages in the pool.3.7 Memory mapping and cache managementThe map module is responsible for controlling the 88200 memory management unit [Mot88b].All aspects to do with address translation and cache management are encapsulated in thismodule. Changes in the memory management hardware would only require changes to thismodule.Routines are provided to create and manage the hardware translation tables and cachingparameters. These routines are used by the virtual memory layer to provide the user andsupervisor space with virtual addressing. These address spaces include physical memory, aswell as access to hardware device control registers. None of these routines are user-accessible,they are only used locally within the kernel.3.7.1 Translation tablesAs described in the hardware overview section, the 88200 uses a two-level translation table toallow mapping of a 4GB address space. The first level segment table contains 1024 entries thatpoint to page tables. Each page table contains 1024 entries that provide the translation for asingle page of 4096 bytes. Refer to Figure 2.2 for a description of the translation table format.Chapter 3. Kernel level implementation^ 36In addition to address mapping information, the translation table entries contain attributebits that control cache and protection status. Different settings of these bits are used for differenttypes of memory pages. For example, since code pages are not modified during execution,code pages are marked read-only/cache-enabled. Device control registers are marked read-write/cache-inhibited. These attribute bits, as well as the 88200 control register offsets, can befound in the registers.h file. Refer to the 88200 user manual [Mot88b] for detailed informationon the attribute bits.Each 88200 maintains two registers called area pointers: the user area pointer (UAPR), andthe supervisor area pointer (SAPR). The UAPR and SAPR contain a master pointer to thetranslation table for the user and supervisor address spaces, respectively. Context switching anaddress space requires setting the area pointer from one map to another. Throughout executionof the system, the supervisor area pointer remains constant, while the user area pointer changesfor each running task. The map module keeps track of area pointers using the MAP structure:typedef struct{int map; /* 88200 area pointer */int lock; /* spin lock for this memory map */} MAP;The MAP structure contains the area pointer and a spin-lock to protect accesses to the trans-lation table. The lock must be acquired by any map routine before changes to the translationtable are allowed.Each processor in the system uses two 88200 units: one for the code caching and translation,and one for data caching and translation. Each 88200 can use their own set of translation tables,or they can share a common single table. The Raven kernel opts to have two tables: one forcode, and one for data. Thus each task descriptor in the system contains two map descriptors:MAP code_map;MAP data_map;Chapter 3. Kernel level implementation^ 373.7.2 Translation lookaside buffer (TLB)In addition to the instruction and data caches, the 88200 units also contain translation tablecaches, or translation lookaside buffers (TLB). The TLB stores frequently used translations,so that translation table searches are not required for each memory access. On the 88200, theTLB is termed the physical address translation cache (PATC), and contains 56 entries.But unlike the instruction and data caches, the TLB caches are not automatically coherentacross multiple cache units. Therefore, modifications to an activated translation table in theform of attribute changes or freed pages require TLB flushes if the address space is enabledon more than one processor. If this is not done, then remote TLB caches will not contain thecorrect translation table information.Another type of translation lookaside buffer in the 88200 is the block address translationcache (BATC). There are ten entries in this cache which can be set programmatically. Theentries remain until explicitly reprogrammed. Rather than mapping single 4KB pages, theBATC entries maps contiguous 512KB blocks. The operating system software can use theseentries to provide mappings for high-use code or data regions.3.7.3 Translation table storage areaA special region of the physical memory space is set aside at boot time for translation tablestorage. The kernel allocates segment and page translation tables for each task in the systemfrom this common storage area. This area is located at the very top of physical memory. Its sizeis defined at compile time, in a granularity of 128 pages, by the PAGE_TABLE_SEGS constant inthe file kernel . h. This allows the 88200 BATC entries to map the whole storage area into thekernel space at all times. Using the programmable BATC entries help avoid the chicken-and-eggproblem that occurs when an unmapped page is allocated to store a page table.Two macro routines are used to manage the allocation and deallocation of translation tablepages. These routines are very efficient: they simply keep track of a free list of pages in thepage table storage area. The PTE_ALLOC 0 macro allocates a page table by dequeuing the nextChapter 3. Kernel level implementation^ 38free one. The PTE_F'REE() macro returns a page to the storage area by enqueuing it back onthe free queue.3.7.4 Allocating a translation mapWhen a user task is created, two address maps are allocated to create the tasks code and dataaddress spaces. The virtual memory layer uses the following routine to allocate these maps:int map_alloc( MAP *map, int attrb );The caller passes in a pointer to a preallocated, empty MAP structure and an attributesetting. An empty translation table is allocated using PTE_ALLOC(), and using the address ofthis table along with the attributes, the map area descriptor is created.3.7.5 Freeing a translation mapWhen a user task is destroyed, the virtual memory layer frees its memory, and then calls thisroutine to free the tasks' translation tables. Since there are two translation tables for each usertask, this routine is called twice to completely free a user task:int map_free( MAP *map );The caller passes in a pointer to the MAP structure for the table to be freed. The routinetraverses the translation table structure and uses PTE_FREE() to release each table back to thefree pool.3.7.6 Enabling an address spaceBefore a user level task can execute, its address space must be activated. Enabling an addressspace is simply done by assigning the map area pointer to the 88200 UAPR register. But beforethis can be done, the previous user level TLB entries must be flushed. Otherwise, TLB entriesfrom the previous address space would pollute the translations of the new addresss space.Chapter 3. Kernel level implementation^ 39int map_enable_task(int cpu, MAP *data_map, MAP *code_map);The first parameter specifies the physical processor number to enable the address space on.This cpu number is used to calculate the appropriate 88200 register addresses to assign theprovided map descriptors to (there are eight 88200 units in the system, two for each processor).The proper TLB entries are flushed and the new area pointers are assigned to the UAPRregisters.3.7.7 Mapping pages in an address spaceWhen the virtual memory layer allocates physical memory to an address space, a new entry inthe address space's translation tables is created. The allocation of memory to an address spaceincludes the operation of allocating new memory regions, in addition to sharing pages betweenaddress spaces. The map module provides two routines to allows page mapping on single ormultiple contiguous map_page( MAP *map, void *page, void *logical_addr, int attrb );The map_page 0 routine efficiently maps a single physical page to a logical address in anaddress space. The map parameter specifies the address space to map the page. The pageparameter specifies the physical address of where the page to map is located. The logical_addrparameter specifies the address in the address space to map the page at. attrb specifies theattribute bits for the map_pages( MAP *map, void **addr, void **hint, int num_pages, int attrb );This routine allows multiple contiguous pages to be mapped into an address space. Unlikethe simpler map_page 0 call, this routine also allocates new physical pages from the rawpagemodule. map_pages() is most commonly used by the virtual memory layer to efficiently allocatecontiguous regions of memory for a task. By supplying *addr = NULL, the routine will choosethe next available free memory space and allocate num_pages pages starting there. The hintis supplied to provide a good starting location for the free memory search.Chapter 3. Kernel level implementation^ 403.7.8 Unmapping pages from an address spaceWhen the virtual memory layer removes physical pages from an address space, the associatedtranslation table entries must be removed. For active address spaces, the TLB caches of theassociated cache units must be notified of the changes (otherwise remote TLB caches maycontinue to contain the unmapped translation table entry). Every TLB in the system that ispotentially caching the changed entries must be flushed. This includes remote processors thatare executing the same address space.On many multiprocessor systems, flushing a TLB on a remote processor involves sendingthe remote processor an interrupt and stalling the remote processor until the flush is complete[BRG+88]. This is because some machines do not allow flush operations on TLBs other thantheir own. Fortunately, the 88200 does allow flush operations to be invoked from remoteprocessors. All the kernel needs to do is write an invalidate command to every 88200 unit inthe system that is using the address space. Flushing the whole TLB when only one page tableentry is changed can be wasteful. Fortunately, the 88200 allows system software to specifythe flush granularity on a page basis. Before issuing the flush command, the page entry toinvalidate is unmap_page( MAP *map, void *logical_addr );The urunap_page 0 routine removes the specified logical page from the specified addressspaceint unmap_pages( MAP *map, void *logical_addr, int num_pages, int free );The unmap_pages() routine is the plural form of the above unmap_page() routine. In thiscase, multiple contiguous pages can be removed from an address space, starting at the specifiedlogical address. If the free flag is non-zero, then each logical page is also returned to thephysical memory pool using rawpage_free 0 . This allows large regions of an address space tobe efficiently freed in one call.Chapter 3. Kernel level implementation^ 41If a page table becomes empty after the last logical page is removed from it, that page table isnot returned to the page table pool (normally done by the PTE_FREE() macro). This could leaveseveral page tables consumed for apparently no purpose. Curing this problem would requiresome form of garbage collection in the unmap routines. However, this problem is not as bad asit may seem. Any subsequent memory allocations will likely consume the next available entryin these empty page tables. So in fact, not garbage collecting these tables would allow themto be used directly without having to allocate new ones. Thus, an PTE_FREE0 /PTE_ALLOC 0iteration is saved.3.7.9 Finding free logical addressesBefore the virtual memory layer can allocate memory to an address space, it needs to knowwhere to place the memory in that address space. For example, when a user program callsmalloc (), the user does not specify the location to allocate the memory. Instead, the nextavailable address is chosen by the malloc library. Likewise, user level calls to the virtualmemory allocator do not always specify the address to place the memory at. The virtualmemory layer must decide where to allocate the memory. It uses the following routine to dothis:int map_find_free(MAP *map, void **addr, void **addr_hint, int num_pages);Given a address map map, map_find_free() will traverse the map starting at address*addr. hint, looking for a free contiguous region of pages. The size of the region is specified bynum_pages. The found address is returned in *addr. The *addr_hint address is updated to thenext page beyond this contiguous region. Returns OK if successful, or FAILED if a contiguousfree region could not be found.3.7.10 Sharing and moving translationsAnother common memory operation that the supervisor and user level tasks require is theability to share and move memory between address spaces. Sharing memory allows multipleChapter 3. Kernel level implementation^ 42address spaces to read and write the contents of a single region of memory. The move operationallows contiguous memory regions to be passed from one address space to another, removingthe mapping from the source. The following map_share 0 routine provides the virtual memorylayer support to easily do map_share(MAP *src_map, void *src_addr, MAP *dest_map,void **dest_addr, void **dest_hint, int num_pages,int attrb, int share);The routine takes a source map, a source logical address, and the number of contiguouspages to map into a destination map. If the destination address is not known, then *dest_addr= NULL is passed, and map_find_free() finds a suitable address. Passing the dest_hint addressgives the search algorithm a good place to start. Passing a non-zero value for share retainsthe mapping in the source address space, passing a zero value causes the source region to beunmapped.3.7.11 Map module initializationThe only state maintained by the map module are the translation tables. Storage for thesetables is allocated at initialization time at the top of physical memory. The size of the storagearea is configured at compile time using the PAGE_TABLE_SEGS constant in kernel.h. Thepte_init() routine builds a free list to manage the allocation and deallocation of translationtables from this storage area.void map_init();This routine is called by the virtual memory initialization code. First, it calls pte_init 0to initialize the translation table storage area. Then, it programs the 88200 BATC entries tomap in the kmsg_buf and page table storage areas. These areas will then be available to thekernel when the supervisor address space is enabled.Chapter 3. Kernel level implementation^ 433.8 Virtual memory managementAll address spaces, user and supervisor, are created and managed by the virual memory layer.The virtual memory layer provides the kernel and user level with an interface to allocate,deallocate, and share contiguous regions of memory within an address space. It uses the mapmodule for setting up address translation, and the rawpage module for managing physical pages.The table in Figure 3.7 summarizes the routines exported to user level by this module.Synpopsis Interface availability Descriptionvm_alloc () user application Allocates a region of virtual memory.vm_free() user application Deallocates a region of virtual memory.vm_share() user application Shares a region of memory between tasks.vm_move () user application Moves a region of memory between tasks.vm_map_device () user application Maps in a region of device registers.vm_unmap_device () user application Unmaps a region of device registers.Figure 3.7: Virtual memory system call summary.Figure 3.8 shows a typical memory space for a user level task. A user level memory spacecontains four basic components: an executable code segment; a data segment for global variablesand constants; dynamically allocated heap space; thread stacks; and hardware device mappings.The dark areas denote regions of memory that are mapped. The code and data regions arealways contiguous, they contain the executable image as loaded from a file. The heap areastarts at USER_HEAP_START_ADDR and continues to USER_HEAP_END_ADDR (currently 0x700000and Oxffbf0000 in context .h). This heap area gives user programs approximately 3.9GB ofvirtual address space to work in.Memory regions in the heap space are allocated and deallocated on demand throughoutexecution of the user level program. Thus, as Figure 3.8 illustrates, the heap space can becomefragmented. A sparsely populated memory space can consume many translation tables, soallocation routines try to minimize this fragmentation by allocating space close to neighbours.However, this feature can be easily overridden, in the event that a specific address is required.Chapter 3. Kernel level implementationUser level device drivers use the virtual memory layer to map in hardware device registers.While hardware devices reside at very specific physical addresses, their logical address canappear anywhere within the user's virtual space. Therefore, devices can be dynamically mappedanywhere within the heap storage area.3.8.1 Region descriptorsThe virtual memory module manages memory in objects called regions. A region is a contiguous,page aligned portion of memory in a virtual address space. The following VM_REGION structuredescribes a region:typedef struct{int id;^/* this region descriptor id */int task_id;^/* task that this region belongs to */MAP *map;^/* page table mapping descriptor */void *addr;^/* virtual address of region start */int size;^/* size of region */int attrb;^/* page attributes for region */Chapter 3. Kernel level implementation^ 45/* next/prey link for VM region queue structures */QUEUE_LINK link;} VM_REGION;VM_REGION region_table[REGION_TABLE_SIZE];QUEUE region_free_q;int region_free_q_lock;A region entry contains address mapping and ownership information. The start address,size, attributes, and map specify the mapping information for the region. The link field allowsregion entries to be queued together into lists. For example, the region_free_q keeps track ofall unused region descriptors. The statically allocated region_table maintains a global pool ofregion entries. To allocate a new region entry, the VM_GET_REGION 0 macro dequeues the nextavailable region. Old regions are recycled by enqueuing them back on the free list.3.8.2 Allocating a region of memoryAll user level tasks allocate heap memory through one single system call, vm_alloc 0 . Toallocate memory, a new region entry is dequeued, and its values are initialized. Then, themap_pages() routine performs the physical memory allocation and mapping for the region.Finally, the region entry is enqueued onto the tasks heap list.Each task maintains a list of heap regions that are allocated in its address space. When atask is destroyed, its memory must be returned to the system. The heap list allows each regionto be kept track of, which can be efficiently freed all at vm_alloc(int *region_id, int task_id, void **addr, int size, int attrb );task_id specifies the task to allocate memory into. *addr is set to contain the startingaddress to allocate the region of memory at. However, if the caller passes *addr = NULL, thenthe next available contiguous free region of memory is chosen. This address is then returned tothe caller in *addr. The size parameter specifies the size of region to allocate, in bytes. Allsizes are rounded up to the nearest page.Chapter 3. Kernel level implementation^ 46The attrb parameter specifies the page attributes for the 88200 memory managementunit. These attribute settings are discussed in detail by the 88200 technical manual. Theregisters .h file contains all possible bit settings for this parameter. For example, to allocatea page of memory that can be shared amongst all threads in an address space, use attrb =GLOBAL BIT I VALID BIT.The newly allocated region identifier is returned in *region_id. Further operations on thisregion of memory are specified by supplying this region identifier to the virtual memory systemcalls. For example, to free the region, call vm_free() with the returned region_id.3.8.3 Freeing a region of memory, vm_free()int vm_free(int region_id);To free a region of memory, the vm_free() call is used. The call examines the suppliedregion entry, and uses the umnap_pages() routine to remove the address mapping and free thephysical pages.3.8.4 Sharing memorySharing memory provides a convenient and efficient way for address spaces to communicate.Memory regions can be shared between any number of address vm_share( int src_region_id, int *dest_region_id, int dest_task_id,void **dest_addr, int dest_attrb );The caller supplies a source region identifier in src_region_id, and a memory address*dest_addr in the destination task to map the region at. If the destination address is notknown ahead of time, then passing *dest_addr = NULL will automatically choose the nextavailable free location. The chosen memory address will be returned in *dest_addr. Thedest_attrb parameter specifies the page attributes for the destination address mapping. Forexample, a server may wish to provide a read-only page of memory for clients to examine. Aregion identifier for the newly created region is returned in dest_region_id.Chapter 3. Kernel level implementation^ 473.8.5 Moving memory between tasksA common operation throughout execution of device drivers and client/server interactions isthe movement of data between address spaces. For example, initially a device driver will reada chunk of data from a device, and give it to a server for processing. The server will then sendthe data to clients. Each step could involve the movement of data across address spaces.Sometimes, the amount of data passed between clients and servers is small, and thereforean explicit memory copy operation is the most efficient way to pass memory between addressspaces. But for high speed devices, and especially block-mode devices with large block sizes, amemory mapping technique can eliminate the data copy altogether. The vm_move 0 system callprovides a convenient interface to do this. Memory pages in one address space can be mappedout and mapped in to another address vm_move(int src_region_id, int dest_task_id, void **dest_addr,int dest_attrb);The src_region_id specifies the source region to move. The destination task for the memoryregion is specified by dest_task_id. *dest_addr should be set to contain the starting logicaladdress for the region of to appear in the destination address space. If the destination addressis not known ahead of time, then passing *dest_addr = NULL will automatically choose thenext available free location. The chosen memory address will be returned in *dest_addr. Thedest_attrb parameter specifies the page attributes for the destination address mapping. Theregion identifier src_region_id remains the same in the destination task.3.8.6 Mapping hardware devicesOn the 88100 Hypermodule, access to hardware peripherals and on-board devices is accom-plished through memory mapped registers. These registers appear at well-known memory lo-cations in the physical address space. Many of these physical memory locations are hardwired,such as the 88200 registers, while others such as the VMEbus devices, can be configured viajumper settings or programmable registers.Chapter 3. Kernel level implementation^ 48User level device drivers must know exactly where in the physical address space their deviceregisters reside. Using that information, the vm_map_device() can map in the appropriatephysical memory locations to provide access to these device vm_map_device( int *region_id, void *phys_addr, void **addr, int size );The caller must specify the physical address of the device in phys_addr, and the size of thememory region in bytes (rounded up to PAGE_SIZE boundaries). The caller can also use *addrto specify the desired logical address to map the physical region. If the logical address is notknown ahead of time, then passing *addr = NULL will automatically choose the next availablefree location. The chosen memory address will be returned in *addr. A region identifier forthis newly created region is returned in region_id.The previous memory mapping and allocation routines allowed the caller to specify a pageattribute for each memory page. Device register mappings have more stringent requirements:they must be mapped as cache-inhibited to make sure that accesses truly hit the device, andnot just the cache. vm_map_device() sets all of its page attributes to CACHEINHIBIT_BIT vm_unmap_device( int region_id );The vm_unmap_device() system call allows device register mappings to be removed from thecallers address space. The region identifier region_id specifies the memory region to remove.3.8.7 Virtual memory initializationWhen the kernel boots, initial execution of the system happens within the physical memoryspace, no memory translation is done. When the virtual memory module gets a chance toinitialize, it initializes the lower layers, and then begins to setup the supervisor memory space.During this setup phase, translation is disabled until the whole supervisor translation table isbuilt.Chapter 3. Kernel level implementationThe lower layers of the virtual memory system include the rawpage module and the mapmodule. These modules are simply initialized with the rawpage_init 0 and map_init() calls.To begin building the supervisor memory space, the map_alloc 0 call is used to create thesupervisor code and data address space maps: kernel_code_map and kernel_data_map. Theaddress space is then built using the map_page 0 call. Figure 3.9 shows the layout of a fullyinitialized supervisor memory space.3.9 Task managementA task encapsulates the virtual memory space and processor allocation functions for user levelprograms. The virtual memory space for a task includes program code, data and heap regions,as well as shared regions and memory mapped device registers. The virtual memory modulemanages most of a task's address space requirements: address space allocation and deallocation.All other memory operations, such as page-wise dynamic allocation and mapping, are handleddirectly by the virtual memory module.Chapter 3. Kernel level implementation 50The task. c module implements the bulk of the task management functionality. This moduleworks in cooperation with all of the other main system components, and ties them together toform a basis for user level development: memory management, task scheduling, interrupt andexception handling, and system call handling. The table in Figure 3.10 summarizes the systemcall interface exported to the user level by this module.Synpopsis Interface availability Descriptiontask_create () user application Creates a task.task_destroy () user application Destroys a task.task_suspend () user application Suspends the execution of a task.task_resume () user application Resumes the execution of a task.task_info () user application Returns task state informationtask_signal () user kernel Sends an asynchronous signal message to a task.task_timer_event () user kernel Registers a task for a timer event.task_request_cpu () user kernel Requests a processor for the task.task_relinquish_cpu () user kernel Relinquishes control of the cpu to another task.task_intr_relinquish () user kernel Relinquishes control to an interrupt driver task.task_cleanup () user kernel Cleans up an exiting task.Figure 3.10: task. c module system call summary.This section describes the task management module, and how it interacts with the rest ofthe system. Before describing the details behind task management, we first present an overviewof the task scheduling environment. This discussion is intented to give the reader a global viewof how and why scheduling decisions are made.3.9.1 Task management overviewAll tasks are created using the task_create() system call. Each task in the system is allocatedits own task descriptor from a global descriptor table, task_table 0. The task descriptorcontains all the vital information which describes the task, such as name, entry point, andvirtual memory space. In addition, the task descriptor contains two pieces of informationwhich assist in the task processor allocation: num_cpus, the number of processors currentlyChapter 3. Kernel level implementation^ 51allocated to the task; and num_ready_threads, the number of ready threads in the task waitingfor a processor.When a task receives a processor, the num_cpus field is incremented. When a processoris taken away from a task, num_cpus is decremented. Likewise, when the user level threadscheduler places a thread on its ready queue, num_ready_threads is incremented. When athread is removed from the ready queue, num_ready_threads is decremented.The task scheduling mechanism uses both of these variables to help decide how a taskshould be scheduled. There are seven main sources of control in the system which invoke thetask scheduler to make a scheduling decision:1. Processor requests from user programs When a user level thread scheduler finds itself withready threads on hand (ie., num_ready_threads > 0), then it will request a processor.2. Processor relinquishment from user programs. When a user level thread scheduler runs outof runnable threads (ie., num_ready_threads == 0), then it will relinquish its processor.3. Hardware interrupt handlers. An interrupt may cause an upcall event to be directed at aspecific task, in which case the task must be given a processor.4. Hardware exception handlers. As with interrupt handlers, an exception may occur thatrequires the attention of a specific task 2 .5. When creating a new task. A newly created task begins with num_ready_threads == 1,num_cpus == 0.6. Resuming a suspended task. A suspended task, when resumed with num_ready_threads> 0, is in need of at least one processor.7. After a task is destroyed. When the currently running task is destroyed, another task isscheduled in its place.2 For example, an external pager task would be invoked whenever a task generates a page fault.Chapter 3. Kernel level implementation^ 52Tasks that are in need of a processor, when num_ready_threads > 0, are either placed on thetask ready queue to be given a processor in the future, or are given a processor immediately.The task ready queue is a centralized priority queue with 32 levels. All processors in thesystem share the same queue through a simple three routine interface: enqueue_ready_task(),dequeue_ready_task(), and remove_ready_task().The main scheduling loop within the kernel is in the routine sched(). When invoked,sched() dequeues the next available ready task and runs it. If there are no ready tasks, thenthe processor drops into the idle() loop. The idle loop simply runs a forever-loop, with interruptsenabled. So rather than having idle processors poll system queues looking for work, work isdelivered to idle processors using the software interrupt service (SWI). When work becomesavailable, an interrupt is delivered to the next available idle processor. For example, if a taskis in need of a processor, then one of the idle processors is delivered a TASK_RUN_SWI interrupt.(A global structure, idle_cpus [], keeps track of which processors are idle.)One of the main task scheduler routines that is responsible for delivering work to idleprocessors is the enqueue_ready_task() routine. Whenever a task is found to require anadditional processor, enqueue_ready_task() is called to place the task on the ready queue.However, if an idle processor is available, enqueue_ready_task() will deliver the processor aTASK_RUN_SWI interrupt.3.9.2 The KERNEL_INFO structureThe kernel maintains three important data structures to manage the scheduling of tasks:the KERNEL_INFO structure; the task descriptor, TASK; and the shared region structure,SHARED_REGION. This section focuses on KERNEL_INFO; the following section discusses the lattertwo.The KERNEL_INFO structure, shown below, contains a number of fields that are used through-out the kernel, and at the user level. At boot time, a physical memory page is allocated to storethe structure. The page is mapped into the kernel space with read/write priviledges, and isChapter 3. Kernel level implementation^ 53accessed through the global variable kernel_info. When a task is created, the page is mappedread-only into the tasks address space. All user level code has read access to the information,but it cannot be overwritten.typedef struct{int simulator;^/* nonzero when executing under the 88k simulator */unsigned long timer_ticks;^/* clock ticks since boot time */char idle_cpus[NUM_CPUS];^/* indicates which cpus are idle */unsigned long idle_time[NUM_CPUS];^/* idle time counters for each CPU */unsigned long kernel_time[NUM_CPUS]; /* average time in kernel recently */unsigned long user_time[NUM_CPUS];^/* average time in user space */int page_table_entries;int physical_pages_free;int num_free_regions;int num_tasks;int num_ready_tasks;/* number of free pte entries *//* usable free memory *//* free VM regions *//* tasks created in system *//* number of ready tasks (note: *//* protected by ready_q_lock) *//* stores which cpus tasks are running on. *//* given a task id, it's easy to find which cpus it's running on */char run_cpus[TASK_TABLE_SIZE][NUM_CPUS];int lock;KERNEL_INFO;KERNEL_INFO *kernel_info;^/* allocated and mapped at boot time */The structure contains many miscellaneous fields, and a few important ones. simulator isset at boot time to signify whether the system is running under the g88 simulator, or on the realhardware. Sometimes, as when dealing with devices, it is important to know this difference.timer_ticks is a counter that is incremented at each system clock tick. This can be used togive programs a notion of time.Many of the fields in this structure are used for resource usage accounting. A user levelUnix-style ps command could display this information, or log it to a file. The idle_time[],Chapter 3. Kernel level implementation^ 54kernel_time D , and user_time D fields are used to record system activity for each processor.The next five fields are used to count other system resources, such as free memory and thenumber of tasks running in the system.The idle_cpus [] field is used to record the idle state of each processor in the system. Whena processor enters its idle loop, it sets the associated entry in the idle_cpus [] array. Anotherprocessor in the system can read this field and immediately know which processors are idleand are eligible for work. For example, if a new thread is created at the user level, the threadscheduler can read idle_cpus [] and quickly spawn the thread to a remote idle processor bydelivering a software interrupt message to the idle processor (via intr_remote_cpu().The run_cpus ^ [] field records the running status of each task in the system. When aprocessor is given to a task, the task's run_cpus entry is set for that processor. This allows thescheduling code to quickly find out which processors are running a specific task. This can beused when destroying a task, for example. The task_destroy() code consults the run_cpusentry for the task, and broadcasts an interrupt to the processors running that task. As anotherexample, the IPC mechanism can easily target its messages to processors that are running thedestination task.3.9.3 Task descriptorsThe task descriptor structure, type TASK defined in kernel .h, contains most of the bookkeepingdata that comprises a task. The structure is divided into several main components, each ofwhich is maintained and shared between the various kernel modules. The descriptors are privateto the kernel memory space; it is not directly shared by user level programs.typedef struct{int^id;^/* 0 to TASK_TABLE_SIZE-1 */int lock; /* spin lock for this descriptor */void^*stack_page;^/* kernel address of user's upcall stack*/int^interrupts;^/* bit field of enabled user interrupts */Chapter 3. Kernel level implementation^ 55int^exceptions;int pending_intr;TASK_INFO info;TASK_VM^vm;SHARED_REGION *sr;QUEUE_LINK sched_link;QUEUE_LINK timer_link;unsigned int wake_time;} TASK;/* bit field of enabled user exceptions *//* remembers pending interrupt vector *//* task statistics and other info *//* task virtual memory spaces *//* user/kernel shared region pointer^*//* scheduler queue *//* timed event queue *//* time to wake this task (task_timer) */The following paragraphs and subsections describe each component of this structure. Afterstarting with the smaller miscellaneous fields, the larger component structures are described.The id field identifies the task descriptor. This number is the index value of the descriptorin the task_table [] array. It is initialized at boot time, and remains constant throughout thelife of the system.The lock field is the spin lock used to provide mutual exclusion in routines that managethe task descriptor. Normally this lock is acquired before any system call or scheduling actionis performed on a task. However, there are cases where locking is not required. These cases aredefined by their specific purposes.The stack_page field points to the physical page of memory that is allocated for the tasksupcall stack. When a task is created, a user level upcall stack is allocated and mapped into theuser level memory space at location TASK_STACK_ADDR. When the task is destroyed, the upcallstack is returned to the physical memory pool.The interrupts and exceptions fields are maintained by the kernel interrupt and excep-tion dispatchers. interrupts and exceptions are bit fields which describe the interrupt andexception handling capability for the task. For example, when a task registers an interrupthandler in its address space, the corresponding bit is set in the interrupt field. The kernel in-terrupt dispatcher can use this information for deciding what to do when a particular interruptarrives.Chapter 3. Kernel level implementation^ 56The interrupt dispatcher uses intr_pending to remember a pending interrupt. When ahardware device generates an interrupt, the kernel interrupt dispatcher looks up in its tablewhere to locate the handler for the interrupt. If the handler resides in a task that is currentlynot active, it must be switched in. The intr_pending field is set in the old task to rememberthe pending interrupt vector. When the old task is switched out and the interrupt handler taskis switched in, the vector is examined before upcalling to the new task.The sched_link field is the main scheduler queue link for the task. This link is used toplace the task on the task descriptor free queue and ready queue. The timer_link field is usedby the task event timer system. A task enqueues itself on the task_timer_q using this link,and will be issued a timer upcall event when the wake_time interval expires.TASK_INFO structureThe first main component of the task descriptor, info, contains general information about thetask such as scheduling state, priority, an ASCII string name and resource consumptionstatistics. The reason why these data fields are grouped in their own structure is so that theycan easily be marshalled to user level programs. A special system call, task_inf o 0 , allows userlevel programs to query this structure for information collecting. For example, the consoleprogram implements a Unix-style ps command for displaying task scheduling and resourceusage. Currently, the resources kept track of are memory usage, but this could be expanded toinclude other interesting information such as interrupt and context switching rates.typedef struct{int state;^ /* task scheduling state */int priority; /* 0 to NUM_READYQ_PRIORITIES-1 */char name[TASK_NAME_SIZE]; /* ASCII string name for executable *//* some memory statistics */int code_size; /* size of the program code (bytes) */int data_size; /* size of the program data (bytes) */int bss_size; /* size of the program bss (bytes) */int virtual_size; /* pages of memory mapped by this task */} TASK_INFO;Chapter 3. Kernel level implementation^ 57TASK_VM structureThe next component, vm of type TASK_VM, maintains information about the tasks' address space.This structure is managed by the memory management system. The code and data addressmap descriptors, and allocated virtual memory region descriptors are stored here.typedef struct{MAP code_map;MAP data_map;int code_region_id;int data_region_id;QUEUE device_regions;QUEUE heap_regions;int heap_hint_addr;} TASK_VM;map descriptor for code map */map descriptor for data map */region id for code memory */region id for data memory */region list for mapped devices */region list for allocated memory */hint address for next heap allocation */The map descriptors code_map and data are used by the map module to perform addresstranslations for the task. When a task is created, the maps are allocated, and region descriptorsare allocated for the executable code and data areas. When a task is scheduled to run, the mapdescriptors are passed to the address map activation routine.The two region queues, device_regions and heap_regions, are used to remember regiondescriptors that are dynamically allocated by the user throughout the life of the task. Thisprovides an easy way to free all of a tasks memory when it is destroyed: the region lists can beefficiently traversed and the descriptors freed. The heap_hint_addr field is used to rememberto next free location in the tasks virtual address space for a heap or device memory allocationto occur. The hint helps reduce the amount of searching through the task translation tablestructures to find free contiguous regions.SHARED REGION structureThe sr field is a pointer to the tasks' user/kernel shared memory region. The shared regionstructure, SHARED REGION, contains information that is commonly accessed by both the userChapter 3. Kernel level implementation^ 58and supervisor address spaces for scheduling decisions and other operations. This helps reducecommunication costs between the user and kernel spaces.When a task is created, a physical memory page is allocated for the shared region structureusing rawpage_alloc(). The page is mapped read/write into both the user and supervisoraddress spaces. The memory location of the page in the supervisor address space is rememberedby the sr field. In the user address space, the page is mapped at the well known address,USER_SHARED_REGION_ADDR. The user level kernel code can use this constant to refer to thestructure.typedef struct{/* parameter passing area for system calls */int syscall_parms[NUM_CPUS][SYSCALL_PARMS_SIZE/4];int num_cpus;int num_ready_threads;void *upcall_addr;int upcall_status[NUM_CPUS];int intr_status[NUM_CPUS];int excp_status[NUM_CPUS];/* number of cpus assigned to this task *//* number of ready threads in task *//* user upcall dispatch address *//* upcall dispatcher events *//* interrupt dispatcher events *//* exception dispatcher events */int num_signals;^/* number of signals in list */int signal_head; /* signal list head index */int signal_tail; /* signal list tail index */int signal_lock; /* signal list spin lock */SIGNAL signals[NUM_ SIGNAL_ENTRIES];/* global semaphore links */GLOBAL_SEM_LINK sem_links[NUM_GLOBAL_SEMS];int argc;^ /* number of command line arguments */char *argv[NUM_ARGV_VECTORS]; /* array of argv vectors */char argv_buf[ARGV_BUF_SIZE]; /* buffer for storing parameters */} SHARED_REGION;The syscall_parms field provides a buffer for passing out-of-line parameters through systemcalls. For example, when a user calls kprint 0 , the string parameter is copied into the systemChapter 3. Kernel level implementation^ 59call buffer. The kernel kprint 0 routine can then access the string parameter from the buffer.Ideally, the kernel should be able to access the user's parameters directly. However, most of thekernel system calls are implemented in C, a language which does not easily permit referencesto remote address spaces. So kernel system calls which pass parameters by reference must usethis indirection. Each processor running a task uses its own parameter buffer, so that systemcalls can safely execute in parallel. Thus, the syscall_parms field allocates enough buffers forevery processor in the system.The num_ready_threads field is maintained by the user level threading kernel to countthe number of runnable threads available. num_cpus is maintained by the kernel to count thenumber of processors currently allocated to the task. These fields are used by both the threadand task schedulers as the basis for processor allocation decisions.The upcall_addr field is used by the kernel upcall dispatcher as the address to call in theuser space for upcall events. Initially, when a task is created, this value is set to contain theentry point for the executable program. This entry point corresponds to the user level kernelinitialization code. When the task runs and initializes itself, it can easily change this value toany address within its address space, such as the user level upcall dispatcher.The upcall_status, intr_status, and excp_status are per-processor fields that are usedby the interrupt, exception, and task upcall dispatching mechanism to signal various events tothe user kernel. Each field maintains a bit field of possible flags, each flag is set to specify acertain event, or combination of events.The set of signal fields belong to the asynchronous task signalling mechanism. Any taskcan send a simple signal message to another task using the task_signal() system call interface.These fields are used to implement a FIFO queue for a fixed number of signals destined for atask.The sem_links field stores an array of queue links for the global semaphore service. Eachlink in the array corresponds to a global semaphore entry. When a user thread blocks on aglobal semaphore, the thread's task is queued on the global semaphore's queue. When thetask can execute nowtimer tick upcalltask relinquish requesttimed event has expiredtask signal is pendingtask destroy upcallChapter 3. Kernel level implementation^ 60semaphore is signalled, the next task in the queue is dequeued and issued a signal.The arg fields are used to supply Unix-style command line argument information for userprograms. When a task is created, one of the parameters to task_create 0 is a user argumentstructure. task_create 0 copies the user arguments into these fields. When the user levelinitialization routine executes, the argv vector can be built and passed to the user's main()thread.3.9.4 Task descriptor tableFollowing the general policy of static allocation where possible, the task_table ^ array storesall of the system task descriptors. The size of the table is defined at compile time. The tableis declared in the following fasion in task. c:TASK task_table[TASK_TABLE_SIZE];QUEUE task_free_q; /* free descriptor queue */int task_free_q_lock; /* spin lock for free queue */The table and task_free_q are initialized at boot time. Each of the task descriptors arelinked together into a free queue. The free queue allows for 0(1) allocation and deallocation oftask descriptors.3.9.5 Upcalling to a taskA task is given a processor for a number of reasons. Each reason is termed an upcall event,and there are eight different upcall events defined in upcall . h. These upcall events are usedin various places throughout the kernel:#define RUN_UPCALL 0 /*#define TICK_UPCALL 1 /*#define RELINQUISH_UPCALL 2 /*#define TIMER_UPCALL 3 /*#define SIGNAL_UPCALL 4 /*#define KILL_UPCALL 5 /*Chapter 3. Kernel level implementation^ 61#define EXCP_UPCALL^6^/* a user exception occurred */#define INTR_UPCALL 7 /* a user interrupt occurred */Each of these upcall events correspond to a bit in the upcall_status [] field in the task'sshared region. These events are described in greater detail in Section 4.1. When an upcallis issued to a task, the particular event is bit-wise or'ed into the upcall_status [] entry forthe processor. Previous bits in the entry are therefore retained, and can be accumulated ifmany upcall events are directed to a task before the task actually gets a chance to execute.At the user level, the upcall dispatcher reads this bit-field and executes the appropriate upcallhandlers, clearing the event bits as the handlers complete.For example, when sched() finds a ready task to run, it issues a RUN_UPCALL to the task inthe following manner:curr_task->sr->upcall_status[cpu] l= 1 << RUN_UPCALL;upcall_now(curr_task->sr->upcall_addr);^/* no return */The upcall_now() routine, from excp_dispatch.s, is an assembler code routine that per-forms the upcall action into user space. Given the user level destination address, upcall_now()sets up the processor context, and executes a "return from exception" rte instruction.Before a task can be issued an upcall, its memory space must be activated on the localprocessor. Also, if the task resides on the ready queue, it may have to be dequeued. Thetask_upcall() routine performs all of the bookkeeping chores to enable a task and upcallinto its address space. This routine is used in various places in the kernel to facilitate upcalldispatching.void task_upcall(int cpu, TASK *task, int upcall_vec);The task to activate is specified by task, and the upcall event is specified in upcall_vec.The task will be removed from the ready queue if num_ready_threads <= 1, and its state set toTASK RUNNING. A task is not removed from the ready queue if num_ready_threads > 1, whichmeans that the task is eligible for more processors than the one it is currently receiving.Chapter 3. Kernel level implementation^ 62The task's num_cpus field is incremented and its run_cpus 0 [] entry is set to indicate whichprocessor the task is running on. Finally, the map_enable_task() call activates the code anddata memory space, and upcall_n.ow() is invoked to pass execution to the user level.3.9.6 Creating a taskThe task_create() system call is used to create a task. First, a free task descriptor is dequeuedfrom the taskiree_q. Then, virtual memory call vm_task_alloc() is used to allocate thetask's address space and create the shared region storage area. The shared region fields arethen initialized. The syscall_parms buffer is used to pass the task's id and the region identifierfor the nameserver to the user initialization routine. The task is either placed on the readyqueue to be run, or left in the suspended state to be later task_create(int *task_id, int priority, int ready, TASK_EXEC_HDR *hdr,int code_region, int data_region, TASK_ARGS *args);The priority parameter specifies the task's run priority level, an integer between 0 and31. Priority 0 is low priority, priority 31 is high priority. A high priority task always receivesprocessors before a low priority task. If ready is passed a nonzero value, the new task willbe immediately readied and executed on an idle processor, or placed on the ready queue if allprocessors are busy. If ready is zero, the task will be placed in the TASK_SUSPENDED state, tobe later resumed by task_resume(). task_id returns the created task's descriptor identifier.task_create() returns OK if task creation was successful, or FAILED if an error occurs.The hdr parameter passes the executable header information. This structure contains theexecutables code, data, and bss segment addresses and sizes. It also includes the executableentry point. The caller is responsible for querying the filesystem for this information.The code_region and data_region parameters pass the executable code and data regions.Before the user calls task_create 0, the executable code and data must be read from the filesys-tern into two regions. These regions, plus the hdr information, are used by the vm_task_alloc 0routine to allocate the memory space for executable. In vm_task_alloc(), once the addressChapter 3. Kernel level implementation^ 63maps are created, vm_share () is used to map the code and data regions to the tasks addressspace. The caller to task_create 0 can then free the code and data regions from their addressspace, or use them for additional calls to task_create().The args parameter points to an argument structure of the following format. The usermust properly initialize this structure before calling task_create 0:typedef struct{int nameserver_id;int tty_id;int argc;int argv_len;char argv_buf[ARGV_BUF_SIZE];} TASK_ARGS;region id of nameserver */port id of the tty connection */user argument count */length in bytes of argv_buf */user parameters */The nameserver_id field is used to pass the nameserver region identifier to the newlycreated task. When the task initializes, it can use this identifier in vm_share() to map in thenameserver. The tty_id is used by the user level tty_driver. c server to pass on the ttyconnection identifier. When the tty server creates a task, a stdin/stdout port connection iscreated for the task to communicate with the server. When the task initializes, it uses thetty_id port to communicate with the server.The argc field is set to specify the number of command line parameters. argv_len is set tospecify the length in bytes of the parameters in the argv_buf field. argv_buf is built to containa contiguous list of null-terminated parameter strings. When the task gets a chance to run,the user level initialization routine uses this information to construct an appropriate Unix-styleargv list to pass to the user's main().3.9.7 Destroying a taskint task_destroy(int task_id);Whether a task simply finishes its own execution, or whether a remote task must be killed,task_destroy() does the job. task_destroy() is responsible for locating the specified task,killing it, and returning its resources back to the system.Chapter 3. Kernel level implementation^ 64On a multiprocessor system, the effort required to kill a task is greater than on a unipro-cessor system. The additional difficulty arises when the task to destroy is executing on remoteprocessors. In this case, the remote processors must be interrupted and notified that the taskthey are executing will be destroyed:/* check if there are other cpus running the task */if ( dead_task->sr->num_cpus > 0 ){/* interrupt each cpu currently running this task */intr_remote_cpus(kernel_info->run_cpus[task_id], TASK_DESTROY_SWI,task_id);The remote interruption facility is provided by the intr.remote_cpus() call ininterrupt . c, which in this case, delivers a TASK_DESTROY_SWI software interrupt to all proces-sors executing the specified task. The SWI interrupt handler on the remote processors recognizethis interrupt and discards the dying task.However, stopping execution and cleaning up the task's memory space and kernel datastructures is not enough to completely remove a task from the system. There are also datastructures at the user level which must be cleaned up. For example, if a task has created andis the owner of user level IPC ports, the ports must also be properly terminated. Otherwise,the user level port data structures will be left with stale entries in their descriptor tables. Asanother example, suppose a task has established a communication channel with a server. Thetask must be given the chance to gracefully sever the connection with its server.To offer a graceful shutdown for tasks which are to be destroyed, before the kernel frees thetasks resources, a KILL_UPCALL is delivered to the task. The KILL_UPCALL handler executes atthe user level, and performs any cleanup that is necessary before returning to the kernel to beofficially destroyed. While this is taking place, the task's state is set to TASK DYING, indicatingthat the task should not be scheduled by remote processors. Figure 3.11 demonstrates thissequence of events.Chapter 3. Kernel level implementationWhen the KILL_UPCALL handler has finished cleaning up the user level state, the system calltask_cleanup() is used to return control to the kernel. task_cleanup() frees all of the task'smemory, returns its descriptor to the free list, and jumps into the scheduler loop sched() tocontinue processing other tasks.If a task is voluntarily killing itself, then task_destroy() assumes that the user level codehas already cleaned up its data structures, and therefore dispatching a KILL_UPCALL is notnecessary.3.9.8 Suspending a taskThe act of suspending a task in a multiprocessor system has the same problem as destroyinga task: the task may be executing on a remote processor. As with destroying a task, theintr_remote_cpus() call can deliver a TASK_SUSPEND_SWI interrupt to the appropriate proces-sors. Upon receiving the interrupt, the SWI handler invokes a RELINQUISH_UPCALL event intothe task. At the user level, the RELINQUISH_UPCALL handler places the currently executingthread back on the thread ready queue, and relinquishes control of the processor back to thekernel, where another task is scheduled in its place.Chapter 3. Kernel level implementation^ 66int task_suspend(int task_id);The task_suspend 0 call suspends execution of the specified task. If the task is currentlyexecuting on a remote processor, the processor will be interrupted and forced to scheduleanother task. The tasks state is set to TASK_SUSPENDED. Returns OK if successful, or FAILED ifthe specified task is invalid.3.9.9 Resuming a taskint task_resume(int task_id);The task_resume 0 call resumes execution of a previously suspended task, specified bytask_id. If an idle processor is available, a TASK_RUN_SWI interrupt will be delivered to it, andthe resumed task will be executed there. If all processors are busy, then the task is placed onthe task ready queue where it will be scheduled sometime in the future. Returns OK if successful,or FAILED if the specified task is invalid.3.9.10 Task statesThe task descriptor state field, info. state, maintains the scheduling state of a task. Thereare six possible scheduling states, defined in kernel.h:/* possible states for a task descriptor */#define TASK_FREE 0#define TASK_SUSPENDED 1#define TASK_READY 2#define TASK_RUNNING 3#define TASK_IDLE 4#define TASK_DYING 5Figure 3.12 summarizes the possible transitions between each state. The use of each stateis described as follows:• TASK_FREE: Is the initial state for all tasks. The task descriptor is queued on the freequeue, task_free_q, using the tasks sched_link.Chapter 3. Kernel level implementation• TASK_READY: The task currently resides on the ready queue. The task also may be exe-cuting on at least one processor.• TASK_RUNNING: The task is executing on at least one processor.• TASK_IDLE: The task is not executing anywhere, and does not require any processors.• TASK_SUSPENDED: The task has been suspended. It will receive no processors.• TASK_DYING: The task is being removed from the system. The state will progress toTASK_F'REE when the tasks resources have been properly cleaned up.3.9.11 Task ready queueTasks that require a processor are placed on the task ready queue. The file readyq. c pro-vides a simple interface to a globally shared round-robin priority queue with 32 levels. Allaccesses to the task ready queue are done through three routines: enqueue_ready_task(),Chapter 3. Kernel level implementationdequeue_ready_t ask 0 , and remove_ready_task 0. As long as this interface remains the same,the implementation of the ready queue can change.The ready queue enforces priority levels between 0 and 31. Priority level 0 is low, level 31is high. A higher priority task will receive a processor before a lower priority task.Enqueue and dequeue operations are 0(1) in the most frequent case. This efficiency isachieved by the use of 32 independent FIFO queues, rather than a single sorted priority list.Figure 3.13 displays a typical ready queue layout. Using a sorted priority list would guaranteean 0(1) dequeue operation because the first element on the list is always the highest prioritytask. However, the enqueue and remove operation requires a sort operation to rebuild thelist, which for a heapsort structure, requires 0(n log(n)) access time (not including extra codecomplexity overhead).QUEUE readyq_table[NUM_READYQ_PRIORITIES];int hint;int ready_q_lock;The readyq_table maintains a statically allocated array of FIFO queues. Each queue isused to store task descriptors of a single priority level. A single spin lock, ready_q_lock, is usedto protect the queue from concurrent accesses. The hint integer is used to cache the highestpriority non-empty FIFO queue, and is used to speed up the dequeue operation.Chapter 3. Kernel level implementation^ 69TASK *dequeue_ready_task();The dequeue_ready_task() operation dequeues the next available ready task and returnsits descriptor. The routine consults the hint variable to find the highest level priority with anon-empty queue. If the hint level queue is empty, a linear search counts down the hint untilthe next non-empty queue is reached. The task is dequeued and returned to the caller.void enqueue_ready_task(TASK *task);The enqueue_ready_task() operation tries to find an idle processor to execute the specifiedtask. If all processors are busy, then the task descriptor's state is set to TASK_READY and is placedon the end of the appropriate queue. If the task's priority is higher than the hint variable,then the hint is updated to this priority value.void remove_ready_task(TASK *task);A task that resides in the ready queue structure may need to be removed for reasons otherthan a processor becoming available to run the task. Sometimes a task must be removed out oforder, from the middle of the queue. For example, a task that is being destroyed or suspendedmust be removed from the ready queue. The task descriptor to remove from the ready queueis specified by the task parameter.3.9.12 Task signallingThe task signalling mechanism provides a primitive building block for the construction of userlevel communication services. A signal is a simple non-blocking one-way message deliveredfrom one task to another. The task_signal() system call performs two operations: wakes upa destination task, and delivers it a message. The following structure, from shared_region.h,defines a signal message:Chapter 3. Kernel level implementation^ 70typedef struct{int signal;int data;} SIGNAL;/* signal message type *//* message type-defined data */Any task in the system can be the recipient of multple signal messages from multiple sources.To ensure the non-blocking, reliable delivery of a signal message, a FIFO queue of signal de-scriptors is allocated to buffer messages. The queue is stored as signals 0 , in the shared regionstructure of each task descriptor. Signals are enqueued by the kernel, and dequeued by the userlevel signal dispatcher as a result of a SIGNAL_UPCALL.In the event that many senders bombard a single task with signals, the signal queue ofthe destination task may become filled. Further signal messages will be discarded. Thus thedelivery of signals is not totally reliable. However, reliability can be ensured for a bounded setof interactions given a reasonable queue size. The signal queue fits within the task's sharedregion page, where it has room for about 400 entries, at 8 bytes each.Each task contains a set of signal handlers as part of its user level upcall dispatch tree(c.f. Figure 4.20). One signal handler routine is used for each signal message type. Originally,the signal mechanism was designed to allow handlers to be added and removed dynamically,allowing user programs to add their own signal handlers at run-time. However, a static schemewas implemented for simplicity. The system currently supports six different signal messagetypes:#define GLOBAL_SEM_SIGNAL_SIGNAL#define GLOBAL_SEM_DESTROY_SIGNAL#define GRPC_PORT_SEND_SIGNAL#define GRPC_PORT_REPLY_SIGNAL#define GRPC_PORT_DESTROY_SIGNAL#define GLOBAL_PORT_SEND_SIGNAL0 /* global semaphore signal */1 /* global semaphore destroy */2 /* global rpc send handler */3 /* global rpc reply handler */4 /* global rpc destroy handler */5 /* global port send handler */These particular signals are used by the user level IPC and semaphore library to synchronizetheir cross-address space communication events. For example, when a client invokes an RPCChapter 3. Kernel level implementation^ 71message to a server, the server must be informed of the waiting message. In this case, the clientRPC library would issue a GRPC_PORT_SEND_SIGNAL, specifying the port identifier in the signal'sdata field.Delivery of a signalThe user level can deliver a signal to a task by two means. The user level kernel determineswhich of these two methods is most appropriate:1. Issue a TASK_SIGNAL_SWI interrupt to a processor:intr_remote_cpu(cpu_list, TASK_SIGNAL_SWI, task_id,GRPC_PORT_SEND_SIGNAL, port_id);If the destination task is executing on a remote processor, or if there is an idle processor,then deliver an interrupt to that processor. The SWI_service() handler on the remoteprocessor will queue the signal and dispatch an upcall to the destination task. It is possiblethat the remote processor can become busy before the interrupt is actually sent. In thiscase, the SWI_service() handler will properly dispatch the signal to the destination taskand return to the existing work.2. Issue a task_signal() system call:task_signal(task_id, GRPC_PORT_SEND_SIGNAL, port_id);If the destination task is not executing anywhere, and there are no idle processors, thenthe kernel level system call task_signal() must be invoked on the local processor tomanually queue the signal and wake the destination task. The destination task is wokenby placing it on the ready queue. When the task eventually acquires a processor, itsupcall dispatcher is executed.In either case, when the destination task obtains a processor, the user level upcall dispatchernotices that the num_signals field is nonzero, and begins to process the signals. After eachChapter 3. Kernel level implementation^ 72signal entry is dequeued by the dispatcher, the signal field is used to execute the appropriatehandler. The signal handler can then perform the desired operation using the supplied datafield. The user level IPC and semaphore libraries contain handlers for each of the above signaltypes.3.9.13 Task timer serviceThe timer service allows tasks to register for a timed event. User level schedulers can use thisservice to implement a sleep facility. A list of tasks, timer_q, sorted by their wakeup time, ismaintained by the task_timer.c module. This list is shared with the system clock interruptservice routine, DTI_service(), to ready tasks with expired wakeup values. The timer_q_lockprotects against concurrent accesses to the list. When a timed event expires, the associatedtask is readied and delivered a TIMER_UPCALL event.QUEUE timer_q; /* list of tasks waiting for a timer event */int timer_q_lock;int task_timer_event( unsigned long wake_time );task_timer_event() is called by user level thread schedulers to register for a timed event.The wake_time field specifies the future wakeup time. By using the current system clock time,provided by kernel_info ->timer_ticks, the future wakeup time can be calculated by addingthe desired number of clock ticks. The clock tick interrupt rate is determined at boot time, andcan be configured to any value within the hardware limits: 8.6 microseconds to 563 milliseconds.Only one timed event is registered per task. Other timed events are queued by the userlevel sleep service. Registering an earlier timed event will reset the previous setting to use thenearer value.The task descriptor field timer_link is used to link the task into the timer_q. A linearsearch through the timer_q is done to locate the proper waketime slot.Chapter 3. Kernel level implementation^ 733.10 Kernel management for global semaphoresThe task signalling mechanism provides a way for tasks to send low level event messages toeach other. This service can be used to notify remote tasks that an event has occurred. Globalsemaphores, on the other hand, provide a way for threads in a task to wait for a remote eventto occur. This service can be used as building block for higher level communication services.Synpopsis Interface availability Descriptionkernel_sem_enqueue () user kernel Enqueues the caller task onto the semaphore.kernel_sem_dequeue () user kernel Dequeues the next task from the semaphore.kernel_sem_destroy ( ) user kernel Dequeues all tasks from the semaphore.Figure 3.14: Kernel level global semaphore support routines.For example, a client thread wishing to send a message to a server may have to wait fora free port buffer to become available. The client thread performs a sem_wait() on the portqueue, and when the server releases a buffer, it issues a sem_signal 0. The client thread isthen unblocked to dequeue a free port buffer and send its message.While the user level global semaphore library implements the call sem_wait () and the callsem_signal 0 , some kernel level assistance is required to preserve the ordering of sem_wait()calls. Threads must be unblocked by sem_signal 0 in the same order that they were blockedby sem_wait(). Threads from multiple tasks may be blocked on the same semaphore. Sosem_signal() must decide which task should be delivered the signal message. To do this,the kernel implements a task queueing service, consisting of three system calls summarized inFigure 3.14.These calls manage a set of task queues, one queue for each global semaphore. The queuesare stored in the task_sem_q 0 table, declared in kernel_sem . c:QUEUE task_sem_q [NUM_GLOBAL_SEMS] ;When a thread blocks on a global semaphore, the user semaphore library routinecalls kernel_sem_enqueue() to queue the threads's task descriptor onto the appropriateChapter 3. Kernel level implementation^ 74task_sem_q [1 entry. When the semaphore is signalled, kernel_sem_dequeue 0 is invoked,which dequeues the task descriptor that contains the blocked thread. The task is then de-livered a GLOBAL_SEM_SIGNAL_SIGNAL message using task_signal () . The task will then bescheduled and its upcall dispatcher will unblock the thread.In the event that a global semaphore holding blocked threads is destroyed, thekernel_sem_destroy 0 call is invoked to traverse the appropriate task_sem_q 0 list, and sendGLOBAL_SEMJESTROY_SIGNAL signals to each task. Each task will be scheduled and allowed tounblock its threads.Threads in a task can be blocked on multiple semaphores, so a task descriptor may bequeued on several task_sem_q[] queues at once. To allow this, each task descriptor containsa list of global semaphore links, sem_links D , in its shared region structure. A task is queuedon the same semaphore at most once, so the number of links required equals the number ofsemaphores. The GLOBAL_SEM_LINK is defined as follows in shared_region . h:typedef struct{signed int task_id:16;^/* this link's task identifier */signed int num_threads_waiting:16;^/* num threads in this task */QUEUE_LINK link;^ /* semaphore queue link */} GLOBAL_SEM_LINK;The task_id field identifies the task descriptor that belongs to the link.^Whenkernel_sem_dequeue() removes a task from a semaphore queue, the task_id field is usedto identify which task contains the blocked thread.The num_threads_waiting field counts the number of threads in the task that areblocked on the semaphore. This count is used by the user level semaphore library and thekernel_sem_dequeue () routine to decide when a link should be enqueued or dequeued.Chapter 3. Kernel level implementation^ 753.11 Interrupt managementThis section describes the kernel level interrupt handling and dispatching mechanism. Someinterrupts are handled directly by the kernel, such as the system clock tick. Other interrupts,such as VMEbus interrupts, can be dynamically registered by tasks to call a user level routine.A table of interrupt handler routines is maintained to keep track of user level and kernel levelhandlers.The interrupt .c module implements the interrupt dispatcher and user level system callsuse to register interrupt handlers. The table in Figure 3.15 summarizes the system call interfaceexported to the user level by this module.Synpopsis Interface availability Descriptionintr_register_user () user kernel Registers the task and enables the interrupt.intr_deregister_user () user kernel Disables the interrupt bit and removes handler.Figure 3.15: Interrupt handler registration system calls.All interrupts on the 88100 are funnelled through one single exception vector, INT. Fromthere, the low level interrupt handler properly saves the execution context and branches to theinterrupt dispatcher. The interrupt dispatcher finds the recipient of the interrupt, and eitherupcalls into user space to handle it, or calls a local kernel function.3.11.1 User level preemptionInterrupts preempt user level threads and cause scheduling events to occur. For example, ifa timer interrupt occurs, a TICK_UPCALL event is sent to the user level to indicate that it'stime to schedule the next ready thread. In a multiprocessor environment, special care must betaken to ensure that the rescheduling of threads in the presence of spin-locks does not adverslyaffect performance. A naive approach would allow interrupts to occur at any point duringuser level execution. This can result in very poor performance if threads are using spin-locksfor concurrency protection. If a thread is preempted while holding a spin lock, then all otherChapter 3. Kernel level implementation^ 76threads that try to access the lock must wait until the original thread is rescheduled and releasesits lock. The original thread may not be rescheduled for some time, causing all other threads towaste CPU time, uselessly spinning. Since spin locks are used throughout the user level kernelto protect data structure accesses, this problem must be solved.The solution implemented in the Raven kernel involves close participation between theuser level and kernel. Whenever the user level acquires a spin lock, a lock_count variableis incremented at the user level. Whenever an interrupt occurs that would cause an upcallevent into user space, the interrupt handler checks the lock_count variable. A non-zero valueindicates that a critical section is currently being executed, and control must be returned tothe user. But before the interrupt handler returns control to the critical section, it sets theupcall_pending variable, to indicate that an interrupt occurred. When the user level regainscontrol and finishes its critical section, the upcall_pending variable is checked, and if set, thethread will save its context and branch to the upcall dispatcher. The upcall dispatcher willperform the deferred event(s).As implemented, the two variables, lock_count and upcall_pending are not stored asconventional variables at all. Rather, each of them share the processor's r28 register, as shownin Figure 3.16. This is done to ensure that access to the variables is atomic. For example,to increment lock_count, a single addu r28, r28, 1 instruction is performed. If lock_countwas a conventional global variable, then incrementing it would require the use of a spin lock —which cannot be done, since lock_count itself is used within the locking routines.Chapter 3. Kernel level implementation^ 77A third variable is contained within the r28 register: in_upcall, bit 16. This bit is set toprevent preemption while the user level is processing an upcall event. The interrupt dispatcherchecks this bit whenever an upcall is attempted. If the bit is set, then an upcall is already isprogress. Upcalls cannot be nested, so the user's context is restored, and control returns back tothe point where the interrupt occurred, and upcall processing continues. Any upcall_status []bits that were set as a result of the interrupt will be properly dispatched.3.11.2 Handling interruptsDue to the asynchronous nature of instruction execution on the 88100, low level interrupt han-dling is a fairly complex operation. Certain situations arise where instructions in the dataunit and floating point pipeline are not properly completed. The interrupt handler must com-plete these instructions. It is possible that some of the instructions may cause an exceptionfault to occur, so the interrupt handler must be able to detect this and handle the exceptions.Faulted instructions in the data unit pipeline must be decoded and completed by simulatingthe instructions in software.When a device issues an interrupt, and the processor interrupt disable bit (IND) is not set,the processor stops fetching further instructions and tries to empty its pipelines. The IND bitis automatically set to disable interrupts while the initial interrupt handler executes. Interruptsremain disabled while execution continues throughout the kernel.After the processor finishes cleaning up its internal state, execution resumes at the INTvector handler in the exception vector table, excp_vects. The exception vector table contains1024 entries, one for each of the possible 88100 exception vectors. The INT vector saves theuser's stack pointer in a temporary scratch register, and branches to the kernel interrupt handlerINT_handler() in intr_handler.s:exci: br.n _INT_handler ; hardware device interruptstcr^r31, cr20^; cr20 <- user's stack pointer r31The first thing the interrupt handler does is check if the interrupt belongs to g88. g88 usesChapter 3. Kernel level implementation^ 78the ABRT and SWI7 interrupt for its functioning, so any of those interrupts are immediatelyredirected to the g88mon handler. The ABRT interrupt is set whenever the user presses control-C on the console. The SWI7 interrupt is used to propagate g88mon interrupts to all processors.All other interrupts are processed by the kernel.Interrupts are normally disabled while execution runs within the kernel. The only exceptionto this rule is when the kernel is executing the idle loop. If an interrupt occurs during the idleloop, no register state needs to be saved because the idle loop does not do anything useful. Theinterrupt handler checks the processor EPSR 3 register mode bit to see if the processor was insupervisor mode at the time of the interrupt. If so, then the idle loop was running, and thehandler can jump directly to the interrupt dispatcher intr_dispatch_s() without saving anystate.If execution was interrupted at the user level, the interrupt handler must save the user'sprocessor context. The user's context is comprised of 34 user level registers: r1 to r31, fperand fpsr, and the execution address. This context is saved in one of two possible places:• Temporary storage on the kernel stack. If user level preemption has been deferred byuser lock management (ie, if lock_count and in_upcall in register r28 are non-zero),then the register context is temporarily saved onto the kernel stack. These registers arerestored when the interrupt handler completes and returns control to the user.• The user level thread context saved area. If user level preemption is not deferred (ie, iflock_count and in_upcall in register r28 is zero), then the register context is saved intoa buffer maintained at the user level. A pointer to this buffer is retained in processorregister r29. When the user level thread scheduler runs a thread, it sets r29 to point tothe threads context save area. The interrupt handler saves the context directly into theuser supplied buffer.3The EPSR register is the 88100 exception time shadow register for the PSR register. The EPSR reflects thevalue of the PSR before the exception occurred.Chapter 3. Kernel level implementation^ 793.11.3 Dispatching interruptsOnce the processor context is saved, INT_handler() calls the interrupt dispatcher ininterrupt .c to process the interrupt. There are two interrupt dispatchers:• void intr_dispatch_s( int cpu, int intr_vec, int intr_status );This dispatch routine handles all interrupts while the processor was executing the idleloop (ie, when the processor is in supervisor mode). This dispatcher is fairly simple,because there are no requirements to clean up a currently executing task.• TASK *intr_dispatch(int cpu, int intr_vec, int *context, TASK *curr_task);This dispatch routine handles all interrupts occuring while the processor executes at theuser level. The routine has additional duties because the currently executing task mustbe properly managed.The interrupt dispatch module maintains a table, intr_table 0, which stores the handlersfor each of the 32 possible interrupts. Both of the interrupt dispatch routines use this table tolocate the appropriate interrupt service routine to handle the interrupt. The elements of thistable are defined by the INTR_HANDLER structure:typedef struct{TASK *user_task;void (*routine)(int, int, TASK *);int lock;} INTR_HANDLER;INTR_HANDLER intr_table[NUM_INTERRUPTS];The user_task field designates the task that should be invoked to handle the interrupt.This field is set to NULL if the interrupt handler is local to the kernel. If the handler is local tothe kernel, routine contains the address of the interrupt service routine.Chapter 3. Kernel level implementation^ 80The routines intr_register() and intr_deregister() manage the intr_table 0 entriesto allow a user level task or local kernel function to be the recipient of any interrupt.To handle a kernel level interrupt service routine, the dispatcher simply makes a functioncall to the service routine. But for user level handlers, the procedure is more complicated. Anaddress space switch may be required to activate the proper interrupt handler task.The dispatcher checks the appropriate interrupt entry, and if the currently executing taskis registered to handle that interrupt, then an INTR_UPCALL event is given to the task. Theparticular interrupt vector bit is recorded in the shared region intr_status 0 field. The userlevel interrupt dispatcher examines this field to determine the proper interrupt handler for theevent. This action is demonstrated by the following code from intr_dispatch_s 0:task->sr->intr_status[cpu] 1= 1 << intr_vec;task_upcall(cpu, task, 1 << INTR_UPCALL); /* no return */task_upcall() is a non-returning call in the task scheduler that activates the specified taskand issues an upcall into its address space.If the currently executing task does not handle the interrupt, but another task does, thenthe current task must be switched out and the interrupt handling task must be switched in.But before the current task is placed back onto the task ready queue, the thread which wasinterrupted must be cleaned up so that it can be rescheduled. While the register context forthe thread has been properly saved by INT_handler(), the user level thread kernel must benotified that one of its threads has been preempted, so the thread can be placed back on theuser level thread ready queue. To do this, the current task is issued an RELINQUISH_UPCALL.The user level upcall handler will then take the currently executing thread and place it backonto the thread ready queue. The upcall handler then immediately returns to the kernel viatask_intr_relinquish(). At this point, the current task is returned to the ready queue, andthe interrupt handler task is activated and issued an INTR_UPCALL to handle the interrupt.The Figure 3.17 flowchart summarizes the interrupt dispatching process.Chapter 3. Kernel level implementation3.11.4 Kernel managed interruptsSeveral interrupt sources are managed directly by kernel level service routines. These interruptvectors are the system clock tick timer (DTI), and software interrupt (SWI).Chapter 3. Kernel level implementation^ 82System clock tickThe system clock tick is the heartbeat generator for the kernel. The tick hardware used is thetimer component of the onboard MC68681 DUART chip. The Hypermodule connects the timeroutput to the DTI interrupt vector. The MC68681 is initialized at boot time to a default oruser specified interrupt rate.The clock interrupt service routine, DTI_service 0, checks the task t imer_q list, and readiesthe tasks that have expired times. If an idle processor is available, the processor is delivered aTASK_RUN_SWI interrupt to run the task.After all such tasks have been readied, a TICK_UPCALL event is issued to the currentlyexecuting user level on the local processor. This upcall event is used by the user level asa thread preemption timer to timeslice threads. The user level kernel counts the number ofTICK_UPCALLS it receives, and relinquishes its processor when the value reaches a quantum.There is a certain amount of trust placed in the user level to properly relinquish control whenits quantum has expired.Clock ticks are not issued to each processor in the system at once. Only one processor ata time has its clock tick interrupt vector enabled. When the tick occurs on a processor, theDTI_service() routine advances the tick interrupt to the next processor. Thus, tick interruptspropagate around all processors evenly in a round-robin fashion. This helps reduce the lockcontention that would occur if all processors tried to access the task data structures that thesame time.Remote processor interruptsThe Hypermodule contains a software interrupt register (SWI), which allows any processorin the system to interrupt any other processor. A remote processor is issued an interruptwhen its associated bit is written in the SWI register. An external data structure, SWI_MSG inshared_region.h, managed by the kernel software, allows event messages to be recorded aboutthe interrupt.Chapter 3. Kernel level implementation^ 83typedef struct{int msg;int task_id;int data;int more_data;int lock;} SWI_MSG;/* SWI event message type *//* event task id *//* event type data *//* more event type data *//* spin-lock to protect this entry */SWI_MSG swi_msgs[NUM_CPUS];Each processor in the system maintains its own swi_msg entry in a global swi_msgs 0table. To deliver an interrupt to a remote processor, a message and data is recorded into theprocessor's swi_msg entry. A lock must first be acquired on the swiinsg entry before accessto it is permitted. The local processor then sets the appropriate bit in the SWI register tointerrupt the remote processor.The remote processor will receive the interrupt, and use the SWI_service() routine tohandle it. SWI_service 0 examines the swi_msg entry and performs the appropriate actionbased on the message and data.The swi_msgs [] table is allocated on a single physical page which is shared read/writebetween all address spaces in the system. Also, the SWI register is mapped into all addressspaces. This allows any address space to efficiently deliver a remote interrupt. System callsfrom the user level are not required.The following four interrupt messages can be issued using the swiinsg entry:• TASK_SUSPEND_SWI - issued by task_suspend() to suspend the execution of a task on aprocessor.• TASK_DESTROY_SWI - issued by task_destroy() to suspend execution of a task on a pro-cessor.• TASK_RUN_SWI - issued by thread and task schedulers to initiate execution of a new threador task.Chapter 3. Kernel level implementation^ 84• TASK_SIGNAL.SWI — issued by task_signal() to send a signal event to a task.The interrupt module provides two routines to manage the delivery of remote interrupts.The caller specifies a char array list of processors to interrupt. A non-zero entry in the listindicates that the associated processor should be delivered an interrupt.• void intr_remote_cpus( char *cpus, int msg, int task_id );This routine is used to send an interrupt to a list of processors, specified by the list cpus.The swi_msg entry for each processor in the list is acquired, the message is written, andan interrupt is delivered.• int intr_remote_cpu( char *cpus, int msg, int task_id );This routine is used to send an interrupt to one of the specified processors in the cpuslist. If the swi_msg lock cannot be acquired (ie, if the swi_msg is currently being used todeliver an interrupt by another processor), then the processor is skipped and the next onein the list is tried. The routine returns OK after the first interrupt is delivered, or FAILEDif no swi_msg entry could be acquired.These two routines are not exported to the user level. Since the swi_msgs [] table andthe SWI register are mapped into the user address space, similar routines can be implementedthere, avoiding system calls to the kernel.3.12 User/Kernel Shared memory regionsThe Raven kernel relies on shared memory regions to help reduce communication costs betweenthe user and kernel levels. These regions contain high-use information that is frequently accessedwhen making scheduling decisions at the kernel and user levels. Instead of invoking system callsto pass this information between the user and kernel, read and write operations to the sharedregions are used to perform the same effect. Simple read and write operations are much cheaperthan system calls.Chapter 3. Kernel level implementation^ 85However, the use shared memory regions at the user level opens up system security holes.Errant program behaviour and rogue processes can adversely modify the information containedin the shared regions, causing any number of execution problems ranging from improper schedul-ing of tasks to fatal system crashes. This section summarizes the shared regions and describestechniques used to help avoid their abuse.3.12.1 The shared regionsThere are three shared memory regions between the user level and kernel. The table in Figure3.18 summarizes these regions and their protection status.Region name User Protection DescriptionKERNEL_INFO read-only General purpose kernel state.SHARED_REGION read/write Scheduling information between user/kernel.SWI_MSGS read/write For passing SWI interrupt messages.Figure 3.18: User/Kernel shared memory regions.The KERNEL_INFO structure is a read-only page that is mapped into all user level addressspaces. It is mapped read/write into the kernel address space.The SHARED_REGION structure is a read/write page that is pair-wise mapped between theuser/kernel address space. Each task maintains its own SHARED_REGION structure for schedulingpurposes.The SWI_MSGS structure is a read/write page that is mapped into all address spaces. It isused to pass software interrupt event information between processors. All address spaces requireaccess to this page so that software interrupts can be invoked without kernel intervention.3.12.2 Abuse PreventionThe current technique used by the Raven kernel to reduce the tampering of shared memoryregions is to map the regions at non-obvious locations in the virtual address space. If regions areallocated in sparse areas of the address space, chances are that erroneous or malicious programChapter 3. Kernel level implementation^ 86behaviour will result in memory access exceptions before the shared regions are discovered.However, this technique is certainly not failsafe.In order to fully protect shared memory regions at the user level from invalid access, theprocessor must have the ability to mark sections of memory in the user space as privileged.Accessing these sections would require the executing code to have the same level of privilege.The user level kernel could use this feature to enable its code and shared regions with a certainlevel of privilege, higher than the level given to user application code. However, most modernmicroprocessors, including the 88100, do not allow this kind of flexibility, so alternatives mustbe considered.One alternative to mimick multiple levels of privilege within the same address space isto dynamically change the address mapping tables as the privileged code executes. When afunction in the user kernel requires information from a shared region, it could map in theappropriate page, access the data, and unmap the page. Special purpose address mappingsystem calls could be coded very efficiently in assembler for this purpose. However, additionalcosts such as TLB misses and flushing could make this approach impractical. If the translationtables were accessible to user programs, then these system calls could be avoided, but thesecurity situation would be even worse due to the exposure of translation tables.Other techniques to improve security in this area are being investigated.Chapter 4User level kernel implementationEach user level task running in the system requires a user level kernel library. This librarycontains all of the operating system services that are not implemented in the supervisor kernel,such as: thread management, synchronization primitives, and interprocess communication.This chapter discusses the implementation of the user level kernel, and how it interacts withthe supervisor kernel and user programs.Figure 4.19 shows the modular breakdown of the user level kernel, and the source code filesinvolved. This section begins by discussing the user level upcalling dispatching mechanism. WeChapter 4. User level kernel implementationthen then continue by describing the threading environment and support modules.4.1 Upcall handlingWhen a task is loaded into the system, task_create () allocates a memory space, as in Figure3.8, and preallocates an initial stack for the task to execute in. This stack is known as theupcall stack, and is used by the processor to handle upcalls events into the task. Since upcallevents to the same task can occur in parallel on multiple processors, a separate upcall stack foreach processor is required. The upcall stacks are located at well known location at the end ofheap space.When the kernel upcalls into user space, it places the upcall event flags into the task'sshared region upcall_status D entry'. All upcalls into the user space are funnelled througha single dispatch routine, upcall_dispatch() in upcall. c. upcall_dispatch() scans the1 The this_cpu() function returns the local processor number.Chapter 4. User level kernel implementation^ 89upcall_status [] entry for event bits and handles each of the flagged events by calling theappropriate upcall handler, clearing the event bits as it goes. Figure 4.20 shows a typical set ofupcall handlers, including an exception handler routine and two interrupt device driver routines.Preemption is deferred at all times on the local processor during upcall handling. Thisprevents nested upcalls to occur. For example, if an interrupt occurs during upcall dispatching,the interrupt event is recorded, but control returns directly to the user level to proceed withupcall dispatching. The upcall dispatcher will notice the occurence of any event by examiningits upcall_status [] entry, and handle it appropriately. Preemption is deferred by setting thein_upcall flag in the processor register r28. The kernel level upcall dispatcher examines thisflag before issuing any upcall, and if set, does not issue an upcall.The upcall mechanism maintains a table of upcall handlers, upcall_table D . This tableis comprised of a static list of upcall handler functions, one for each upcall event type. Thekernel task management module defines 8 possible upcall events that must be handled by thedispatcher. The following Figure 4.21 describes the actions performed by each of the upcallevent handlers.A second upcall dispatch routine upcall_resume 0 is provided to invoke dispatching by theuser level spin lock library. When a spin-lock is held, upcall preemption is deferred until thelock is released. The lock release code checks to see if an upcall is pending, and if so, branchesto upcall_resume () to dispatch the pending upcalls.While upcall_dispat ch 0 is invoked by the kernel in the upcall context, upcall_resume 0is invoked within the calling thread's context. So before proceeding with upcall dispatching,upcall_resume () must first save the thread's context and put the thread back on the readyqueue.4.1.1 User level interrupt and exception handlersAll of the entries in the upcall_table [] are defined at compile time to point to well knownhandler routines in the user level kernel. However, two of these upcall handlers, excp_handler 0Chapter 4. User level kernel implementation^ 90Upcall Event Handler function DescriptionRUN_UPCALL run_handler () The task has been given a processor, and can beginscheduling threads.TICK_UPCALL tick_handler() A system clock tick occurred, schedule another thread. Aquantum count is incremented, and if expired, theprocessor is relinquished to the kernel.RELINQUISH_UPCALL relinquish_handler() The task scheduler has requested that this taskrelinquish the processor.TIMER_UPCALL timer_handler () The task timer service has indicated that a timed event inthis task has expired. Check the list of sleeping threadsand ready any threads that have expired waketimes.SIGNAL_UPCALL signal_handler() At least one message is waiting in the shared regionsignal queue. Dequeue each signal message and calltheir signal handlers.KILL_UPCALL kill_handler () The task scheduler has destroyed this task, so the userlevel state must be cleaned up, and control returned tothe kernel.EXCP_UPCALL excp_handler () A processor exception has occurred. The handler checksthe excp_s tatus [ ] entry to determine the exceptioncode and branches to the appropriate exception handler.INTR_UPCALL intr_handler () A device interrupt occurred that is registered for thistask. The intr_status [ ] entry is used to determinethe interrupt vector and user interrupt handler.Figure 4.21: Summary of upcall events and the user level upcall handlers.and intr_handler , furnish a level of indirection to user provided handlers.The user provided handlers are invoked from within the context of the upcall dispatchingmechanism. Preemption is deferred, and execution runs on the upcall stack. Handlers that exe-cute within this context must not invoke thread management routines that may cause a threadcontext switch to occur. Thus, handler routines must be careful not to block on semaphores orsuch. Spin locks are allowed because they do not involve context switching.Interrupt handersThe interrupt . c module maintains a table of user provided interrupt handlers, intr_t able ^ .Each entry in the table contains a pointer to a handler routine:Chapter 4. User level kernel implementation^ 91typedef struct{void (*routine)(int);int lock;} INTR_HANDLER;INTR_HANDLER intr_table[NUM_INTERRUPTS];The following two routines are provided by the interrupt . c module to manage the entriesin intr_table 0:• int intr_register( int intr_vec, void *handler );This routine registers a user supplied handler routine to be called whenever the hardwareinterrupt vector intr_vec is signalled. The list of hardware interrupt vectors are listedin the kernel file registers .h. The kernel system call intr_register user() is calledto register the interrupt with the kernel and set the interrupt enable bit for the vector.Returns OK if successful, or FAILED if an error occurred.• int intr_deregister( int intr_vec );This routine disables the interrupt vector intr_vec and removes the user level handlerroutine from the intr_table 0 . The kernel system call intr_deregister_user() is usedto disable the interrupt enable bit for the vector. Returns OK if successful, or FAILED ifan error occurred.Exception handersThe exception. c module is basically a mirror of the interrupt .c module. It provides thesame functionality, except that its attention is directed towards processor exceptions ratherthan interrupt exceptions.One use for the user level handling of exceptions is to trap the data access exception DACC.This exception occurs whenever a memory load or store operation happens outside of the userlevel address space. This could be used to implement a user level paging mechanism. Or,Chapter 4. User level kernel implementation^ 92a handler could be constructed that would detect thread stack overflows and automaticallyallocate more room. Handlers for other exceptions, such as code violations and division byzero, could be written to destroy the offending thread context, rather than destroying thewhole task.exception. c maintains a table of user provided exception handlers, excp_table 0 . Eachentry in the table contains a pointer to a handler routine:typedef struct{void (*routine)();int lock;} EXCP_HANDLER;EXCP_HANDLER excp_table [NUM_EXCEPTIONS] ;Similar to the interrupt . c module, two routines are provided to manage the entries inexcp_t able 0 :• int excp_register( int excp_vec, void *handler );This routine registers a user supplied handler routine to be called whenever the exceptionvector excp_vec is signalled. The list of hardware exception vectors are listed in the kernelfile registers . h. The kernel system call excp_register_user () is called to register theexception with the kernel. Returns OK if successful, or FAILED if an error occurred.• int excp_deregister( int intr_vec );This routine removes the user level handler routine from the excp_table 0 , and calls thekernel system call excp_deregister_user () to deregister the exception from the kernel.Returns OK if successful, or FAILED if an error occurred.4.2 User level spin- locksThe user level spin locking code, lock. s, provides the same interface as the kernel version, butit is implemented quite differently at the user level. The user level spin-lock mechanism needsChapter 4. User level kernel implementation^ 93to prevent thread preemption while a lock is held. If a thread is preempted while holding a lock,other threads in the system will busy-wait trying to acquire the lock until the thread is resched-uled and the lock is released. To solve this problem, a special locking protocol is implementedbetween lock_wait()/lock_free() and the kernel level upcall dispatcher mechanism.void lock_wait(int *lock);The lock_wait 0 call tries to acquire a lock by spin-waiting on a cached copy of the lock,just like the supervisor kernel version. However, when it does acquire the lock, the lock_countvariable in register r28 is incremented to notify the kernel level upcall dispatching mechanismthat a lock is held. If an interrupt or other task scheduling event occurs, the kernel upcalldispatcher checks the locks_held counter and defers the upcall event if it is non-zero. Theupcall dispatcher will not preempt the user level when the lock_count is non-zero.void lock_free(int *lock);lock_free() writes a 0 value to the lock variable, and decrements the lock_count variable.If lock_count reaches zero, and an upcall is deferred (bit 16 of r28 is set), then control ispassed immediately to the upcall_resume() call. upcall_resume() invokes the user levelupcall dispatcher, where the deferred upcall is properly handled.4.3 Thread managementThe thread management library manages the creation, destruction, and scheduling of lightweightthreads of control. Threads are preemptively timesliced, and and freely migrate from processorto processor, in an effort to balance workload. A central 32 level round robin priority queuemanages the scheduling ordering of threads.4.3.1 Thread descriptor structureThe thread descriptor control block structure, TCB defined in kernel.h, comprises the datathat makes up a thread:Chapter 4. User level kernel implementation^ 94typedef struct{int id;^ /* thread identifier */int state; /* scheduling state */int priority;^ /* priority level, 0 to 31 */char name[THREAD_NAME_SIZE+1]; /* string name for thread */int stack_region;^/* VM region of thread stack */int *context; /* thread register context area */int sem_wait_id;^/* sem this thread is waiting on */unsigned int wake_time;^/* time to wake this sleeping thread */QUEUE_LINK link;int lock;} TCB;TCB thread_table[THREAD_TABLE_SIZE];TCB *thread_free_q;int thread_free_q_lock;/* queue link in system queues *//* spin-lock for this thread */All thread descriptors are stored within the static thread_table [] array. At initializationtime, free thread descriptors are linked into the thread_free_q, to make descriptor allocationan easy dequeue operation.Threads are referred to by their index into this table, the identifier field id. The statefield maintains the threads scheduling state. priority contains the scheduling priority for thethread, a value from 0 to 31. The name field stores a string name for the thread (useful fordebugging purposes).Each thread is allocated a page-aligned context save area and stack from the kernel memoryallocator, vm_alloc 0. The region descriptor belonging to the stack is saved in the stack_regionfield. The address of the context save area is stored in the context field. The thread's stack ispositioned immediately below this context buffer. Figure 4.22 shows this relationship.The sem_wait_id field is used by the local sempahore library to store the identifier of theseamphore that the thread is blocked on. If the thread is ever destroyed while blocked on asemaphore, sem_wait_id is consulted to remove the thread from the appropriate semaphoreChapter 4. User level kernel implementationentry.The wake_time field is used by the thread_sleep 0 facility to record the future wakeuptime of the thread.The link field is a queue link that links the thread descriptor into the various user levelthread management queues. There are several of these queues: the ready queue, semaphorequeues, and IPC queues.Finally, the lock field provides a spin lock to protect against concurrent accesses to thethread descriptor data structure.4.3.2 Thread context switchingThe low level thread context switching routines provide the basic mechanism to load and savethe thread execution context. The thread context is a collection of 29 general purpose registers:r1 — r25, the frame pointer r30, the stack pointer r31, and floating point registers fper andfpsr. In addition to the processor registers, a thread instruction pointer is also maintained.Two assembler routines are provided by the ctxsw. . s module to assist in the saving andloading of thread contexts: load_context 0 and save_context 0.void load_context(int *context, int *thread_lock);This routine loads the processor registers with the context buffer provided by context. Thethread descriptor must be previously locked to prevent writes to the context area while it is beingloaded (if an interrupt occurs in the middle of loading a thread's context, the interrupt handlerChapter 4. User level kernel implementation^ 96must not overwrite this context or inconsistency would result). Processor register r29 is set topoint to the context buffer. (At interrupt time, the kernel interrupt handler consults r29 tofind the context save address.) When the context is finished being loaded, the thread descriptoris unlocked (hence the reason for passing the thread_lock parameter), and a jump instructionbranches to the address contained in the thread instruction pointer. load_context() neverreturns to the caller.A thread's context is saved in one of two ways: by the kernel interrupt handler whena interrupt occurs, or by the thread scheduler to switch out a running thread. The threadscheduler uses save_context() to save the calling thread's context save_context(int *context, int *thread_lock);save_context 0 saves the calling thread's register context into the supplied context buffer.Once saved, the thread descriptor lock is released. save_context 0 returns a non-zero valueto the caller. When the thread context is loaded again sometime in the future, the threadinstruction pointer returns back to the same location where save_context 0 was called from.Except in this case, a zero value is returned. Therefore, the caller must check the returnvalue to see if execution is continuing, or if the thread has been restored. The following codedemonstrates this:/* save the thread's context and drop through */if ( save_context(running_thread->context, &running_thread->lock) )/* When thread context is reloaded, control returns here. *//* This return statement jumps back to the thread.^*/return;/* normal execution drops through to here */4.3.3 Thread ready queueThe thread scheduler uses a 32 level round robin priority queue to order ready threads. Prioritylevel 0 is low, priority level 31 is high. A low priority thread will never be scheduled to run ifChapter 4. User level kernel implementation^ 97there is a higher priority thread waiting. However, this ordering does not span tasks. A lowpriority thread in a remote task will be allowed to run even though there are higher prioritythreads in the local task.The ready_queue.c module implements a priority queue structure and interface that issimilar to the kernel level task ready queue. The readyq_table 0 keeps track of 32 queues, onequeue for each priority level. The hint variable is used to cache the highest priority level witha non-empty queue, to help speed up dequeue operations. A spin-lock ready_q_lock protectsagainst concurrent accesses to the ready queue structures.static QUEUE readyq_table[NUM_READYQ_PRIORITIES];static int hint;static int ready_q_lock;Three routines are provided which operate on the above structures:• TCB *dequeue_ready_thread();Dequeues the next available ready thread from the ready queue, and decrements theshared region num_ready_threads counter.• void enqueue_ready_thread(TCB *thread);Puts the thread descriptor thread into the THREAD READY state, enqueues it onto the readyqueue, and increments the shared region num_ready_threads counter. If an idle remoteprocessor is available, an TASK_RUN_SWI interrupt will be delivered to that processor torun the thread. Otherwise, if a processor has not been previously requested, the kernelsystem call task_request_cpu(), drops down to the task scheduler to request anotherprocessor for the task.• void remove_ready_thread(TCB *thread);Removes the specified thread descriptor from the ready queue structure, and decrementsthe shared region num_ready_threads counter. This is used by the thread_destroy() andthread_suspend() routines to remove a thread from anywhere within the ready queue.Chapter 4. User level kernel implementation^ 984.3.4 Thread schedulingThe thread scheduler is the heart of the user level kernel. When the upcall dispatcher finisheshandling all of the upcall events, control is passed to the scheduler. Ready threads are dequeuedfrom the ready queue, and executed via load_context 0 . When a thread blocks, the schedulersaves the thread's context, and schedules a new thread.There are two main thread scheduling routines: sched() and sched_no_save (). Theseroutines are used internally by the thread package, and are not intended for general purposeuse by user programs A third interface, thread_sched() is available to user programs forvoluntarily relinquishing control.void sched(TCB *running_thread);The sched() routine is responsible for saving the running_thread context and dispatchinga new thread. The caller is assumed to already have placed the running_thread descriptor onthe proper queue; this routine only saves the context, it does not do any queueing. After savingthe caller's context, sched dequeues the next ready thread, sets its state to THREAD_RUNNING,and performs load_context 0 to execute the thread. If there is no ready thread available, thetask_relinquish() system call relinquishes control of the processor back to the kernel taskscheduler.void sched_no_save();sched_no_save () is the same as sched(), except that the caller's context is not saved. Thisis used primarily as the final call by the upcall dispatcher, which doesn't run in a thread context,to begin scheduling threads. sched_no_save() dequeues the next ready thread, sets its stateto THREAD_RUNNING, and performs load_context () to execute the thread. If there is no readythread available, the task_relinquish() system call relinquishes control of the processor backto the kernel task scheduler.Chapter 4. User level kernel implementation^ 99void thread_sched();thread_sched() is a user-callable thread library routine. This call performs a voluntarythread relinquishment. The caller thread is placed on the ready queue, and sched() is calledto save the thread context and schedule another thread.4.3.5 Thread creationAll threads are created using the library call thread_create(). thread_create() allocates afree thread descriptor from the thread_free_q, allocates a stack, and sets up the thread's initialexecution context. The execution context is built from the thread entry point, exit point, initialstack address, and parameters. The exit point for a thread is called when the thread "runs offthe end" of its function, and is set to the thread_destroy 0 thread_create( int *thread_id, void (func)(), char *name,int priority, int stack_size, int ready, int num_args,func specifies the entry point for the thread. name is a null-delimited string which identifiesthe thread (useful for debugging purposes). priority specifies the ready queue schedulingpriority.stack_size specifies the size of the buffer to allocate for the thread's stack. Currently, allstack buffers are allocated using the kernel memory allocator, vm_alloc 0. Thus, stacks arealways allocated on page boundaries. The smallest stack size is therefore 4096 bytes.The ready field specifies the scheduling state that the thread should be placed in after itis created. A non-zero value passed in ready signifies that the thread be placed on the readyqueue and executed if a processor is available. If zero is passed in ready, the thread is placedin the THREAD$USPENDED state, to be later resumed by thread_resume().The num_args field specifies the number of parameter arguments to pass to the createdthread. The thread arguments are specified immediately after the num_args field as a variableargument list.Chapter 4. User level kernel implementation^ 100If the thread creation was successful, the thread identifier is returned in thread_id, and OKis returned. Otherwise, FAILED is returned if an error occurred.4.3.6 Destroying a threadThe thread_destroy() routine removes a thread from the local task. The thread's stack isfreed, and the thread descriptor is placed back onto the thread_free_q. In a multiprocessorenvironment, however, the specified thread to destroy may be executing on a remote processor.In this case, an interrupt must be delivered to the remote processor to suspend execution ofthe thread.If the currently executing thread is destroying itself, then additional work needs to be doneto properly free the threads stack buffer. The currently executing thread cannot free its ownstack, or else execution will have no stack to continue with. Instead, the thread descriptoris placed in the THREADDYING state, and is queued onto the thread_dying_q. A special low-priority idle_cleanup() thread is created at initialization time, which waits for threads to beplaced on the thread_dying_q, and frees the thread stacks when they become thread_destroy( int id );The thread to destroy is specified by id.4.3.7 Suspending and resuming a threadA thread can suspend execution of itself using the thread_suspend() call. This routine placesthe thread descriptor into the THREAD_SUSPENDED state, and will not be executed again until itis resumed via thread_resume().int thread_suspend() ;This function suspends the caller thread thread_resume( int id );This function resumes normal scheduling priority of the specified thread.Chapter 4. User level kernel implementation^ 1014.3.8 Sleeping a threadThe thread sleeping facility is used to suspend the execution of threads for a predeterminedlength of time. The sleep facility is built on top of the task timer service, which generates upcallevents at specified times.The timer_handler() upcall routine and thread_sleep() routine share a sorted queue,thread_sleep_q. Sleeping threads are placed on this queue, sorted in the order of wakeuptime. When timer handler() is called, expired threads on the thread_sleep_q are placed onthe ready queue and thread_sleep( unsigned long sleep_time );thread_sleep() puts the caller thread to sleep for the specified number of clock ticks. Thenumber of clock ticks per second is a value that is interactively set at kernel boot time.4.4 Semaphore managementSemaphores provide a way to synchronize actions and communicate events between threads. Acommon use for semaphores is to provide mutual exclusion around a piece of sensitive code.Other uses include controlling producer/consumer type problems between threads. For example,the interprocess communication system uses semaphores to synchronize message passing eventsbetween threads.The user level semaphore library, sem.c and global_sem.c, provides two types of semaphoresthrough the same function call interface. Lightweight semaphores, local to an address spacecan be created for exclusive use between threads in the same address space (used by the localIPC implementation). Global semaphores can also be created, which allow threads in remoteaddress spaces to synchronize between each other (used by the global IPC implementation).The global semaphore implementation requires special hooks into the kernel, and is thereforeslightly more costly in terms of performance. A parameter in the sem_create 0 call allows thecaller to specify whether a local or global semaphore should be created.Chapter 4. User level kernel implementation^ 1024.4.1 Local semaphoresLocal semaphores are used only between threads in the same address space. They are speciallyoptimized for the local case, and no kernel involvement is required. The semaphore librarykeeps track of each semaphore using a statically allocated table of descriptors:typedef struct semaph_s{int id;^/* semaphore identifier */int state;^/* SEM_FREE or SEM_USED */int sequence;^/* for deletion protection */int count;^/* semaphore count variable */int lock; /* spin-lock protecting this entry */QUEUE wait_q;^/* queue of waiting threads */struct semaph_s *next_sem;} SEMAPH;SEMAPH sem_table[SEM_TABLE_SIZE];SEMAPH *sem_free_head;^/* semaphore free list */int^sem_lock; /* spin-lock protecting free list */The total number of semaphores in a task is bounded at compile time by the SEM_TABLE_SIZEconstant in the sem.h file. Each semaphore descriptor maintains a count value and a queue forstoring blocked threads. The sequence field allows the semaphore routines to check whethera deletion has occurred. Each semaphore descriptor is linked into a free list, sem_free_head,making semaphore allocation a simple operation.4.4.2 Global semaphoresGlobal semaphores can be used to synchronize events between threads in separate addressspaces, as well as their local address space. The global semaphore routines are similar insemantics to the local case. However, they differ completely in implementation, some kernelsupport is necessary.Each address space shares a global semaphore table, global_sem_table defined as:Chapter 4. User level kernel implementation^ 103typedef struct{int lock;int free_sem_q;GLOBAL_SEM sems[NUM_GLOBAL_SEMS];} GLOBAL_SEM_TABLE;GLOBAL_SEM_TABLE *global_sem_table;This stucture is allocated by the first task in the system, and is shared between all furthercreated tasks. The main element in this structure is the table of semaphore descriptors, sems.The free_sem_q field implements a list head pointer to start a linked list of free semaphoredescriptors.The first task that is loaded into the system uses vm_alloc0 to allocate a region for thetable. The region identifier is then stored in the global nameserver. Future tasks that areloaded into the system query the nameserver for the region identifier, and map the table inusing vm_share0.The global semaphore descriptors stored in sems comprise the semaphore's main data struc-ture unit:typedef struct{int id; /*int owner_task; /*int sequence; /*int count; /*int lock; /*int next_sem; /*} GLOBAL_SEM;semaphore identifier */owner task_id of this semaphore */destroy sequence counter */semaphore count */spin-lock protecting this entry */semaphore free list next pointer */Other data structures involved in global semaphore operation reside in the local sharedregion structure, and in the kernel. These data structures are used to support the queueing ofblocked threads in disjoint address spaces. Threads in different tasks can block on the sameglobal semaphore. The sem_signal 0 signal operation will unblock one of the threads, butwhich one? A fair protocol would unblock the thread that has been waiting the longest.Chapter 4. User level kernel implementation^ 104The local semaphore implementation supports this fairness by using a FIFO queue to sortthe blocked thread descriptors. A queue is also used in the global semaphore case to providefairness. However, this queue stores task descriptors, not threads.Each global semaphore in the system has a task queue associated with it. This queue linkstogether all the tasks in the system that contain threads which are blocked on the semaphore.Since multiple threads in the same task can block on several different semaphores, each taskin the system could be linked onto several different global semaphore queues. Thus, each taskneeds to maintain a table of queue links, one for each possible global semaphore.The table of queue links for each task is located in the task's kernel/user shared region, assem_links D. When a thread in a task blocks on a global semaphore, the task is linked ontothe semaphore queue using its local link from sem_links D. When a thread in another taskinvokes the sem_signal() operation, the next task in the queue is dequeued and delivered aGLOBAL_SEM_SIGNAL_SIGNAL message using the task_signal 0 mechanism. The signal handlerfor this message then wakes up and runs the appropriate thread.The semaphore task queues are maintained within the kernel. Whenever a user threadblocks on a semaphore, a system call is required to place the local task on the semaphorequeue: kernel_sem_enqueue(). However, since the task is now queued on the semaphore,further threads that block on the same semaphore do not invoke the kernel operation. Likewise,the kernel_sem_dequeue() system call removes the next task in the queue and delivers it asignal.4.4.3 Waiting on a semaphoreThe sem_wait() call decrements the semaphore count variable. If the count remains zero andabove, then the call returns immediately. Otherwise, the calling thread is enqueued onto thesemaphore's queue, and the thread scheduler is called to run the next thread. The thread willremain on the semaphore's queue until it is unblocked by sem_signal() or sem_destroy(), oruntil the thread is destroyed.Chapter 4. User level kernel implementation^ 105int sem_wait( int sem_id, int no_block );The semaphore is specified by sem_id. If no_block is set to a non-zero value, then sem_wait 0will always return to the caller, regardless of the value of the count variable. This function re-turns OK if successful, WOULD_BLOCK when no_block is set and the call would normally block,DESTROYED if the semaphore was deleted, or FAILED if the specified sem_id is invalid.4.4.4 Signalling a semaphoreThe sem_signal() call increments the semaphore count variable, and readies the next waitingthread from the semaphore's queue using enqueue_ready_thread(). The readied thread willeventually resume execution when a processor is sem_signal(int sem_id);The sem_id parameter specifies the semaphore to signal.4.4.5 Allocating a semaphoreThe sem_create 0 call allocates a free semaphore descriptor and initializes the descriptor en-tries. The next free semaphore descriptor is dequeued from the sem_free_head queue. Theinitial value of the semaphore count can be specified by the caller. Either global or localsemaphores can be allocated by specifying the global sem_create( int *sem_id, int count, int global );The initial semaphore count value is specified by the parameter count. Passing a non-zerovalue as the global parameter will create a global semaphore, otherwise a local semaphore willbe created.Chapter 4. User level kernel implementation^ 1064.4.6 Destroying a semaphoreWhen a program is finished with a semaphore, it should be returned to the system using thesem_de stroy ( ) call. Destroying a semaphore requires more work than enqueuing the descriptoron the free list, however. Any blocked threads waiting on the semaphore must be released, orthey would remain blocked forever. The blocked threads are dequeued one at a time, and readiedwith the enqueue.ready_thread() call. Before releasing the blocked threads, the semaphoresequence field is incremented. When a thread resumes execution in sem_wait(), it checks thesequence value against the previous value. If the values differ, then sem_wait() can return aDESTROYED value.When destroying a global semaphore, however, any remote tasks that are queued behind thesemaphore must be dequeued and delivered a GLOBAL_SEM_DESTROY_SIGNAL message. Since theglobal semaphore queues are maintained within the kernel, the kernel_sem_destroy 0 systemcall performs this function.4.4.7 Kernel interventionThe global semaphore library requires some special purpose kernel support for its operation.One of the main goals of the system is to reduce the overall number of these kernel interactions.This subsection summarizes the situations where kernel system calls are necessary during theglobal semaphore operations.The following steps are executed by the semaphore wait operation. A kernel call is onlynecessary if there are no blocked threads on the semaphore.1. Return if semaphore count is greater than zero.2. Block calling thread.3. Call kernel_sem_enqueue() if this is the first thread in the local task blocked on thesemaphore.The semaphore signal routine makes a system call only if there is a waiting thread:Chapter 4. User level kernel implementation^ 1071. If the semaphore count idicates a blocked thread, call kernel_sem_dequeue().2. Return to caller.4.4.8 Miscellaneous semaphore operationsThe following operations can be useful in certain situations.• int sem_count( int sem_id );This function returns the count value of the specified semaphore.• int sem_reset(int sem_id, int count);This function allows the caller to change a semaphore count value.4.4.9 Semaphore library initializationThe sem_init 0 semaphore library initialization routine first initializes the local semaphoredata structures. The sem_table [] is initialized and a free_list_q is created.Then, and the global semaphore initialization routine is invoked. This routine allocatesor maps in the global semaphore table, and initializes the local data structures. The globalnameserver is queried to see if the GLOBAL_SEMJ{EGIOLNAME string exists in the database. Ifso, the region identifier for the semaphore table is returned, and vm_share 0 is used to map inthe memory. If the string is not registered, then a new table is allocated with vm_alloc 0, andits region identifier is stored in the nameserver database.4.5 Interprocess communicationThe user level kernel supports both port-based synchronous send/receive/reply and asyn-chronous send/receive communication models. These models are sufficiently different in imple-mentation that each require its own library interface: the synchronous port library, rpc_port. cand grpc_port.c, and asynchronous port libraries, ports.c and global_ports.c.Chapter 4. User level kernel implementation^ 108The port based approach to interprocess communication uses a port number as the mailboxaddress for messages. In a client/server model, the server waits for messages to be received ona port, the client sends messages to a port. When a server cannot receive messages as fast asthey are sent, the messages are buffered by the port library. A port must be properly createdbefore any IPC can take place through the interface.For performance reasons, the port user level libraries distinguish between local ports andglobal ports. Local ports are used only for communication within a single address space. Theimplementation of these ports are based on local semphores and the local thread scheduler.Local ports are more efficient than global ports because all of their interaction is limited to thelocal address space.Global ports are used for communication across address spaces, as well as within. Theseports make extensive use of shared memory between the client/server address spaces to helpreduce communication costs. In addition to reducing data copying, the shared memory alsohelps reduce the number of kernel interventions required to invoke the port communicationprotocols.Local port messages are always passed by reference to reduce data copying. However, sincepointers don't make sense in remote address spaces, global port messages cannot use pointers.Instead, to emulate pass-by-reference, a virtual memory region descriptor of the data buffer canbe passed, and the vm_share() and vm_move () system calls can be used to map the region intothe local address space. This technique can eliminate data copying when moving data betweenaddress spaces. For small pieces of data, however, the overhead of memory mapping outweighsthe data copy, and therefore this technique should be reserved for larger chunks of data.4.5.1 The synchronous port libraryThe user level library supports the synchronous style send/receive/reply protocol. Using atransaction identifier returned by the receive operation, server threads can defer the reply stageuntil a later time. The transaction identifier is used by the reply call to identify the properChapter 4. User level kernel implementation^ 109client message to reply.The following two subsections describe the implementation of the local and global portlibraries. Then, the user interface is described.Local synchronous port implementationThe local synchronous port library maintains a table of port descriptors rpc_port_msgs 0 . Thisdescriptor is used to keep track of threads blocked on sends, receives, and replies.typedef struct port_s{int id;int state;int sequence;int send_sem;QUEUE send_msg_q;QUEUE reply_wait_q;/* port identifier *//* USED or FREE status *//* incremented at each port_destroy *//* semaphore to block receiver *//* queue of sender messages *//* queue of threads waiting for a reply */struct port_s *next_port; /* next port descriptor link */int lock;^ /* spin lock protecting this port */} RPC_PORT;The semaphore send_sem is used to block server threads on the receive operation. Thissemaphore counts the number sender threads currently blocked on a send. When a client threadsends a message, the message is queued into the send_msg_q, and the send_sem is signalled towake the server thread. The client thread is placed on the reply_wait_q and blocks waiting fora reply. The server thread dequeues the client from this queue and unblocks it when it performsthe reply operation.Each thread in a task is allocated its own synchronous message descriptor. All local messagesare passed by reference. This message descriptor maintains pointers to the passed data buffers,and a queue link for attaching to a port:Chapter 4. User level kernel implementation^ 110typedef struct{int id;char *send_data;int send_len;char *reply_data;int *reply_len;QUEUE_LINK link;} RPC_MSG;/* port id this message is queued on *//* pointer to send data *//* length of send_data *//* pointer to reply data *//* pointer to reply data length *//* link for port msg queue */Global synchronous port implementationEach task in the system shares a region of shared memory that contains a table of global portdescriptors. When a global port is created, a free descriptor is allocated from the free list. Aregion buffer for storing messages is allocated by the owner task, and is shared to all clientsthat want to interact on the port.This port descriptor maintains head/tail pointers to the shared port message buffers. Eachport descriptor contains three queues: the msg_free_q links the free message descriptors to-gether; the msg_send_q_head links the send messages together; and msg_reply_q_head whichlinks the reply messages together.A global semaphore in each port descriptor is used to control access to the port's msg_free_q.When a client thread sends a message, a message buffer is dequeued from the shared buffer. Ifthere are no free buffers, then the thread blocks on the msg_free_sem. When a server releases amessage buffer back to the free list, it performs a sem_signal 0 operation to unblock a waitingthread.The task signalling facility is used to communicate low level send/receive/reply events tothe client and server address spaces. These events are not sent on every interaction, but only inthe worst-case moments discussed below. The task signalling facility, task_signal(), is usedto communicate these events to remote address spaces. The following three signal messages aredefined for this use:• the GRPC_PORT_SEND_SIGNAL message is sent by a client thread to a server task with aChapter 4. User level kernel implementation^ 111blocked server thread to wake the server thread and tell it that a message awaits. If asend message is already present in the queue, then the library assumes that the serverhas already been delivered a signal message, so sending another is not necessary.• the GRPC_PORT_REPLY_SIGNAL message is sent by a server thread to a client task to indicateto the client task that a server is replying to a client's message. When the client taskreceives this signal, it can wake the client thread, and deliver the reply message to it.• the GRPC_PORT_DESTROY_SIGNAL message is sent to all client tasks that are blocked on aport when the port is destroyed. The client threads can then be unblocked and told thatthe port was destroyed.The task_signal 0 mechanism can deliver these messages without kernel intervention onthe local processor. If a remote processor is executing the destination task, then that processoris interrupted by the user level intr_remote_cpu() service. Otherwise, if a remote processor isidle, then that processor is interrupted to handle the signal message. If all processors are busy,then a system call is required to deliver the signal.Sending a synchronous messageint rpc_port_send( int port_id, char *send_data, int send_len,int *reply_data, int *reply_len);This function queues a message of length send_len bytes on port_id and blocks waitingfor a reply. For a global port, the send_data buffer is copied into the ports shared bufferregion. For a local port, a pointer to the data is passed. reply_data is assumed to point topre-allocated buffer space for the reply message, whose length is specified in reply_len. Uponsuccessful return, the global port reply data is copied in reply_data buffer and the length inreply_len. The call returns OK if successful, DESTROYED if the port was destroyed while waitingfor a reply, and FAILED if the port does not exist.The following client thread demonstrates this call:Chapter 4. User level kernel implementation^ 112void client(port){char reply_msg[20];char *send_msg = "send message";int reply_len;rpc_port_send(port , send_msg, strlen(send_msg)+1, &reply_msg, &reply_len);printf("reply message is:%s length:%d\n", reply_msg, reply_len);Before the client threa can queue a message into the shared port buffer, the buffer must bemapped into the client's address space. rpc_port_send() will automatically do this mapping onthe first message send. However, when a client is finished communicating across a global port,the port buffer must be unmapped explicitly. The rpc_port_dereference 0 call, describedbelow, does this.Receiving a synchronous messageint rpc_port_recv( int port_id, char **recv_data, int *recv_len,char **reply_data);This function blocks the calling thread and waits for a message to arrive on the specified port.When a message arrives, recv_data and recv_len is returned to contain a pointer and lengthof the received data, and the reply_data field points to the reply buffer. This buffer is usedby the server to copy the reply message into. When successful, the function returns a msg_id,which identifies the synchronous transaction. This msg_id is supplied to the rpc_port_reply 0function to issue a reply on the transaction. The following code demonstrates this interaction:void server(int port){char *recv_msg;int recv_len;char *reply_data;char *reply_msg = "reply message";Chapter 4. User level kernel implementation^ 113int msg_id;msg_id = rpc_port_recv(port, &recv_msg, &recv_len, &reply_data);printf("received message:%s length:%d\n", recv_msg, recv_len);/* copy reply message into supplied reply buffer */strcpy(reply_data, reply_msg);rpc_port_reply(port, msg_id, strlen(reply_msg));Replying a synchronous messageint rpc_port_reply( int port_id, int msg_id, int reply_len);This routine is called by the server thread to reply to an rpc_port_send() call. The msg_idspecifies the transaction number obtained from rpc_port_recv(). reply_len specifies thelength of data copied into the reply_data buffer from rpc_port_recv.Creating a synchronous portint rpc_port_create( int *port_id );This is the creation routine for local synchronous ports (global ports are sufficiently differentto require a separate routine). The created port is returned in grpc_port_create( int *port_id, int max_data_len, int num_msg_bufs);This is the creation routine for global synchronous ports. The maximum port messagesize, max_data_len, and queue size, num_msg_bufs, are specified so that a queue buffer can beallocated. The created port is returned in port_id. Returns OK if successful, FAILED if thereare no free ports.Destroying a synchronous portint rpc_port_destroy( int port_id );Chapter 4. User level kernel implementation^ 114This function destroys the specified synchronous port. Only the port creator's address spacecan destroy the port. All clients blocked on an rpc_port_send() are rpc_port_dereference( int port_id );When a client address space is finished sending messages to a global port, it can "deref-erence" the port. This causes the shared memory region between the client and server to beunmapped from the client address space.Kernel interventionThe interprocess communication library is designed to reduce the number of user/kernel interac-tions required to interact between client and server address spaces. Pure kernel implementationsrequire at least three kernel calls per send/receive/reply interaction: one call at each of the send,receive, and reply stages.The local port communication library does not require any kernel support at all. The globalport library does however require kernel system calls in certain cases. Global interprocesscommunication is based on the task-to-task asynchronous signalling mechanism. The amountof kernel intervention required is directly based on the amount of kernel intervention in thesignalling mechanism. An invocation of the signal mechanism breaks down into the followingsteps:1. If the destination task is running on a remote processor, deliver it an interrupt and return.2. If there is an idle processor, deliver it an interrupt and return.3. Otherwise, make a system call to perform the signal.The signalling mechanism uses software interrupts to deliver messages to remote processors,avoiding system calls on the local processor. A system call is required only if the third step isreached.Chapter 4. User level kernel implementation^ 115The synchronous IPC mechanism tries to reduce the number of kernel calls by invokingtask signals only when necessary. The following lists summarize the stages of the synchronousoperations and where system calls and task signals potentially occur.The rpc_port_send() operation requires at most one system call per invocation, or possiblya task signal invocation:1. sem_wait() for a free send buffer, causing a system call only for the first blocked thread(described in 4.4.7).2. Copy message into buffer.3. If the send queue is empty, deliver a signal to the destination task.4. Block sending thread.5. Wake up and copy reply message to local buffer.6. sem_signal() free buffer queue.The steps followed by the rpc_port_recv() operation does not require any kernel supportat all:1. If a message awaits, return a pointer to the message buffer.2. If the port is empty, block the thread.3. Wake up and return a pointer to the message buffer.rpc_port_reply() requires a task signal invocation only when there are no reply messagesqueued for the client:1. Enqueue the reply message.2. If there are existing reply messages in the queue, simply return.3. Otherwise, deliver a task signal to the client task.Chapter 4. User level kernel implementation^ 1164.5.2 The asynchronous port libraryAsynchronous send/receive communication can be used between threads when no reply messageis required. This method can more efficient than synchronous interactions because there ispotentially less context switching involved between the client and server thread. Client sendmessages are copied into a buffer, to be picked up by the server. Client sending threads do notblock unless the port queue becomes full.Threads communicate their message data through a FIFO message buffer. In the local portcase, this buffer is dynamically allocate into the local memory space. In the global port case, ashared region is allocated between client and servers, similar to the global synchronous ports.The local port implementation uses local semaphores to synchronize client/server access tothe port message buffer queues. The sender thread waits for free message buffer to becomeavailable. The free message buffer is dequeued, and the send message is copied into it. Themessage is queued onto the port send queue, and the send semaphore is signalled to wake theserver thread.The global port implementation relies on a shared region of port descriptors to manage theport queue. One global semaphore is used to synchronize access to the port free message queue.Client threads wait on the free message queue semaphore for buffers to become available. Theserver threads block waiting for sent messages.When a client thread sends a message, a GLOBALPORT_SENDSIGNAL message is delivered tothe server task to notify server threads that a message is waiting. This signal is sent only whenserver threads are blocked, and when no messages are queued in the send buffer. That is, thesignal is sent only when the send queue is empty.Sending an asynchronous messageint port_send( int port_id, char *msg, int msg_size, int no_block );Sends a message pointed to by msg to the specified port. The message buffer of lengthmsg_size is copied into the port's message queue. If the message queue is full, this routine willChapter 4. User level kernel implementation^ 117return immediately without queuing the message if no_block is set to NO_BLOCK. If no_blockis BLOCK, the calling thread will be blocked until room is made available in the message queue.Returns OK if the message was queued successfully, WOULD BLOCK if the queue was full, DESTROYEDif the port was destroyed while blocked, or FAILED if the port does not exist.The following client thread demonstrates this call:void client(port){char *send_msg = "send message";port_send(port, send_msg, strlen(send_msg)+1, BLOCK);Before a client thread in a remote address space can send messages on a global port, the portmessage queue region must be mapped into the client's address space. The port_send 0 routinewill automatically map in the port message queue on the first invocation. When communicationacross the port is later discontinued by the client, the port queue must be explicitly unmapped.The port_dereference() call, described below, performs this operation.Receiving an asynchronous messageint port_recv( int port_id, char *msg, int *msg_size, int no_block );Receives a message from the specified port. If the port queue is empty, this will returnimmediately if no_block is set to NO_BLOCK. Otherwise, the caller thread will be blocked untila message arrives. msg must be a preallocated buffer to hold at least msg_size bytes. Themessage is automcatically copied into this buffer when it is dequeued from the port. ReturnsOK if a message was successfully received, DESTROYED if the port was destroyed while blocked,WOULD_BLOCK if the caller thread would be blocked, or FAILED if the port does not exist.The following server thread demonstrates this call:Chapter 4. User level kernel implementation^ 118void server(port){char recv_msg[20];int recv_size;port_recv(port, recv_msg, recv_size, BLOCK);printf("message received:U length:%d\n", recv_msg, recv_size);Creating an asynchronous portint port_create( int *port_id, int max_msg_size, int queue_size, int global );This is the asynchronous port creation routine. The max_msg_size parameter specifiesthe maximum length in bytes of the messages passed on this port. queue_size specifies themaximum number of messages that can be queued up on the port. If global is PORT_GLOBAL,then a global port will be created. Otherwise, if global is PORT_LOCAL, then a local port willbe created. Returns OK if successful, FAILED if there was a parameter error, or if there are nomore port descriptors.Destroying an asynchronous portint port_destroy( int port_id );Destroys the specified local or global port. Remote and local threads blocked on an operationto the port will be woken with a DESTROYED failure result. Returns OK if successful, FAILED ifthe port does not exist, or if the port did not originate in the caller's port_dereference(int port_id);When client threads in an address space are finished accessing a port, this routine will cleanup the shared memory region associated with the port. Returns OK if successful, FAILED if theport does not exist.Chapter 4. User level kernel implementation^ 119Kernel interventionSimilar to the synchronous port library, the asynchronous library only requires kernel supportfor supporting global communication. The task signalling mechanism is used in the samemanner to help reduce kernel intervention.The port_send() call may require a system call if the sem_wait() call blocks, or a signalmay be delivered:1. sem_wait() for a free message buffer.2. Copy message into buffer.3. Deliver a signal only if the send queue is empty and a server thread is blocked.The port_receive() call may require a system call by sem_signal() to wake any blockedsender threads:1. Block if there are no messages.2. Wake up, dequeue message and copy it into the caller's buffer.3. Issue sem_signal() on the port free buffer queue.4.6 The global nameserverThe user level library maintains a simple nameserver data structure that is shared read/writeby all address spaces in the system. The nameserver database stores (string,integer) pairs,using the string as a search key. User programs can register strings, with them associated aninteger value. The integer value can be used to pass handles for various objects, or for othermiscellaneous purposes. For example, global ports can be registered, so that remote addressspaces can query for server port identifiers.The following structure is used for each entry in the nameserver_table ^ :Chapter 4. User level kernel implementation^ 120typedef struct{char string[NAMESERVER_STRING_SIZE];int data;} NAMESERVER_ENTRY;The complete nameserver interface is provided through the following three function calls:nameserver_register ( ) , nameserver_deregister () , and nameserver_find().4.6.1 Registering with the nameserverAny null-character delimited string, of up to NAMESERVER_STRING_SIZE characters, can be storedin the nameserver database. The register routine simply performs a linear search through thename server_t able 0 , and finds the first available nameserver_register( char *str, int data );The null-terminated string str is stored in the nameserver_table 0, with its associateddata value, data. If the table is full, FAILED is returned. The string and data value are copiedinto the table entry. The string is truncated if not enough space is available. Returns OK ifsuccessful.4.6.2 Deregistering a stringThe deregister routine removes a string and its data value from the nameserver table. A linearsearch find the table entry, which is then nameserver_deregister( char *str );The null-terminated string str specifies the string to invalidate. Returns OK if successful,or FAILED otherwise.Chapter 4. User level kernel implementation^ 1214.6.3 Nameserver initializationThe task initialization upcall must initialize the nameserver data structure. The initializationcode checks to see if a nameserver table already exists in the system. The nameserver regionidentifier is passed to a task through the TASK_ARGS structure when calling task_create 0.The first task in the system allocates the table using vm_alloc(). All child tasks are passedthe nameserver table region identifier through TASK_ARGS. Child tasks, when initializing, usesthe region identifier to map in the table with vm_share().4.7 User/User Shared memory regionsJust as the supervisor kernel implements shared regions between the user and kernel spaces, theuser kernel library also makes use of shared regions between user/user address spaces. Theseregions are used by the interprocess communication library and global nameserver to facilitatedata sharing between tasks. The supervisor kernel has no knowledge of these regions.These regions are susceptible to the same problems that the user/kernel shared regions face,although the affects of such problems are slightly less critical. Damaging a user/user sharedregion may prevent interprocess communication across a port, but it won't fatally crash thesystem as damaging the user/kernel shared region would.4.7.1 Globally shared regionsThe user level kernel library shares three regions between each task in the system. The firsttask that is loaded into the system and executed is responsible for allocating each region. Othertasks that are subsequently loaded share these regions. Each region is shared with read/writeprotection:• The global nameserver region, which contains a table of key/value pairs for passing handlesand other information.Chapter 4. User level kernel implementation^ 122• The asynchronous port descriptor region, which contains a table of asynchronous portdescriptors.• The synchronous port descriptor region, which contains a table of descriptors for thesynchronous port library.4.7.2 Client/Server shared regionsThe synchronous and asynchronous port creation routines allocate a buffer in the server taskto queue messages between the client and server. Clients that wish to communicate acrossthe port first map in the buffer region to their address space. So these regions are only madeavailable to those address spaces that ask for it.Other shared memory regions can be easily created by user programs for application specificpurposes.Chapter 5Performance EvaluationThis chapter presents some performance benchmarks for various aspects of the kernel operation.We begin by describing the tools used to obtain performance data. Then, several benchmarksare presented and analyzed.5.1 Benchmark toolsThere are two tools available in the system which can provide performance measurements.The Hypermodule's on-board Z8536 Counter/Timer can be configured as a 32-bit timer, withmicrosecond resolution. The following routines are available within the kernel and at the userlevel to control the timer:unsigned long timer_start();unsigned long timer_stop(unsigned long start);The timer hardware is initialized at boot time to continually cycle through a 32-bit counterregister. The counting frequency is 2MHz, giving the counter a cycle time of about 35.7 minuteswith 1/2 microsecond accuracy. The timer_start 0 function takes a snapshot of the counterregister, returning the counter value. This value is then passed to t imer_st op ( ) , where anothersnapshot is taken, and the time between snapshots is returned. This value should then bedivided by 2 to convert to microseconds.The second tool is only available while running under the simulator. The simulator providesa special device used to support an instruction tracing feature. This feature allows a processor'sexecution stream to be saved for later analysis, such as counting the number of instructions123Chapter 5. Performance Evaluation^ 124executed. A simulator device allows the tracing to be turned on and off, at program control, andcategorized into slots. The following macros can be used by any code to control the instructiontracing device:/* 'x' is the category slot */#define SIMTRACE_ON(x) (*(int *) SIMTRACE_CTRL = x)#define SIMTRACE_OFF(x) (*(int *) (SIMTRACE_CTRL+4) = x)5.2 Function calls vs. System callsTo help put the performance results in perspective, here we consider the cost of a null systemcall compared to the cost of a null procedure call. A null procedure with four parametersrequires at least 6 instructions on the 88100: four instructions to build the argument list, oneinstruction to call the routine, and one instruction to return./* func() is the null workhorse */void func(int a, int b, int c, int d){return;The kernel system call interface funnels all system calls through a single trap vector. Aninteger passed across the trap is used to determine the correct system call. The initial systemcall handler has the job of preserving user level state and setting up the kernel executioncontext. This involves some register saving and processor configuration. In all, 56 instructionsare required to invoke a null system call with 4 parameters.The basic setup cost and adminstration for system calls is an order of magnitude greaterthan a local function call. Table 5.2 summarizes this comparison.5.3 Thread management performanceThis section details the costs of thread creation and scheduling. Thread management operationsare primitives that closely affect the performance of tightly coupled parallel applications. InChapter 5. Performance Evaluation^ 125Instructionslocal function call 6kernel system call 56Table 5.2: 4 parameter user level function call vs. kernel system call.1 CPU 2 CPU 3 CPU 4 CPUthread_create 0 38.1 usec 35.7 usec 34.3 usec 34.1 usecdispatch and destroy 22.3 usec 16.2 usec 13.5 usec 11.4 usecthread_sched() 12.8 usec 8.2 usec 6.5 usec 5.1 usecTable 5.3: Thread management performance.these applications, threads are used extensively to peform jobs in parallel, and schedule eventsfrom one to another. Often, the raw performance of the underlying thread scheduler can becomea bottleneck.5.3.1 Thread creation performanceThis benchmark measures the overall performance of the thread_create() routine. A simpleprogram was written which creates 100 threads, but does not execute them. Execution timeswere collected for four different processor configurations: one, two, three, and four processorversions. In each case, the same number of worker threads is used to create the null thread asthere are processors. For example, the four processor measurement uses four server threads tocreate 25 null threads each, totalling to 100 threads.Table 5.3 summarizes the results. The basic uniprocessor thread create call completes every38.1 microseconds. This creation time could be significantly improved by modifying the threadstack allocator. Currently, thread_create() performs the system call vm_alloc() to allocateits thread stack. An algorithm that allocated a large number of thread stacks at once wouldreduce the expensive vm_alloc 0 calls.When more processors are added to the system, performance improves because there of theChapter 5. Performance Evaluation^ 126parallel creation, but only slightly, until levelling off after about 4 processors. This is probablycaused by a spin-lock contention bottleneck, seen in three places during thread creation: lockingthe free thread descriptor queue to allocate a descriptor; locking the task descriptor to performthe vm_alloc 0 system call; and locking the task's address map descriptor to perform thevm_alloc 0 system call. In addition to reducing the amount of code executed, removing thevm_alloc() call for stack allocation may also reduce lock contention.5.3.2 Thread resumption and destruction performanceUsing the 100 threads created for the previous benchmark test, the threads are resumed usingthread_resume () , executed, and finally destroy themselves using thread_destroy 0 . The lastthread to destroy itself prints out the execution time for the complete run.Table 5.3 summarizes the execution times for one, two, three, and four processors. A threadis executed and removed in 19.3 microseconds. This execution time includes putting the threadon the ready queue, invoking the scheduler to dispatch it, and invoking the thread_destroy()routine to kill the thread. Adding more processors has a more significant effect on the executiontime, compared to the thread creation benchmark. While there is still lock contention for theready queue list, this contention is not nearly as long as found in the vm_alloc() system call.Greater parallelism is achieved in this case.5.3.3 Thread context switchingThis benchmark measures the basic context switching performance of the thread_sched()library call. 10 threads are created, and each of them performs the following tight loop:void worker( iterations ){while ( iterations-- )thread_sched();Chapter 5. Performance Evaluation^ 127100,000 iterations were performed by each thread, and the overall execution time wasrecorded. This number is divided by 100,000 to determine the approximate performance ofthe thread_sched() call. The calling thread's context is saved, and its descriptor is placed onthe ready queue. The next available thread is dequeued and its context is restored.Table 5.3 shows the average performance of thread_sched(). A basic context switch fromone thread to another on a uniprocessor occurs in 12.8 microseconds. The ready queue offerssome lock contention for when additional processors are added. The remainder of the processortime is spent saving and loading the 29 thread context registers'.5.4 Interrupt handling performanceThe performance of interrupt handling is a critical concern for high speed device drivers andkernel scheduling performance. Interrupt handling must be as lightweight as possible to ensurelow latency dispatch times to device drivers. Interrupt handling and dispatching in a monolithickernel is fairly straightforward: trap the interrupt, save context, and call the interrupt serviceroutine. In the Raven kernel, since device drivers are implemented at the user level, deviceinterrupts must take the journey up into the user level for processing. However, once at theuser level, execution can continue with application processing.An experiment was constructed to measure the execution latency time to dispatch an in-terrupt to a service routine. The kernel interrupt handler was modified to take a timestamp atthe earliest convenience. This timestamp is then compared with the timestamp acquired at thebeginning of the interrupt handler. Three different interrupt handler scenarios were measured:1. A kernel level interrupt handler. Invoking this handler is a local kernel call.2. A user level interrupt handler in a task that is activated on the interrupted processor.Invoking this handler requires an upcall into the user space.1 A small optimization can be made during the register save operation because the register save is occuringduring a well known point in the thread scheduler. This allows less registers to be saved than the whole processorcontext.Chapter 5. Performance Evaluation^ 128Time (usec) Instructionskernel invoke 7.21 86user invoke 14.0 194user switch/invoke 30.6 421Table 5.4: Interrupt service routine invocation latencies.3. A user level interrupt handler in a task that is not currently activated on the interruptedprocessor. Invoking this handler requires that the current task be switched out and theinterrupt handler task be switched in. Performing this operation requires two upcalls intouser space and a system call.Table 5.4 summarizes the average times, in microseconds, to invoke each service routine.Also, the number of instructions per invocation is shown. The cheapest invocation time of 7.21microseconds is naturally inside the kernel. No special setup is required to upcall into userspace. Also, the user level register context can be saved in a cheaper fashion, since it will bedirectly restored by the kernel at the end of the interrupt.Invoking a user space handler is about twice as expensive. The user level register contextmust be properly saved into the user level context save area, and an upcall must be performed tothe user level. Once at the user level, the upcall dispatcher must place the previously executingthread on the ready queue, and finally call the service routine.Switching address spaces before calling the service routine is the most expensive invocationoperation. The old address space must be upcalled to handle any cleanup and placed on thetask ready queue before the new address space can be invoked.At first glance, this benchmark appears to show that user level device drivers are much moreexpensive than kernel device drivers because of the interrupt dispatching overhead. However,one must also consider that even a kernel device driver needs to communicate with the user levelat some point. User level code must eventually be executed to operate on the data providedby the device driver. Depending on the device, this may involve an extra data copy operationChapter 5. Performance Evaluation^ 129to move the data between the user application and kernel device driver. Moreover, there isthe additional costs of scheduling and activating the user application when the device driver isready for more. All of these costs are automatically taken care of by the interrupt dispatcherand upcall mechanism.5.5 Task signalling performanceThe asynchronous task signalling facility is used throughout the kernel to provide synchroniza-tion and event passing between tasks. The performance of this facility is an important factorfor global thread synchronization and interprocess communication.The signalling mechanism relies on interprocessor interrupts or a system call to deliver signalmessages. The sequence of steps performed by a signal invocation is summarized as follows:1. If the destination task is running on a remote processor, deliver it an interrupt and return.2. If there is an idle processor, deliver it an interrupt and return.3. Otherwise, make a system call to perform the signal.The remote interrupt mechanism allows the local processor to offload all of the queuing andtask invocation work to the destination processor. The initiating processor can continue withits own work while the signal is processed. If signal invocation makes a kernel call to performits work, the local processor handles the signal message queuing and readying, and dispatchingof the destination task.A benchmark was constructed to measure the difference between a software interruptedsignal message and a kernel mediated signal message. Two tasks were created. One taskcontains a signal handler that will be invoked by the other task. Three test cases were measured:a kernel mediated signal, a software interrupt with the remote task activated on the interruptedprocessor, and a software interrupt with the remote task not activated on the interruptedprocessor. The results for a single invocation are shown in Table 5.5.Chapter 5. Performance Evaluation^ 130Time (usec)kernel signal 32.4user active signal 13.2user inactive signal 15.6Table 5.5: Task signalling invocation latencies.The kernel mediated signal is more than twice as expensive as the software interrupt versions.This time is mostly consumed by the task switch that must occur on the local processor. Theinitiating task must be switched out, and the destination task must be switched in a deliveredthe signal. Sending a signal to a remote processor that is running the task is the least expensive.No task switching is required, but there is the cost of handling the interrupt and saving theinterrupt thread's context. Interrupting an idle processor is only slightly more expensive. Thedestination task must be activated, but there is no thread state to save.5.6 Interprocess communication performanceThe performance of the interprocess communication primitives is a vital factor for high-throughput client/server applications. In these types applications, IPC performance is themost common bottleneck.The user level IPC libraries make extensive use of shared memory and scheduling primitivesthat do not require kernel intervention. Two experiment programs were constructed to measurethe performance of the IPC libraries under various conditions. The first set of experiments testthe local interprocess communication primitives. Then a second set of experiments test theglobal IPC primitives. These libraries are sufficiently different in implementation that it makessense to test and analyze them separately.Chapter 5. Performance Evaluation^ 1315.6.1 Local communicationBoth synchronous and asynchronous port based communication was tested. The performanceof these libraries weighs heavily on the performance of the low level thread scheduling modules.The operations performed by the local IPC primitives are mostly thread context switching andenqueue/dequeue operations.The test program creates a port and some threads to communicate across the port. Thenumber of client and server threads was varied, as well as the number of physical processors. A4 byte message is passed several thousand times each case, and the individual results averaged.The first set of tests measures the asynchronous port performance. A port with a 20 elementqueue was created. The results of these tests are shown in Table 5.6. The first test creates 1sender thread and 1 receiver thread (denoted 1-send/1-recv). While the average send/receivetime for this case is 13.2 microseconds, the actual latency between a single send/receive pair ismuch higher. The 13.2 microsecond time is achieved because the sending thread can fill up theport queue at full speed, and then transfer control to the receiver thread which can spend all ofits time draining the queue. Thus the overall time is greatly reduced due to the amortizationof message buffering in the queue.The addition of more processors to this case does not help much. In theory, the sender shouldbe able to supply data fast enough to keep the receiver occupied. However, by examining theport status descriptors at various intervals during the benchmark, it appeared that the portremained full during most of the computation. This caused senders to block, thus forcingcontext switches between the senders and receivers. A more balanced workload on the senderside would have helped reduce this problem.The synchronous message passing case requires much more work per transaction, becausethe sender always blocks waiting for a reply. Therefore the 1-send/1-recv/reply uniprocessorcase is heavily bounded by the thread scheduling performance. Adding more processors causesadditional work to be done. This appears to be due to scheduling overhead. When a clientthread places a message on a port queue, the client will deliver a remote interrupt to the nextChapter 5. Performance Evaluation^ 1321 CPU 2 CPU 3 CPU 4 CPU1-send/1-recv 13.2 usec 12.6 usec 12.7 usec 12.7 usec10-send/10-recv 14.3 12.0 10.3 10.01-send/1-recv/reply 32.6 32.9 33.0 33.010-send/10-recv/reply 18.1 17.6 17.0 16.8Table 5.6: Performance of asynchronous and synchronous local ports, 4 byte data message.available idle processor to wake the server thread. Since there is only one server thread andone client thread, the cost of managing the context switch on the local processor turns out tobe less expensive than delivering an interrupt to a remote processor. Interrupting a remoteprocessor to run a thread causes that processor to examine the ready queue, increasing lockcontention for other processors in the system.If the client and server sides have more threads to work with, as in the 10-client thread10-server thread case, performance is greatly improved because the message queue can bemaintained at a non-empty state, and servers can read messages as fast as clients can placethem.However, none of these local port cases show much improvement when more processors areadded. Most of the time spent seems to be wasted on lock contention. There are two hot spots:the port spin-lock or the thread scheduler ready queue lock. The workload that the client andserver threads perform are basically null operations, so most of the execution time is spentchasing after locks within the thread scheduler.5.6.2 Global interprocess communicationThis section measures the performance throughput of the global interprocess communicationservice. This test combines many of the primitive system services: remote interrupt dispatching,task signal dispatching, global semaphores, and task and thread scheduling.The global IPC test cases create two address spaces: a server task, and a client task. Theserver task allocates a port descriptor containing 20 message buffers and advertises it on theChapter 5. Performance Evaluation^ 1331 CPU 2 CPU 3 CPU 4 CPU1-send/1-recv 43.2 usec 33.9 usec 32.1 usec 32.0 usec2-send/2-recv 44.5 32.2 31.8 31.010-send/10-recv 44.3 33.9 32.0 32.51-send/1-recv/reply 145 124 124.1 123.82-send/2-recv/reply 108 95.3 90.3 89.610-send/10-recv/reply 89.3 92.5 94.9 95.6Table 5.7: IPC performance for asynchronous and synchronous global ports, 4 byte data mes-sage.nameserver. A number of server threads are created to listen for messages on the port. Theclient task queries the nameserver for the port descriptor and creates a number of client threadsto bombard the server with messages.Table 5.7 contains performance results for various combinations of processors and threads.The simple case of 1-send thread and 1-receive thread synchronous send/receive/reply demon-strates the worst case performance of 121 microseconds per interaction. Each iteration requirestwo processor relinquishments and two upcalls. This figure is improved slightly in the 2 proces-sor case, because the client and servers reside on separate processors most of the time. Invokingthe remote task to signal a message is done via a software interrupt, and an address space switchis not required.However, synchronous performance does not increase by the expected amount when morethreads and processors are added. In fact, performance is reduced when more processors areadded to the 10-sender 10-receiver case. In a uniprocessor system, the all the sender threadsblock, then all the receivers return replies. However, when processors are added, the threadand task schedulers seems to constantly jump around in a form of hysteresis, causing morescheduling events to occur than optimal.The asynchronous message transfers are much faster overall because of the reduced contextswitching requirements between the client and server. The client threads have no problemkeeping the port queue full of data for the server threads. Increasing the number of threadsChapter 5. Performance Evaluation^ 134Async IPC Sync IPCtotal IPC calls 200,000 300,000kernel calls 16,000 56,000user interrupts 68,000 108,000Table 5.8: IPC primitive breakdown.results in an apparent slight performance hit. This is possibly due to the increased number ofthread descriptors being managed throughout the system. Both the single thread and multiplethread case are able to keep the port queue full, so have multiple threads on a single processordoes not help.As seen in the local case, performance sharply declines as more processors are added to thesystem. The reason for this again is lock contention. The workload performed by the clientand server threads is null, so all their effort is spent trying to access the port queue. Much ofthe overall processor time is spent waiting on the port queue.Kernel InterventionThe global interprocess communication library attempts to reduce the number of kernel in-teractions by using a task signalling mechanism that can send event messages without kernelintervention. To illustrate this, the number of low level primitive invocations was measured forthe 2-thread server, 2-thread client, benchmarks shown above for synchronous and asynchronousIPC.These measurements are broken into the following three categories: the total number of IPCprimitive calls, the number of kernel mediated calls, and the number of software interrupt me-diated calls. Table 5.8 summarizes the results for asynchronous and synchronous benchmarks.These results demonstrate the reduced dependency on kernel interaction of the user levelIPC library. Better results could be achieved using an improved task and thread scheduler.Such a scheduling system could properly schedule threads that are communicating so as toreduce the number of context switches and event messages.Chapter 5. Performance Evaluation^ 135Time (usec) Instructionsvm_alloc () 4KB 38.5 389vm_alloc () 64KB 149 1516vm_alloc () 1MB 2080 20637vm_free() 4KB 23.3 303vm_free() 64KB 145 1575vm_free() 1MB 2140 21745vm_share() 4KB 30.1 303vm_share() 64KB 116 1126vm_share() 1MB 1650 14682vm_move() 4KB 34.0 324vm_move() 64KB 132 1264vm_move() 1MB 1720 15353Table 5.9: Virtual memory operation execution times.5.7 Memory management performanceThis section performs some experiments to measure the execution time of various virtual mem-ory system calls. The system calls tested are: vm_alloc (), vm_free(), vm_share(), andvm_move 0 . Each call is benchmarked by repeatedly calling the routines using region sizesof 4KB, 64KB, and 1MB.The calls were all performed within a single thread, so no parallel performance numberswere measured. However, one can deduce that the parallel performance of these operations willbe relatively poor due to high lock contention. Approximately 90% of the work done by theseoperations occur within a critical section, protected by the address map lock, or by the taskdescriptor lock. This lock contention is on a per-address space basis, so parallel invocations willhappily coexist if allocations occur amongst disjoint address spaces.Each call was executed a number of times in a tight loop. Table 5.9 shows the averageexecution time and instruction counts for each call, for varying region sizes.The time to execute each call on a single page size of 4KB is relatively high compared tothe time for larger page sizes. This is due to the high setup costs for each call. Using multipleChapter 5. Performance Evaluation^ 136Frame size (bytes) Time (sec) efficiency128 18.6 45.1%512 9.03 92.9%1024 8.71 96.3%1500 8.57 97.9%Table 5.10: Time to transfer 10MB of data over sizes allows the per-page cost to be amortized over a large number of pages, thus reducingthe relative execution time drastically.5.8 Ethernet driver performanceA simple test program was constructed to test the performance of the Ethernet device driver.The test program runs on two machines connected by Ethernet. One machine continuouslysends data to the receiver machine. 10 megabytes of data is transmitted between the machinesusing a range of Ethernet frame sizes. Table 5.10 summarizes the results of the test, showingthe percentage efficiency compared to the 10Mbps Ethernet speed.Chapter 6Related WorkThis chapter examines recent work in the field of multiprocessor operating systems, and howit relates to the design of the Raven kernel. The operating systems topic is rather broad. Theresearch emphasis in recent years has turned away from implementing feature rich environments,to finding more efficient and streamlined ways of doing things. For example, rather than buildingan operating system that contains everything that anyone would ever need, recent research inthe field identifies the basic operating system components and improves upon them. The Ravenkernel was designed and implemented in the same spirit.Amongst these basic operating system components that are particularly relevant to mul-tiprocessor systems are critical section management, thread management and scheduling, andinterprocess communication. Much attention to these areas has been spent in recent years toimprove the performance and their characteristics.6.1 Low-level mutual exclusionThere is a large body of research work related to the implementation and analysis of mutualexclusion synchronization primitives on shared-memory multiprocessors. Early work in this areadetailed software algorithms where the only atomic operations provided by the hardware arememory read and write, [Dij65] [Knu66]. The main disadvantage of these software dominatedapproaches is their inefficiency. Since then, however, more powerful atomic operations suppliedby hardware has made mutual exclusion more efficient. Before looking at these operations,consider where mutual exclusion is used.Operating systems rely on low-level mutual exclusion algorithms to protect against parallel137Chapter 6. Related Work^ 138access to system data structures and hardware devices. In a uniprocessor system, a common usefor mutual exclusion are to protect data structures that are shared between interrupt handlersand device drivers. Disabling preemption by masking interrupts around sensitive code is aneffective way to provide this capability. However, the addition of multiple processors in asystem complicates the situation. Disabling interrupts alone does not stop other processors inthe system from accessing the protected resource.The technique of spin-locking has long been an elementary operation that can providemutual exclusion between separate processors. The algorithm is to spin-wait for a shared lockvariable to become available, and then mark it unavailable, thus claiming the lock. When thecritical section is over, the lock variable is marked available again. In many systems, the lock"acquire" stage requires an atomic operation, such as memory exchange or test-and-set.While lock contention can be reduced by carefully designing critical sections to minimizetheir overlap, spin-locking can be wasteful of available computing resources because of the busywait nature of the algorithm. In addition to completely consuming local processor cycles, thespinning read/write cycle of test-and-set can generate a constant barrage of memory transac-tions. In a shared memory multiprocessor environment where main memory accesses share acommon bus, this activity can degrade the performance of all processors in the system. Ex-periments in [And89] show that this algorithm is worthwhile for systems with less than sixprocessors. However, this experiment only shows the impact of memory bus contention againstspinning processors, and not processors doing other work.A simple optimization to the spin lock involves the use of memory caching to reduce globalmemory bus contention. This technique relies on cache coherency to maintain proper copies oflock variables. Rather than accessing the lock variable directly in memory, the lock variable isread into the local processor cache. All spinning occurs out of the cache. When a lock valuechanges, the system's cache coherency algorithm propagates the new value to the appropriatecaches. On some systems however, the cost of maintaining cache coherency can become a bot-tleneck. The Raven kernel implements the above method because the Hypermodule hardwareChapter 6. Related Work^ 139and the 88100 provide the necessary cache coherency protocols.In the absence of cache coherency, spinning memory transactions can be reduced by usinga backoff algorithm. If acquiring a lock fails, then delay for a period of time and try again.This algorithm is similar to the Ethernet's exponential backoff [MB76]. However in this case,performance can still be poor for a small number of spinning processors because the lock acquirestage will continue to backoff even when the lock is released. The waiting processor remainsconsumed by the backoff delay. This algorithm is not appropriate in the current implementationof the Raven kernel because of the small number of processors in the system. Also, the overheadrequired to implement backoff timing would consume a high proportion of lock acquire stage.The technique of queuelocks has been shown to reduce memory bus contention even in thepresence of many processors, [And89], [GT90], [MCS91a]. The idea behind queuelocks is tomake each thread spin only for one other thread to release a lock. If one thread waits for alock holder, another thread will wait for the first waiting thread. This relationship allows eachthread to spin on a different memory location. However, the advantages of this method areoffset by the additional overhead costs in the bookkeeping of lock queue data structures. Thisoverhead is not justified in the current implementation of the Raven kernel because of the smallnumber of processors.Experiments have shown that spin-locking on global locks does not scale well beyond eightprocessors [And89] [KLMO91]. While memory bus contention is significantly reduced usingcaching, performance eventually becomes bounded by spinning processor cycles. This bottle-neck appears to become a factor in systems with more than eight processors. Also, the cost ofcache coherency in some systems can impose other bottlenecks to the system.An alternative to spin-waiting involves techniques based on wait-free synchronization [Her91][Her90] [MCS91b] and data structures known as lock-free objects [Ber91] [MP91]. The ideahere is optimistic: allow concurrent data accesses without blocking. After a modification toa data structure is made, the algorithm checks to see if structures are consistent, and if not,the operation is rolled-back. However, these algorithms require additional hardware supportChapter 6. Related Work^ 140beyond simple test-and-set or compare-and-swap to operate efficiently. The Hypermodule and88100 instruction set does not directly support these algorithms, but they can be constructedusing more primitive features.6.2 ThreadsOperating systems have long provided lightweight threads of control to support a general pur-pose concurrent programming model for address spaces. Threads are used in uniprocessorsystems as a structuring aid and to help overlap input/output with computation. In multipro-cessor systems, threads are also used to exploit true parallelism. Several techniques for threadmanagement and their associated performance characteristics are measured in [ALL89].Thread management is usually implemented as either a kernel level service, or in useraddress spaces as a threading library linked with executables. Kernel level threads benefit frombetter integration with the other kernel supported services, such as priority scheduling andinput/output. The kernel maintains control over all scheduling decisions, so thread priorities canbe obeyed across address spaces. Threads performing input/output system calls or interprocesscommunication can be properly blocked and rescheduled as their operations complete.Traditional microkernel architectures such as Mach [TR87] and the V-Kernel [Che88] demon-strate the use of kernel threads. However, the performance of these systems inherently sufferdue to the costs of crossing user/kernel boundaries to perform thread management functions.Every single thread context switch and library call requires a kernel call. In addition, sincethe kernel level interface is usually intended to be used by all varieties of user programs, thethreading interface must be general purpose and cannot take advantage of any local specialpurpose optimizations.The Raven kernel implements threads at the user level to avoid the above performanceproblems and provide more convenient interfaces to the user.Pure user level threading implementations can perform thread management operations atleast an order of magnitude faster than their kernel level counterparts. The cost of invokingChapter 6. Related Work^ 141thread operations is at most the cost of a local procedure call, which in some cases can beoptimized to inline macro routines. Many such user level threading packages exist for the Unixenvironment [Go186] [Doe87] [SM90]. These packages multiplex a number of user defined threadson top of a single kernel implemented process. While these packages avoid kernel invocationfor most thread services, they introduce problems of their own:• Blocking system calls stop all threads in the address space. While select 0 can be used toalleviate this problem for routines such as open () , close 0 , read () , and write 0 , otherpotentially blocking calls such as mkdir (), rename (), ioctl (), stat 0 and asynchronousevents such as page faults are more difficult to deal with.• Poor performance resulting from improper scheduling decisions imposed by the kernelduring low-level thread mutual exclusion. Spin locks are commonly used between threadsin the same address to provide lightweight mutual exclusion (blocking semaphore man-agement is too heavyweight for some operations). If a spin lock is acquired and held by athread which is subsequently switched out, other threads in the system trying to acquirethe lock will hopelessly busy wait until the holder is allowed to complete its critical sec-tion. A thread can be switched out for a number of reasons, such as the expiry of a timequantum, or the arrival of other external interrupting conditions.One technique which tries alleviate parts of the above problems allows lightweight userlevel threads to be executed on top of kernel supported middle-weight threads of control. Thistechnique is used by Mach's C-Threads library [CD90] and SunOS's LWP [PKB+91] [SS92].The user level threads reside as data structures in the user level address space. Kernel levelthreads are used as virtual processors to execute the user level threads. User level threads canbe successfully scheduled around blocking system calls, but low-level synchronization problemsstill exist.To solve the scheduling problems that low-level synchronization code introduces requiressome special support by the operating system that can detect when it is inappropriate forChapter 6. Related Work^ 142context switching to occcur. In the Psyche operating system [SLM89], first class user levelthreads [MSLM91] share locking information between the lock management routines and thekernel. Soon before preemption is required, the kernel provides the user level with a two-minutewarning flag. User level synchronization code can check this flag prior to acquiring a lock, andvoluntarily relinquish control to the kernel if it deems necessary. While this technique doesnot completely remove inappropriate scheduling decisions, the number of them is significantlyreduced. The solution implemented in the Raven kernel eliminates this problem.The scheduler activations technique [ABLL92] provides a more sure-fire way of preventinglock synchronization problems by allowing critical sections to complete before preempting theprocessor. This is the same idea used by the Raven kernel, but scheduler activations implementsit quite differently.When an event occurs that would normally cause preemption during a critical section, anupcall into the user space occurs. The user level kernel recognizes that a critical section is inprogress by checking the interrupted address, and jumps directly to the code that will completethe critical section. This code is in fact a copy of the original, except that the tail end containsa relinquishment call to the scheduler.Instead of upcalling to the user level during a critical section, the Raven kernel defers thepreemption by setting an "upcall deferred" bit and returns to the user execution. At the endof the critical section, the user checks the bit and relinquishes if it indicates so.The scheduler activation technique allows lock operations to be as efficient as possible, be-cause they do not require to manage any preemption status variables. However, additionaloverhead is required by the upcall handler to dispatch control to the appropriate copy of thecritical section. The instruction pointer at the time of preemption must be examined and com-pared with a list of critical section handlers. This makes nested locking and critical sections withmultiple return points difficult to manage. Special compiler support is required to automatethe code copying. The upcall dispatcher in the Raven kernel aviods this mess altogether.Chapter 6. Related Work^ 1436.3 Interprocess communicationIn traditional operating systems, the kernel has mediated the interprocess communication mech-anism. User programs wishing to communicate with remote services were required to invokekernel operations to perform the communication protocol. This process is now seen as be-ing inefficient due to the increased relative costs of crossing user/kernel boundaries comparedto simple procedure calls. Interprocess communication systems are concentrating on reducingdata copying costs and latencies by using memory mapping techniques and software supportedscheduling mechanisms.Recent versions of the Mach 3.0 kernel have improved on typical kernel mediated IPC im-plementations by introducing the continuation [DBRD91]. Continuations facilitate the passageof execution control through the kernel scheduling primitives by allowing execution context tobe handed off to another thread. This can eliminate scheduling overhead and queueing withinthe kernel. In the fast path best case, the sender thread executes within the context of thereceiver thread.The Lightweight Remote Procedure Call (LRPC) [BALL89] mechanism also involves kernelintervention to pass messages between client and server, but takes advantage of architecturaldetails of the DEC SRC Firefly. Execution progresses through the kernel and into the remoteaddress space using a special purpose stack structure that is used by both the sender andreceiver. Frequently used parameters are cached in processor registers.As with the thread scheduling implementations discussed above, recent interprocess com-munication design have been removed from the kernel and implemented at the user level. Thisallows users to directly invoke IPC primitives without the added costs of crossing user/kernelboundaries. A level of indirection is removed because instead of invoking the kernel, the usernow directly communicates with the remote process.The URPC technique [BALL90] relies on pair-wise shared memory between the client andserver processes to pass message data. The message delivery system is controlled by low prioritythreads that poll the message queues looking for work. The threads only poll while the systemChapter 6. Related Work^ 144is idle. While this polling mechanism can produce low latency message transfers, this best casescenario only occurs when there is no other work for the system. Therefore, this model is notappropriate for systems with constant workloads. The Raven kernel is intended to be used inapplications where good IPC performance is required under load.The split level scheduling technique and memory mapped streams implemented in the con-tinuous media system [GA91] describes one way of using shared memory and scheduling tech-niques to reduce communication bottlenecks. Split level scheduling allows scheduling decisionsto be made at both the kernel and user level. A shared data structure between the user/kernellevel facilitates the communication of thread scheduling information. This sharing of informa-tion is similar to the Raven kernel, but in Raven, the amount of information shared is much less.In order to properly honour thread priorities and real-time events in remote address spaces,much more scheduling detail must be shared between the user and kernel.The address-valued signal mechanism introduced in [CK93] describes a hardware assistedlow level signalling mechanism. This is a hardware solution to the Raven kernel's task signallingfacility. The hardware maintains a FIFO queue of signal interrupts, making it especially easyand efficient for one address space to send a low level synchronization message to anotheraddress space. Virtual addresses are used to direct the signal to any particular address space,and the hardware handles the rest. The remote processor is interrupted and presented the nextsignal on the hardware FIFO queue. Higher level communication protocols, such as RPC, canbe built with this low level mechanism.Chapter 7Conclusion7.1 SummaryThis thesis presented a new multiprocessor operating system, known as the Raven kernel. TheRaven kernel provides a multitasking, timesliced, environment for user level programs to ex-ecute in. The system provides the notion of tasks, virtual memory, threads, and interprocesscommunication. However, unlike traditional microkernel architectures, the Raven kernel imple-ments many of these services completely in user space. The motivation behind this design wasto improve system performance by reducing the number of user/kernel boundary changes.An overview of the runtime environment used for the Raven kernel was provided. Thehardware platform used is the Motorola 88100 four processor Hypermodule, with 32MB ofshared memory. A special kernel debugger based on gdb allows kernel code to be interactivelydebugged and tested.The implementation of the kernel services was then described. The kernel consists of severalmodules that provide three main services to user level programs:• Task management (allocation and scheduling).• Virtual memory management (memory allocation and mapping).• Low level upcall dispatching.The description of the user level kernel followed. The following operating system serviceswere implemented completely at the user space:• Preemptive thread scheduling. Threads migrate from processor to processor in an effortto balance the load.145Chapter 7. Conclusion^ 146• Device interrupt handlers. Hardware interrupts are efficiently funnelled up from the kernelto user level registered handlers.• Semaphores, used to synchronize events between threads in remote address spaces.• Interprocess communication. A synchronous and asynchronous port based messagingscheme was implemented using shared memory queues and low level synchronization rou-tines.• A nameserver database for fast lookup of global names.A set of primitive low level event signalling routines made interprocess communication andscheduling possible without kernel intervention:• The intr_remote_cpu() processor interruption facility delivers hardware interrupts toremote processors.• The task_signal() signalling facility sends asynchronous event message to remote tasks.The performance results show that reduced kernel intervention and improved performanceis possible using these techniques. However, the tradeoff between performance is stability. Toreduce communication bottlenecks between address spaces, extensive use of shared memoryregions is employed. These shared memory regions are left exposed to malicious processes orerrant program behaviour. Therefore, the system is suited towards dedicated environmentswhere programs are trusted.7.2 Future WorkThe Raven kernel is intended to provide the basis for an efficient and lightweight parallelprogramming environment for high-speed parallel applications. Work will be continued in thisarea to improve performance and add functionality.The performance results showed that scheduling and lock contention overhead contributedto most of the interprocess communication bottleneck. The current task and thread schedulerChapter 7. Conclusion^ 147makes scheduling decisions based solely on the round-robin fairness scheme. A better task andthread scheduler could be designed which would identify communicating threads, and try toschedule them together to reduce context switching bottlenecks.Appendix AKernel system call interfaceThis appendix summarizes the supervisor level system call interface that is available for generalpurpose user programs. The first section presents the system calls intended by use for user levelkernels. The second section presents the interface intended for user application programs.A.1 System calls provided to the user level kernelThis section summarizes the system calls provided to user level kernels. User application pro-grams should not call these routines directly.A.1.1 Task management system callsint task_timer_event( int wakeup_time );int task_request_cpu();int task_relinquish_cpu();int task_intr_relinquish_cpu();int task_cleanup( int task_id );A.1.2 Interrupt management system callsThese system calls register and enable interrupts to the user intr_register_user( int intr_vec );int intr_deregister_user( int intr_vec );A.1.3 Exception management system callsThese system calls register and enable exceptions to the user level.148Appendix A. Kernel system call interface^ 149int excp_register_user( int intr_vec );int excp_deregister_user( int intr_vec );A.1.4 Global semaphore managementint kernel_sem_enqueue(int sem_id);int kernel_sem_dequeue(int sem_id);A.2 System calls provided for application programsThis section summarizes the system calls provided by the supervisor kernel for applicationprograms.A.2.1 Task management system callsint task_suspend( int task_id );int task_resume( int task_id );int task_signal( int task_id, int signal, int user_data );int task_create( int *task_id, int priority, TASK_EXEC_HDR *hdr,int code_region, int data_region, TASK_ARGS *args );int task_destroy( int task_id );int task_info( int task_id );A.2.2 Virtual memory management system callsint vm_alloc(int *region_id, int task_id, void **addr, int size, int attrb );int vm_free(int region_id);int vm_move(int region_id, int dest_task_id, void **dest_addr, int dest_attrb);int vm_share( int src_region_id, int *dest_region_id, int dest_task_id,void **dest_addr, int dest_attrb );int vm_map_device( int *region_id, void *phys_addr, void **addr, int size );int vm_unmap_device( int region_id );A.2.3 Console input/outputvoid kprint(char *str);Appendix A. Kernel system call interface^ 150int kgetstr(char *str, int len);A.2.4 Program loaderint read_exec_hdr(char *filespec, TASK_EXEC_HDR *hdr);int read_exec( int code_region, int data_region );Appendix BUser kernel library call interfaceThis appendix summarizes the user level kernel library call interface that is available for generalpurpose user programs. All of these calls are prototyped in the <user/threads . h> header file.B.1 Thread managementint thread_me();int thread_sleep( unsigned long sleep_time );int thread_suspend();int thread_resume( int id );void thread_sched();int thread_create( int *thread_id, void (func)(), char *name, int priority,int stacksize, int ready, int num_args, ...);int thread_destroy( int id );B.2 Synchronization primitives/* Spin lock routines */void lock_wait(int *lock);void lock_free(int *lock);/* semaphore routines */int sem_wait( int sem_id, int no_block );int sem_signal( int sem_id );int sem_reset( int sem_id, int count );int sem_count( int sem_id );int sem_create( int *sem_id, int count, int global );int sem_destroy( int sem_id );B.3 Asynchronous Send/Receive port IPCint port_send( int port_id, char *msg, int msg_size, int no_block );int port_recv( int port_id, char *msg, int *msg_size, int no_block );151Appendix B. User kernel library call interface^ 152int port_create( int *port_id, int max_msg_size, int queue_size, int global );int port_destroy( int port_id );/* for global ports only -- these do memory mappings of port queues */int port_reference(int port_id);int port_dereference(int port_id);B.4 Synchronous Send/Receive/Reply port IPCint rpc_port_send( int port_id, char *send_data, int send_len,char *reply_data, int *reply_len );int rpc_port_recv( int port_id, char **recv_data, int *recv_len,char **reply_data );int rpc_port_reply( int port_id, int msg_id, int reply_len );int rpc_port_create(int *port_id);int rpc_port_destroy(int port_id);/* special calls for global rpc_ports */int grpc_port_create(int *port_id, int max_data_len, int num_msg_bufs);int rpc_port_dereference(int port_id);B.5 Nameserverint nameserver_find(char *str, int *data);int nameserver_register(char *str, int data);int nameserver_deregister(char *str);B.6 User level memory management/* Zone memory allocator routines */void *zone_alloc( int zone_id );int zone_free( void *buf );int zone_create( int *zone_id, int size, int alloc_size );int zone_destroy( int zone_id );void *malloc(int size);#define free(buf) (zone_free(buf))B.7 Interrupt and exception managementint intr_register( int intr_vec, void *handler );int intr_deregister( int intr_vec )Appendix B. User kernel library call interface^ 153int excp_register( int excp_vec, void *routine, int global );int excp_deregister( int excp_vec );Appendix CUnix versionA Unix version of the user level interface was implemented to aid in the development andtesting of user level programs. The Unix version implements a non-preemptive thread scheduler,properly integrated with Unix filesystem I/O using a select 0 wrapper.The following function prototypes document the Unix version interface:int thread_me();int thread_sleep( long sleep_time );int thread_suspend( int id );int thread_resume( int id );void thread_sched();int thread_create( int *thread_id, void (func)(), char *name, int priority,int stacksize, int num_args, ...);int thread_destroy( int id );int sem_wait( int sem_id, int no_block );int sem_signal( int sem_id );int sem_reset( int sem_id, int count );int sem_count( int sem_id );int sem_create( int *sem_id, int count, int global );int sem_destroy( int sem_id );int port_send( int port_id, int *msg, int msg_size );154Appendix C. Unix version^ 155int port_recv( int port_id, void **msg, int *msg_size );int port_create( int *port_id, int msg_size, int queue_size, int attrib );int port_destroy( int port_id );/* for Unix I/O */int Read( int fd, char *buf, int nbytes );int ReadN( int fd, char *buf, int nbytes );int Write( int fd, char *buf, int nbytes );int WriteN( int fd, char *buf, int nbytes );int Open( char *filespec, int flags, int mode );int Close(int fd);int Socket( int domain, int type, int protocol );int Accept(int fd, struct sockaddr *addr, int *addrlen);int Connect(int fd, struct sockaddr *name, int namelen);Bibliography[ABB+86] M. Accetta, R. Baron, W. Bolosky, D. Golub, R. Rashid, A. Tevanian, andM. Young. Mach: A new kernel foundation for UNIX development. In SummerConference Proceedings. USENIX Association, 1986.[ABLL92] Thomas E. Anderson, Brian N. Bershad, Edward D. Lazowska, and Henry M. Levy.Scheduler activations: Effective kernel support for the user-level management ofparallelism. ACM Transactions on Computer Systems, 10:53-79, February 1992.[ALBL91] Thomas E. Anderson, Henry M. Levy, Brian N. Bershad, and Edward D. La-zowska. The interaction of architecture and operating system design. In FourthInternational Conference on Architectural Support for Programming Languages andOperating Systems, pages 108-120, 1991.[ALL89] Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy. The performanceimplications of thread management alternatives for shared-memory multiprocessors.IEEE Transactions on Computers, 38(12):1631-1644, December 1989.[And89] Thomas E. Anderson. The performance implications of spin-waiting alternativesfor shared-memory multiprocessors. In International Conference on Parallel Pro-cessing, pages 11-170-11-174, 1989.[BALL89] Brian N. Bershad, Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy.Lightweight remote procedure call. In Proceedings of the 12th ACM Symposium onOperating System Principles, pages 102-113, Litchfield Park, AZ, 3-6 December1989. Published as Operating Systems Review, volume 23, number 5.[BALL90] Brian N. Bershad, Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy.User-level interprocess communication for shared memory multiprocessors. Tr-90-05-07, University of Washington, July 1990.[Bed90] Robert Bedichek. Some efficient architecture simulation techniques. Winter 1990USENIX Conference, January 1990.[Ber91] Brian N. Bershad. Practical considerations for lock-free concurrent objects. Cmu-cs-91-183, Carnegie-Mellon University, September 1991.[BRG+88] David L. Black, Richard F. Rashid, David G. Golub, Charles R. Hill, and Robert V.Baron. Translation lookaside buffer consistency: A software approach. Cmu-cs-88-201, Carnegie-Mellon University, December 1988.[CD90] Eric C. Cooper and Richard P. Draves. C threads. Technical report, Departmentof Computer Science, Carnegie Mellon University, September 1990.156Bibliography^ 157[Che88]^D.R. Cheriton. The v distributed system. Communications of the ACM, 31(3):314-333, March 1988.[CK93]^David R. Cheriton and Robert A. Kutter. Optimizing memory-based messaging forscalable shared memory multiprocessor architectures. Technical report, ComputerScience Department, Stanford University, 1993.[Com84]^Douglas Comer. Operating system design, the Xinu approach. Prentice Hall, 1984.[DBRD91] Richard P. Draves, Brian N. Bershad, Richard F. Rashid, and Randall W. Dean.Using continuations to implement thread management and communication in op-erating systems. In Proc. 13th SOSP., 1991.[Dij65]^E. W. Dijkstra. Solution of a problem in concurrent programming control. Com-munications of the ACM, September 1965.[Doe87]^Thomas W. Doeppner. Threads: A system for the support of concurrent pro-gramming. Technical Report CS-87-11, Department of Computer Science BrownUniversity, Providence, RI 02912, June 1987.[GA91]^Ramesh Govindan and David P. Anderson. Scheduling and IPC mechanisms forcontinuous media. In Proc. 13th SOSP., pages 68-80, Asilomar, Pacific Grove, CA,13 Oct. 1991. Published as ACM. SIGOPS.[Go186]^Murray W. Goldberg. Pthreads. Technical report, Department of Computer Sci-ence, Univeristy of British Columbia, 1986.[Gro90]^Motorola Computer Group. MVME188 VMEmodule RISC Microcomputer User'sManual. Motorola, 1990.[GT90]^G. Graunke and S. Thakkar. Synchronization algorithms for shared-memory mul-tiprocessors. IEEE Computer, June 1990.[Her90]^Maurice Herlihy. A methodology for implementing highly concurrent data struc-tures. Second ACM SIGPLAN Symposium on Principles and Practice of ParallelProgramming, March 1990.[Her91]^Maurice Herlihy. Wait-free synchronization. ACM Transactions on ProgrammingLanguages, January 1991.[JAG86]^M.D. Janssens, J.K. Annot, and A.J. Van De Goor. Adapting unix for a multipro-cessor environment. Communications of the ACM, September 1986.[KLMO91] Anna R. Karlin, Kai Li, Mark S. Manasse, and Susan Owicki. Empirical stud-ies of competitive spinning for a shared-memory multiprocessor. In Proc. 13thSOSP., pages 41-55, Asilomar, Pacific Grove, CA, 13 Oct. 1991. Published asACM. SIGOPS.Bibliography^ 158[Knu66]^Donald E. Knuth. Additional comments on a problem in concurrent programmingcontrol. Communications of the ACM, May 1966.[MB76]^R. Metcalfe and D. Boggs. Ethernet: Distributed packet switching for local com-puter networks. Communications of the ACM, July 1976.[MCS91a] J. M. Mellor-Crummey and M. L. Scott. Algorithms for scalable synchronizationon shared-memory multiprocessors. TOCS, 9(1):21-65, February 1991. Earlierversion published as TR 342, URCSD, April 1990, and COMP TR90-114, Centerfor Research on Parallel Computation, Rice UNIV, May 1990.[MCS91b] J. M. Mellor-Crummey and M. L. Scott. Synchronization without contention. InPROC of the Fourth ASPLOS, pages 269-278, Santa Clara, CA, 8-11 April 1991.In CAN 19:2, OSR 25 (special issue), and ACM SIGPLAN Notices 26:4.[Mot88a] Motorola. MC88100 User's Manual. Motorola, 1988.[Mot88b] Motorola. MC88200 User's Manual. Motorola, 1988.[Mot88c] Motorola. MVME188BUG 188Bug Debugging Package User's Manual. Motorola,1988.[Mot88d] Motorola. MVME6000 VMEbus Interface User's Manaual. Motorola, 1988.[MP89]^Henry Massalin and Calton Pu. Threads and input/output in the synthesis kernel.In Proceedings of the 12th ACM Symposium on Operating System Principles, pages191-201, Litchfield Park, AZ, 3-6 December 1989. Published as Operating SystemsReview, volume 23, number 5.[MP91]^H. Massalin and C. Pu. A lock-free multiprocessor OS kernel. Technical ReportCUCS-005-91, Department of Computer Science, Columbia University, February1991.[MSLM91] B. D. Marsh, M. L. Scott, T. J. LeBlanc, and E. P. Markatos. First-class user-levelthreads. In PROC of the Thirteenth SOSP, pages 110-121, Pacific Grove, CA, 14-16October 1991. In OSR 25:5.[PKB+91] M. L. Powell, S. R. Kleiman, S. Barton, D. Shah, D. Stein, and M. Weeks. Sunosmulti-thread architecture. Proceedings of the Usenix 1991 Winter Conference, 1991.[SLM89]^M. L. Scott, T. J. LeBlanc, and B. D. Marsh. A multi-user, multi-language openoperating system. In PROC of the Second Workshop on Workstation OperatingSystems, pages 125-129, Pacific Grove, CA, 27-29 September 1989.[SM90]^Inc. Sun Microsystems. Lightweight processes. SunOS Programming Utilities andLibraries, March 1990.[SS92]^D. Stein and D. Shah. Implementing lightweight threads. Proceedings of the Usenix1992 Summer Conference, 1992.Bibliography^ 159[Sta89]^Richard M. Stallman. The GNU gdb debugger. The Free Software Foundation,1989.[TR87]^A. Tevanian and R.F. Rashid. MACH: A basis for future UNIX development.Cmu-cs-87-139, Carnegie-Mellon University, June 1987.


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items