UBC Theses and Dissertations

Unified Language and Operating System Support for Parallel Processing
Acton, Donald William, 1994


UNIFIED LANGUAGE AND OPERATING SYSTEM SUPPORT FOR PARALLEL PROCESSING

by

Donald Acton

B.Sc. (Computer Science), University of British Columbia, 1982
M.Sc. (Computer Science), University of British Columbia, 1985

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES (Department of Computer Science)

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
April 1994
© Donald Acton, 1994

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Abstract

The programming of parallel and distributed applications is difficult. The proliferation of networks of workstations, combined with the advent of shared-memory machines for desktop and office use, is making parallel and distributed environments more commonplace, and there is an increasing demand for general purpose applications to be able to make use of these extra processing resources.

This thesis explores the addition of language and system support for general purpose parallel and distributed programming to the object-oriented language and system Raven. The system is targeted for operation in shared-memory and distributed-memory environments where the processors are general purpose and execute their own independent instruction streams.

Support for parallelism forms a fundamental part of Raven. To simplify the creation of parallelism, Raven introduces the concepts of class-based and user-based parallelism. Class-based parallelism is parallelism created within a class, and it is realized through early and delayed result. User-based parallelism results from the use of a class and is realized through companions and certificates. Raven also supplies automatic concurrency control, even across machine boundaries. Concurrency control is attached to an object through the more general property mechanism. Properties are a new way of supplying system services to objects on an individual object basis. Raven also introduces the notion of invoke streams, which are a way for a third party to sequence invocation requests targeted at the same object.

To demonstrate the viability of these ideas, an extensive implementation was performed. Raven runs on several different machines and operating systems, including a 20-processor shared-memory machine. To demonstrate the usability of the parallel constructs, and the efficiency of the implementation, several parallel applications were written and performance measurements made. Included are implementations on shared-memory and distributed-memory machines that are identical except for a few lines of code specifying the system configuration. A distributed mail system is also presented to highlight the support for writing distributed applications.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgement

CHAPTER 1  Introduction
1.1 The General Problem
1.2 Thesis Statement
1.3 Contributions
1.4 Thesis Overview

CHAPTER 2  Background and Related Work
2.1 Distributed and Multiprocessor Environments
2.2 Parallel Programming Models
2.3 Supporting Parallel and Distributed Computing
2.4 Support Requirements
2.5 Summary

CHAPTER 3  The Raven System and Language Overview
3.1 The Passive and Active Object Models
3.2 Raven
3.3 Summary

CHAPTER 4  Support for Parallel and Distributed Computation in Raven
4.1 Specifying Parallelism in Raven
4.2 Class-Based Parallelism
4.3 User-Based Parallelism
4.4 Properties
4.5 Object Concurrency Control
4.6 System Resource Management
4.7 Object Transparency
4.8 The System Object
4.9 Failures and Exceptions
4.10 Unique Raven Language Features
4.11 Summary

CHAPTER 5  Implementation Details and Issues
5.1 Object Implementation
5.2 Memory Allocation and Garbage Collection
5.3 The Thread Environment
5.4 Remote Invocation
5.5 Summary

CHAPTER 6  Performance
6.1 Performance Metrics
6.2 Example Problems
6.3 Performance Analysis Overview
6.4 Summary

CHAPTER 7  Conclusions and Future Research
7.1 Summary of Results
7.2 Future Work

Bibliography
Glossary
Appendix A  The Raven Language
A.1 Introduction
A.2 BNF Syntactic Notation
A.3 Syntax of the Raven Language
A.4 Support Classes
A.5 The Raven Compiler

List of Tables

TABLE 2.1 Summary of language/system support for parallelism
TABLE 6.1 SPEC results for selected processors
TABLE 6.2 Elapsed time and speedup for distributed Mandelbrot computation

List of Figures

FIGURE 1.1 Target environment
FIGURE 3.1 Objects invoking on each other
FIGURE 3.2 Raven pencil cup example
FIGURE 3.3 Pencil cup class hierarchy
FIGURE 3.4 Eraser as a subtype of Weighable
FIGURE 3.5 Class organization
FIGURE 3.6 Tagging a parameter as copyable
FIGURE 3.7 Objects with different invoke routines
FIGURE 3.8 Example of a remote invocation
FIGURE 4.1 Example code demonstrating early result
FIGURE 4.2 Use of delayed result
FIGURE 4.3 Examples of certificates and companions
FIGURE 4.4 Example use of CertificateGroups
FIGURE 4.5 Use of CertificateGroups and tags
FIGURE 4.6 Some potential parallel invocations with serial components
FIGURE 4.7 Example stream creation and use
FIGURE 4.8 Creating objects with different properties
FIGURE 4.9 Concurrent object access
FIGURE 4.10 Object invoking back on self
FIGURE 4.11 Lock sharing
FIGURE 4.12 Sequential code to update a display
FIGURE 4.13 Conversion of sequential code to parallel version
FIGURE 4.14 Object interactions in a distributed Raven environment
FIGURE 5.1 Object organization in memory
FIGURE 5.2 Format of global ID information
FIGURE 5.3 Instance data of an object of type class
FIGURE 5.4 Pseudo code illustrating steps involved in performing an invoke
FIGURE 5.5 Processor efficiency and memory allocation delays when no heaps are used
FIGURE 5.6 Processor efficiency and memory allocation delays when multiple heaps are used
FIGURE 5.7 Layers of the Raven system
FIGURE 5.8 Actions undertaken upon companion creation
FIGURE 5.9 Pseudo-code illustrating early result
FIGURE 5.10 Delayed result example and pseudo code describing actions taken by system
FIGURE 5.11 Remote invocation sequence
FIGURE 5.12 Example of the data fields for an invocation request
FIGURE 5.13 Creating a name server
FIGURE 5.14 Example of stream usage
FIGURE 6.1 Subset of Raven code to perform the Mandelbrot computation
FIGURE 6.2 Mandelbrot set speedup
FIGURE 6.3 Subsection of multiple-bit adder
FIGURE 6.4 The main body of the execute method in the Worker class
FIGURE 6.5 Sequential and parallel control portions of Bayesian search
FIGURE 6.6 Speedup for Bayesian search space
FIGURE 6.7 Barrier class to synchronize and time a collection of Threads
FIGURE 6.8 Speedup vs. processors using a modified measuring technique
FIGURE 6.9 Sequential and parallel control portions of prime number search
FIGURE 6.10 Speedup vs. processors for prime number searching
FIGURE 6.11 Efficiency vs. processors for the prime number searching routine
FIGURE 6.12 Partial results of the Mandelbrot computation performed in a distributed environment
FIGURE 6.13 A distributed mail application
FIGURE 6.14 Example early result usage in distributed mail application
Acknowledgement

I would like to thank my supervisor, Gerald Neufeld, for his help, guidance, and prodding in completing this thesis. My thesis committee of Norm Hutchinson, Jim Little, Alan Wagner, Sam Chanson, and Peter Lawrence also deserves recognition for their comments and help when requested.

Grad school would not have been the enjoyable experience it was without the many people who have become my friends and colleagues over the last few years. To Ian, George, Terry, Peter, Barry, Murray, Carlin, the Mikes, Andrew, Andy, and all the people I am forgetting, thanks.

A special thanks goes to the Amor family, who have adopted me and become my family away from home. Finally, I would like to thank my parents for their constant support and love during this endeavour.

CHAPTER 1  Introduction

Parallelism and distribution of operations and functions are ubiquitous in everyday tasks in the real world. A person might wash dishes while watching television, dry one load of clothes while washing another, or simultaneously cook the peas, carrots, corn, and potatoes for a meal. However, parallelism is not confined to daily activities: computer systems use it too. A computer system itself typically engages in parallel activities like performing simultaneous reads and writes to disk drives, or accessing a network while displaying information on a console. In addition, a timesharing system extends the notion of parallelism by providing the illusion of running multiple programs at once. After leaving the domain of the operating system, however, and advancing to the application program, any appearance of parallelism has usually disappeared and the programmer is provided with a sequential model to program in. Of course, operating in a sequential environment makes the task of programming easier for many applications; however, those tasks that inherently involve some degree of parallelism are more difficult to program when parallel activities are constrained to operate sequentially.

The need for programs to be able to express and harness parallelism is not strictly confined to applications that have a natural parallel expression but are being forced to adapt to a sequential environment. The increasing power of microprocessors, the proliferation of computer networks, and the development of multiprocessors are increasing the need for applications to be able to deal with parallel and distributed computing. Problems once confined to supercomputers are now being tackled by workstations, and the processor cycles in idle computers attached to networks represent untapped resources. Some mechanism is needed to operate in these types of parallel and distributed environments.

The increased power of microprocessors is also creating the demand for applications that have substantial parallel components. For example, machines targeted at providing multimedia environments are now available. With this environment a program can be expected to record an audio message while playing back a stream of audiovisual data in real time. This data might be obtained from a local CD-ROM or from a continuous-media file system accessed through a high speed network.
Certainly, these types of environments will demand more of the programmer with respect to the number and variety of devices that a program will need to control simultaneously. This correspondingly increases the demand for facilities to manipulate and control parallelism from within the application. Additionally, limiting an application program to using a serial approach needlessly restricts the problem-solving techniques that can be used on inherently parallel applications. For example, the processing of a video stream that needs to be synchronized with an audio stream involves processing operations that are logically separate, yet require synchronization for playback. Although applications of this nature are in their infancy, they will become more commonplace as processor speeds increase and more sophisticated applications are developed. The result is an increasing need to support parallel and distributed computing.

1.1 The General Problem

To harness the power of parallelism it is imperative that parallel systems be easy to use while simultaneously providing the developer with the software tools and run-time environment required to achieve maximum performance. The advent of small, inexpensive, powerful microprocessors has made it possible to assemble collections of processors into parallel machines with potentially supercomputer performance. The result is that parallel machine environments are becoming more accessible and commonplace.

For parallel machines to come into common usage, a change in the development environment and run-time support for parallel systems is required. Changes will be needed, similar to those that revolutionized the uniprocessor world as its programming environment advanced from toggle switches to high level languages. If the parallel programming revolution succeeds it will result in a computing environment capable of allowing a large class of programs to be programmed easily and run efficiently. Until the task of programming a parallel application is comparable in simplicity to that of programming a sequential application, the use of parallel and distributed computing will remain restricted.

1.1.1 Hardware Boundaries

Within the class of general purpose machines, two trends are emerging with respect to computing environments. One trend is the proliferation of workstation or desktop class machines connected via networks; the other is the emergence of shared-memory multiprocessor machines. Several vendors are already offering, or are expected to offer, shared-memory systems targeted for desktop and office use [85][90][25]. Both the shared-memory machines and networks of connected machines can be characterized as collections of independent processors that execute their own instruction streams. Machines of this type are called multiple instruction multiple data (MIMD) machines. The processors can communicate with each other either by shared memory, a network, or both. Figure 1.1 depicts one possible computing configuration. In the figure, circles represent the independent processors, the large rectangle represents shared memory, and the thick black line represents a network.
The system essentially consists of a shared-memory machine with four processors sharing a network with three other processors.

FIGURE 1.1 Target environment

As high-speed networks become prevalent, and shared-memory machines become common, it is reasonable to expect that the computing environments of the future will consist of some collection of shared-memory processors connected to other machines by a high-speed network. With the development of such systems, there will be a demand for applications that effectively use the processor resources of such systems. The problem of providing support for parallel and distributed computing, as addressed by this thesis, is narrowed by restricting the systems to operate in a shared- or distributed-memory system where each processor is autonomous and executes its own instruction stream.

1.1.2 Software Boundaries

When writing a program, the programmer must decide upon the general organization of the program and the degree and type of functionality the application is to have. Once the basic operating parameters for the application have been defined, the programmer must then decide upon the data structures, the functions and procedures, and, ultimately, the actual code that implements the solution. These tasks are difficult in their own right, and the programming of parallel and distributed applications greatly increases the magnitude of the problem. In parallel and distributed programming the programmer must also be concerned with decomposing the application into processes, deciding how processes are to communicate with each other, synchronizing process execution, and controlling access to shared resources. In addition, the programmer may also be concerned about the particular hardware environment the application is to run in. The approach used to solve a particular problem may be highly dependent upon whether the application will be running in a shared-memory or distributed-memory environment. For example, in a distributed environment the programmer may have to be concerned with how data is exchanged between machines and how to deal with any incompatibilities that result from running in a heterogeneous processor environment, while in a shared-memory environment the potential for data access conflicts is greatly increased.

In both shared-memory and distributed-memory environments, parallel programming packages have concentrated on programming facilities for scientific applications where high performance is crucial. Systems like the NX/2 operating system for Hypercubes [82], and Chameleon [6] and vendor specific packages [89] for shared-memory machines illustrate this point. These systems sacrifice ease of use for performance. Little support is available to allow more general purpose applications to make use of parallelism. Yet, the proliferation of threads packages [33][79] strongly suggests the desire to use multiple threads of control as a way for general purpose applications to use parallelism as a structuring technique for inherently parallel tasks and as a way to improve performance for desktop shared-memory multiprocessor machines. Although threads packages provide the bare tools for writing parallel programs, they address few of the difficult problems associated with writing parallel and distributed applications described in the previous paragraph.
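The "bare tools" that such threads packages supply can be made concrete with a small sketch. The example below is written in Python rather than the C-oriented packages cited above, and the shared counter is invented for illustration; the point is that the package hands the programmer thread creation, joining, and a lock, while the decomposition of the work and the discipline of protecting shared data remain entirely manual.

```python
import threading

counter = 0
counter_lock = threading.Lock()    # shared data must be protected by hand

def worker(iterations):
    global counter
    for _ in range(iterations):
        with counter_lock:         # nothing forces callers to take the lock;
            counter += 1           # forgetting it silently corrupts the count

threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)                     # 40000, but only because locking was manual
```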
A good example of a general purpose application that could make better use of parallelism is a graphical user interface to a system. On the display, when a button is pressed and a long-running action initiated, it should be possible to press other buttons and have additional requests serviced. Currently, applications generally freeze and do not accept additional input until the current task is completed. If parallelism were easier to create and manage, the task of servicing multiple requests simultaneously would be simplified.

The problem of providing support for parallel and distributed computing as addressed by this thesis is narrowed by restricting the systems to operate in a shared- or distributed-memory system where each processor is autonomous and executes its own instruction stream. In the software environment, this work is restricted to providing support for parallelism for general purpose applications where the emphasis is on moderate performance improvements in the hardware environments described in the previous section. This should not preclude having programming language constructs that allow the writing of classical performance-oriented scientific applications; rather, the focus is on ease of use in writing general purpose system level software and not on raw performance.

1.2 Thesis Statement

The software currently available for programming general purpose applications in shared-memory and distributed-memory environments, independent of the base operating system and the underlying hardware, is in a primitive state. The programming abstraction used in programming these environments can be greatly simplified by developing an object-oriented language and system that provides a single unified object view of system services, language constructs, and object location. Furthermore, the additional processing power available from these processor configurations can be harnessed by embedding support for parallel and distributed processing directly into the language and system so that any object can be the target of parallel requests. The requirements for such a system can be met by:

• Taking advantage of the object-oriented paradigm to identify the unit of parallelism and the type of actions that can be performed in parallel.
• Integrating operating system services directly into the language, thereby freeing the application program from direct interaction with any operating system interface that is dependent upon the underlying system.
• Providing a uniform object access environment across shared-memory and distributed-memory environments, thereby freeing the programmer from the need to manage how objects in different environments interact.
• Implementing the language/system on shared-memory and distributed-memory hardware, thereby demonstrating the viability of the ideas, the portability of the system, the capability to operate in a heterogeneous environment, and the ability of all the components to operate together.
• Undertaking performance measurements, thereby demonstrating the effectiveness of the system for parallel and distributed computation.

1.3 Contributions

The research contributions of this thesis work are detailed below, with further explanation appearing in Section 1.4.

1. The development of a model for parallelism in object-oriented systems. The model exploits the strong data encapsulation of methods and the method invocation interface to specify parallelism. The parallelism can be specified either by the user of a method (user-based parallelism), or by the implementor of a method (class-based parallelism).
2.
Class-based parallelism is realized through early and delayed result, while userbased parallelism is realized through certificates and companions.3. The introduction of invocation streams as a way to serialize invocation requeststargeted for the same object.4. The introduction of the notion of properties as a convenient way to provide operating system services to objects on an instance-by-instance basis, while avoidingthe class explosion problem.CHAPTER 1: Introduction 85. The specification of a target virtual machine interface to aid portability. The usefulness of this was demonstrated by porting the Raven system to several differenttypes of hardware running different operating systems. The porting operationwas accomplished by efficiently implementing the virtual machine, using the services of the native system.6. The provision of system-level support for automatic concurrency control torelieve the programmer of the task of managing concurrent access to objects.This is demonstrated by developing single-threaded applications in a uniprocessor environment and converting them to run in parallel on shared-memory multiprocessors, without adding additional code to manage concurrency.7. The specification of a uniform programming interface that allows applications tobe programmed independent of knowledge about the underlying hardware, be ita shared-memory or distributed-memory environment.8. Providing an implementation of the proposed system to demonstrate the viabilityof the features required to support parallel and distributed programming.9. The programming of several examples to demonstrate the usability of the systemfor writing parallel and distributed applications in both shared-memory and distributed-memory environments.1.4 Thesis OverviewIn this section, a broad overview of the contents of the thesis are first presented. The overviewis followed by a more detailed description of contents of the various chapters. The remainderof the thesis is organized in the following way:• Chapter 2 provides an overview of the work done in parallel and distributed systems relevant to the work of this thesis. The strengths and weaknesses of the various systems are examined to arrive at the features important to supportingparallel and distributed processing.• Chapter 3 introduces the Raven language, which serves as platform for the ideasand features introduced by this thesis.• Chapter 4 introduces the design features introduced into Raven to support parallel and distributed computing.• Chapter 5 provides details of the implementation of the features described inChapter 4.CHAPTER 1: Introduction 9Chapter 6 demonstrates Raven’s support for parallel and distributed computation, with several examples.Chapter 7 is a summary of the thesis with emphasis on the thesis results andfuture areas of research.In general terms, the first three chapters of this thesis are introductory in nature. Chapter 1introduces the problem, Chapter 2 introduces the important terminology with respect to paralleland distributed computing, and Chapter 3 introduces Raven, the language and system that thework this thesis describes is implemented in. Chapter 2 also provides an overview of relatedwork.Chapters 4, 5, and 6 form the main body of the thesis, organized into three major parts.The first part, Chapter 4, describes the features to support parallel and distributed programming. Chapter 5, the second part, describes the implementation issues and problems associatedwith the implementation. 
The third part, Chapter 6, illustrates the use of Raven’s support forparallel and distributed programming with some programming examples. A more detaileddescription of the contents of the remainder of the thesis follows.Chapter 2 presents the background and related work relevant to the understanding of thethesis. It begins with a discussion of the distinction between distributed and multiprocessorenvironments (Section 2.1), which includes a discussion of the granularity of parallelism. Fromthere, a review of the various programming models used in developing parallel programs isundertaken (Section 2.2). An overview of the support for distributed and parallel programmingprovided by other systems then follows (Section 2.3). The systems examined are classifiedaccording to how they provide support for parallel and distributed computing. Some systemsextend sequential systems, some are based on distributed systems, some use language extensions, and some are object based. By examining these systems, a list of important featuresrequired to support parallel and distributed computations is created. This list forms the basisfor determining the feature set needed by Raven to support parallel and distributed programming.CHAPTER 1: Introduction 10Chapter 3 introduces the Raven language and system. The chapter starts by introducingthe passive and active object models (Section 3.1) and how the Raven programming model differs from these models. A detailed overview of the Raven language follows (Section 3.2). Theoverview consists of defining the various terms used to discuss programming in Raven and aprogram example to illustrate the features of Raven important to understanding this work. Todo this, the basic syntax of the Raven language is presented, along with an overview of the various components that comprise the Raven system. These components consist of the compiler,class library, and the underlying virtual Raven machine. Additionally, the concept of a localRaven environment and the support for distribution are also introduced.Chapter 4 introduces the key problems and the application level solution to supportingparallel and distributed computation. To provide an environment that handles the problemsassociated with parallel and distributed processing described earlier in Chapter 1, a number ofissues had to be addressed. The initial decision that needed to be made concerned the programming language or environment to be used. The choice was to extend the object-oriented language and system, Raven, to support the programming of parallel and distributed applicationsin shared-memory and distributed-memory environments. To do this, several key problems hadto be solved. The first problems concentrated on creating parallelism and determining what theunit of parallelism should be (Section 4.1). The strong data encapsulation presented by objects,combined with a well defined and restricted procedure interface (methods) to manipulate thedata strongly suggested that the method interaction point should be the location where parallelism is created, and that the unit of parallelism should be the method. This observation leadsto the introduction of class-based (Section 4.2) and user-based parallelism (Section 4.3). 
Classbased parallelism is parallelism that results from actions on the callee side of the invocation.The parallelism is expressed through the new language constructs of early and delayed results.User-based parallelism is parallelism initiated on the caller side and takes the form of certificates and companions. Certificates and companions are used to create, monitor, and manageparallelism on the caller side. Part of the ability to manage companions takes the form of invoCHAPTER 1: Introduction 11cation streams, which are a new way of imposing third-party sequencing on requests made ofthe same server.With the facilities for creating parallelism firmly established, Chapter 4 proceeds to discuss properties (Section 4.4). Properties are a new way of providing operating system servicesto individual objects. Properties stop the class explosion problem that results when the sameclass needs to make use of specific system services, such as concurrency control, yet was notinitially programmed to use those services. Properties also eliminate the need to provide amechanism that allows the programmer to make direct calls to the operating system for specificservices.One of the main motivating factors for the concept of properties was the need to supplyconcurrency control (Section 4.5) for objects that were not originally programmed to be usedthat way. The concurrency control provided in Raven is accomplished at the method level, andit permits multiple reader methods and a single writer method. Combining the object-orientedprogramming style encouraged by Raven with facilities to encourage use of parallelism createschallenging concurrency control problems. In particular, the problems associated with objectscalling methods on themselves, locking conflicts that result from using class-based parallelism,locks that need to cross machine boundaries, and the handling of deadlocks are explained andsolutions described.With the ability to easily create parallelism come problems associated with managing system resources (Section 4.6), in particular main memory and the number of active processes.Garbage collection is used to help manage memory and restrictions on the creation of processesare used to control the number of active parallel execution threads. The number of active parallel threads a system can have is considered a system configuration issue. By using thisapproach the programmer can write an application without regard to how many processors thetarget system actually has.CHAPTER 1: Introduction 12One of the goals of the work in this thesis is to provide a uniform application programming intçrface for programs that run in a shared-memory or distributed-memory environment.This is accomplished by using the object and method invocations as the unit of work. With system support for making an object location transparent (Section 4.7), the application cannot distinguish remote objects from local objects. This transparency is crucial to making it possible towrite applications that can run in shared-memory and distributed-memory environments without modification.Raven provides support to configure the running system through the System object(Section 4.8). 
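As a rough analogue of the method-level, multiple-reader/single-writer locking described above, the following sketch shows the idea in Python. It is illustrative only, not Raven's mechanism: the ReadWriteLock and Counter classes are invented for the example, and in Raven the equivalent protection is attached automatically through properties rather than coded by hand.

```python
import threading

class ReadWriteLock:
    """Many concurrent readers or one exclusive writer (reader-preference)."""

    def __init__(self):
        self._readers = 0
        self._readers_mutex = threading.Lock()
        self._writer_lock = threading.Lock()

    def acquire_read(self):
        with self._readers_mutex:
            self._readers += 1
            if self._readers == 1:          # first reader shuts out writers
                self._writer_lock.acquire()

    def release_read(self):
        with self._readers_mutex:
            self._readers -= 1
            if self._readers == 0:          # last reader lets writers proceed
                self._writer_lock.release()

    def acquire_write(self):
        self._writer_lock.acquire()

    def release_write(self):
        self._writer_lock.release()


class Counter:
    """value() is a 'reader' method and may overlap with other readers;
    increment() is a 'writer' method and runs with exclusive access."""

    def __init__(self):
        self._lock = ReadWriteLock()
        self._count = 0

    def value(self):
        self._lock.acquire_read()
        try:
            return self._count
        finally:
            self._lock.release_read()

    def increment(self):
        self._lock.acquire_write()
        try:
            self._count += 1
        finally:
            self._lock.release_write()
```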
Chapter 4 concludes with an overview of the facilities Raven provides for processing and handling exceptions and failures (Section 4.9).Whereas Chapter 4 introduces the various features of Raven to support parallel and distributed computation, Chapter 5 describes the underlying implementation of the Raven systemand details the various problems that were encountered and solved while implementing the support for parallel and distributed processing. The first topic covered is how objects are organizedin Raven (Section 5.1). Particular attention is given to how methods are invoked, how properties are associated with objects, and how remote objects are represented locally. When Ravenwas first run on a shared-memory multiprocessor, a performance bottleneck in the dispatchingof methods was encountered. The diagnosis of this problem and the resulting solution are alsodescribed in this section of Chapter 5.The memory allocation and garbage collection system used on a shared-memory multiprocessor also introduced some unexpected performance problems (Section 5.2). The extensive modifications made to improve the operation of the memory management system, alongwith proposals on how to improve the overall performance of the memory management systemare also discussed.The next section of Chapter 5 is devoted to a discussion of the role of threads(Section 5.3) in the Raven environment. All parallelism within Raven is accomplished throughCHAPTER 1: Introduction 13the creation and management of new threads of control. Threads are system level entities andthey are not used by the application. Applications achieve parallelism by using companions,early result, and delayed result. How these language level features are translated into real parallelism is described in this section.To support a uniform programming environment for shared-memory and distributed-memory machines, special system support is needed to make remote objects appear like localobjects (Section 5.4) and this topic is the last topic of Chapter 5. In providing this support, theproblems associated with copying objects, operating in a heterogenous processor environment,locating objects, managing locks across machines, sequencing invocation requests, and handling failures are discussed and addressed.Chapter 6 demonstrates the usability of the Raven system and its support for parallelismby describing the implementation of several applications and the performance results. Thechapter starts with a discussion of the performance metrics (Section 6.1) to be used, and continues with some specific examples (Section 6.2). Three examples of Raven’s performance ona shared-memory multiprocessor are presented. These examples, the Mandelbrot set, a bayesian search, and a prime number generator are each presented and the performance resultsdescribed. A distributed implementation of the Mandelbrot set computation is also discussed.Of particular note is the fact that the shared-memory and distributed-memory implementationsare identical except for a few lines of code involved with system configuration issues. The distributed Mandelbrot example was run in heterogenous environment. To illustrate that Ravencan be used to develop larger distributed applications, a distributed mail application isdescribed. 
Taken together, these applications illustrate Raven’s support for parallelism throughcompanions, early result, delayed result, and the ability to provide a uniform operating environment for shared-memory and distributed-memory environment.Chapter 7 presents a summary of the results (Section 7.1). The summary includes areview of the way parallelism is supplied in the Raven language and of the system support sup-CHAPTER 1: Introduction 14plied for parallelism. The thesis closes with a description of the research contributions of thisthesis and the identification of some future areas for research.In summary then, Chapter 2 presents an overview of work done in parallel and distributedsystems relevant to this thesis and sets the basis for the work. Chapter 3 introduces the Ravenlanguage and system. Chapter 4 describes the new features for supporting parallelism providedby Raven. Chapter 5 describes the implementation, with Chapter 6 describing the sample problems programmed in Raven. Chapter 7 presents a review of the thesis work. The relevance ofthis work, its contributions to the body of computer science research, and future research areasare highlighted.CHAPTER 2 Background andRelated WorkThe purpose of this chapter is to provide a foundation for the discussion of the work presentedin the remainder of this thesis. To accomplish this, the chapter first discusses the differencesbetween distributed and multiprocessor programming environments. Since the granularity, orunit of parallelism will affect how a parallel application is written, the next section is devotedto a discussion of granularity. The following section provides a design approach for creating aparallel program. It starts with a description of the steps involved in creating a parallel application and concludes with an enumeration of some of the common approaches to organizing aproblem to achieve parallelism. To this point, the chapter has provided the background materialused when discussing parallelism, and insight into ways to design parallel programs. Buildingon this material, the next section reviews and classifies some systems that support parallel anddistributed programming. The chapter closes with a summary of the material that was presented.15CHAPTER 2: Background and Related Work 162.1 Distributed and Multiprocessor EnvironmentsWhen dealing with parallel and distributed computation, there are two main types of programming environments. One type is a distributed environment and the other is a multiprocessorenvironment. Each of these environments presents a particular system view to the programmer,with the result that the similar problems are solved using different approaches, and the environments themselves affect the sort of problems being solved. Bal et al [171 define a distributedsystem as follows:A distributed computing system consists of multiple autonomous processors that do not share primary memory, but cooperate by sendingmessages over a network.A further elaboration of this definition explains that each system executes its own instructionstream and has its own local data. By its very nature, such a configuration is subject to partialfailures where one part of the system fails and the others continue to run. 
These types of systems also tend to be loosely coupled, and there is the possibility that such a system is heterogenous and consists of multiple processor types.This is contrasted to a multiprocessor, where processors communicate through shared-memory, like the Sequent [89], or through fast processor to processor connections, such asmight be found in a Transputer farm [54] [42] or Intel Hypercube machine [58]. These systemsdiffer from a distributed environment in that they tend to consist of tightly coupled homogenous processors. The processors and interconnections are not as autonomous as a distributedsystem and the system lacks independent failure modes. That is, if one processor node or memory component fails, the whole system fails, just like a uniprocessor machine.The types of programming models used in these two environments is often quite different. Distributed-system software tends to use remote procedure calls to communicate betweenthe different processors, with the result that the applications remain sequential in nature.Remote applications are often involved with the managing or sharing of resources amongst acollection of machines; some examples of this are the sharing of printers, providing file accessCHAPTER 2: Background and Related Work 17services, or maintaining a database. A distributed application tends to be much more concernedwith the handling of failures than multiprocessor applications. When a component of a distributed system fails, the application needs to have some level of fault tolerance. Using printing asan example, if the print server fails client processors should not crash. It should, instead probably wait for the print server to become available again. Since distributed environments can beheterogenous, a distributed application must also address the problems of exchanging data andrequests between machines with different internal data representations. Increases in performance of a distributed system are often more a by-product of the organization of the applicationthan by design.In contrast, applications developed for a multiprocessor environment, and in particularfor a shared-memory environment, tend to have as their primary goal increasing the performance of the application. In a distributed application an increase in performance may result foran application, but that is not the sole goal. Especially in a shared-memory multiprocessor, theapplications, typically, do not concern themselves with issues associated with heterogenousenvironments or fault tolerance.Obviously, the physical constraints of an environment play an important role in the system and affect how applications are developed in distributed and multiprocessor environments.However, these systems do have to address many of the same sorts of problems. For example,how are other entities in the computation located and communicated with, how are sharedresources protected from improper concurrent access and how are the various components ofthe application defined, created and assigned to processors.2.1.1 Granularity of ParallelismAnother consideration when discussing parallelism is the unit of parallelism. This is referredto as granularity. Bal et al [17] provide a definition for the granularity of parallelism. Theydefine a parallel application’s granularity by the amount of computation done between communications. 
With such a definition, the natural inclination is to consider communications to meanCHAPTER 2: Background and Related Work 18the exchange of messages between processes running in a distributed memory environment.This is not the case, since a communication can take place in a variety of ways. In particularthe communication can consist of the exchange of information through shared memory, or thesynchronization of processes.The terms fine-grained and coarse-grained are used to describe the length of time betweencommunications. In a fine-grained parallel application the time between communications issmall; in a coarse-grained parallel application the time between communications is large. Fora fine grained application a communication might occur after every instruction, whereas in acoarse-grained applications the communications may be separated by hours. The hardware anapplication is using will, obviously, have some effect on the granularity of parallelism. Forexample, a distributed environment is not likely to support an extremely fine-grained model ofparallelism, since the cost of synchronizing across the network would be high compared to theamount of computation done between communications.2.2 Parallel programming modelsThere are several ways that parallelism can be used to solve a problem. The technique that Car-nero and Gelertner [29] advocate for writing parallel programs consists of the following steps:• First choose an approach to parallelism that most naturally maps to the problem.• Write the program based on this choice.• Run the program, and, if it is not efficient enough, transform the solution into amore efficient solution by changing the approach to parallelism.This is the approach to parallel programming advocated in Raven. It is important to have a goodunderstanding of a problem and its solution, and one of the best ways to achieve this is to implement the solution in its most natural form. By analyzing the results and the way the programbehaves, these additional insights into the problem can be used, if needed, to modify the solution to be more efficient.CHAPTER 2: Background and Related Work 19Carriero and Gelertner also present three common ways to organize applications toachieve parallelism. These approaches are classified as follows:1. Result parallelism which focuses on the finished result.2. Agenda parallelism which focuses on the list of activities that must be performed.3. Specialist parallelism which focuses on specialized processing that needs to beperformed.This classification is based on the techniques used to extract parallelism from a solution.Although these forms of parallelism are identified, an application seldom uses a single form ofparallelism. An application will tend to have one dominant form of parallelism, with the otherforms of parallelism being present in varying degrees.With result parallelism, the focus of the approach is on the format that the final result isto take. Parallelism is achieved by computing all the component parts of the result simultaneously. Result parallelism is often used when the result consists of a series of values that arepredictably organized and/or have well defined dependencies. When agenda parallelism isused, the application concentrates on producing a list of activities that need to be performed fora solution to be reached. Parallelism is then achieved by using multiple worker processes.These workers may cooperate to complete one agenda item or, if possible, several agenda itemscan be worked on simultaneously. 
Finally, with the specialist approach, a group of specializedworkers are connected together to form a logical network. Parallelism results when multiplespecialists are active at once. Typically, each specialist is responsible for some sort of computation and then moves the results of its computation, via the network, to the next specialist.The three formats for organizing parallelism described in the previous section provide abroad framework and general approach that can be used when trying to write a parallel program. They do not, however, provide any guidelines on how to recognize the parts of an application that can be parallelized and the techniques that can be used to extract parallelism. LesterCHAPTER 2: Background and Related Work 20[65] provides a summary of some of the more common approaches to extracting parallelismand how to recognize it. A brief overview of a some of these techniques follows.2.2.1 Data ParallelismIt is often the situation that the form of the input data and output data has a strong influence onthe approach to achieving parallelism. The term data parallelism is applied to approaches wherethere are a large number of data objects subject to similar processing. To accomplish this, thecomputations on the data items need to be independent. The manipulation of large vectors ormatrices, typical of many numerical algorithms, can often make good use of data parallelism.2.2.2 Data Partitioning ParallelismClosely associated to data parallelism is parallelism that results from data partitioning. Withdata partitioning, the data space is divided into regions, and each region is assigned a processorso that the computation can proceed in parallel. What constitutes a region usually has some natural definition within the problem being considered. This region might be a block of a matrixin a numerical algorithm, or it might correspond to a region of space in a simulation of a gas.This approach differs from the data parallel approach in that there is the occasional need toexchange data across region boundaries. Except for these occasional data exchanges, each processor concentrates on performing the computation within its region.2.2.3 Synchronous Iteration ParallelismAnother technique for parallelism, closely associated to the ones already discussed, is synchronous iteration. In both data partitioning and data parallelism, each processor proceeds on itsown with minimal interaction with the other processors. In synchronous iteration, the samecomputation is performed on each data item, this is similar to the approach used by data parallelism, except that after each step in the computation all the processors synchronize andexchange data. It is the use of the results of one step of the computation by other processorsCHAPTER 2: Background and Related Work 21performing subsequent steps of the computation that mandate the synchronization, and differentiate synchronous iteration from both data parallelism and data partitioning.2.2.4 Replicated Worker ParallelismWith replicated workei; or processor farm, techniques for parallelism, there exist a number ofworker processes that are constantly requesting work to perform from a pool of possible tasks.The algorithm terminates when all the tasks in the pool have been completed. The approachproceeds by having each processor ask for a task from the pool, perform the task, and possiblyregistering the results before asking for the next task to perform. 
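The data-parallel and data-partitioning styles described above can be sketched briefly. The Python example below assumes an invented per-row smoothing computation: each row of the matrix is an independent region, so the same function is applied to every row in parallel with no communication between partitions.

```python
from multiprocessing import Pool

def smooth(row):
    """Work on one partition (a row): a 3-point moving average.
    No row depends on any other row, so rows can be processed in parallel."""
    result = []
    for i in range(len(row)):
        window = row[max(i - 1, 0): i + 2]
        result.append(sum(window) / len(window))
    return result

if __name__ == "__main__":
    matrix = [[float(r * 10 + c) for c in range(10)] for r in range(8)]
    with Pool(processes=4) as pool:
        smoothed = pool.map(smooth, matrix)   # one task per partition
    print(smoothed[0])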
During the execution of a task,a processor may generate new tasks that need to be computed and these are added to the pool.Tree and graph searches can often make use of this approach to parallelism, since this approachis well suited to problems where the amount and nature of computation is dynamic and cannotbe accurately determined in advance.2.2.5 Pipelined ParallelismThe final technique for parallelism that will be introduced is pipelined parallelism. When usingthe pipeline approach to parallelism, the processors to be involved in the computation arearranged into a regular interconnection pattern, of which a mesh, ring, and pipeline are somecommon types. The computation is broken into a number of different component tasks, andeach one is assigned to a processor. Just as in a pipeline, the data flows through the network ofprocessors with each node performing some computation on the data. Parallelism results whenmultiple stages of the pipeline are executing simultaneously. Execution on a problem proceedsby having the first stage acquire the information needed to start processing. It processes untilits allotment of work is finished, and then it passes it to the next stage. While the second stageof the pipeline works on the problem, the first gets more work to perform. This approach to parallelism is like an assembly line, and has similar concerns. For example, to maximize parallelism each stage of the pipeline should use the same amount of time as the other stages. If this isCHAPTER 2: Background and Related Work 22not true then some stages will have no work to perform while other stages will have work queuing up, and the parallelism will not have been maximized.Although the approaches and techniques for dealing with parallelism have been presentedas being disjoint, in many situations more than one technique will be used. A solution mightcombine the approaches of data parallelism with replicated workers, or perhaps replicatedworkers will be the front end for a computation also using pipelining.2.3 Supporting Parallel and Distributed ComputingThe previous sections provided a broad overview of what constitutes parallel and distributedcomputing and some of the design techniques and approaches that can be used when developing a parallel or distributed application. This section presents an overview of variousapproaches to addressing the problem of actually realizing a design in a programming system.The various approaches have been divided into the following five groups:1. Extending existing sequential systems.2. System systems for distributed processing.3. Changes or adaptations of existing languages to support parallelism.4. Systems for parallel and distributed processing based on Smalltalk.5. Support for parallelism in object-oriented systems that are not Smalitalk-based.Each one of these systems, or approaches, will be briefly described with particular attention being paid to the issues of the granularity of parallelism, how parallelism is created, andthe level of support for concurrency control. Taken together, these systems will provide anoverview of some of the techniques for supporting parallel and distributed processing and howthese techniques have evolved.CHAPTER 2: Background and Related Work 232.3.1 Extending Sequential SystemsIn this section techniques for achieving parallelism that exploit existing sequential systems areprovided. 
By taking a sequential system and extending it, experience on providing services forparallelism is gained, and this serves to drive the development of new approaches to managingand supporting parallelism. The first three examples describe changes that have been made tothe procedure calling mechanism to support parallelism. The final example describes the support for parallelism available on multiprocessors.2.3.1.1 Parallel RPCsThe procedure call mechanism is a well understood programming tool. In distributed systems,the remote procedure call (RPC) has been developed to manage inter-process communication.Work by Martin [711 [72] [73] has concentrated on extending a single remote procedure call torun on multiple machines.Two extension to the remote procedure calling interface were made. These are:1. A mechanism to identify N remote hosts to execute the remote procedure callwas added.2. An optional results statement, which is given control each time one of the N procedure calls terminates, was introduced.The general form of usage is to have the remote procedure call specify both the processorsto make the call to, and the result statement to be executed. As each call completes, the resultsstatement is given control. This mode of processing continues until either all the calls havecompleted or the decision to break out of the wait is made. When a results statement is brokenout of, the results from any calls which have not completed are lost. The only way results canbe returned is through the parameters of the remote procedure call. Therefore, these calls cannot be used as functions. The completion of an RPC and the execution of its results statementare atomic.The advantage of using this approach is that the procedure call, and by implication theremote procedure call, is a language construct that is well understood by the programmer. Con-CHAPTER 2: Background and Related Work 24ceptually, there is only one thread of control, thus making it easier to understand what is happening. Parallelism in this system is coarse-grained and only accomplished when the sameremote procedure call can be targeted at multiple sites. Synchronization support is supplied tomanage the potential simultaneous return of results, however, concurrency control on the RPCserver side remains the responsibility of the programmer.2.3.1.2 Streams and PipesGifford [46] [47] [48] also looked at remote procedure calls and suggested changes to improvetheir speed while at the same time increasing the parallelism between the caller and callee. Gifford recognized that one of the bottlenecks to parallel processing using the remote procedurecall model occurs when the caller blocks waiting for the procedure to complete. To overcomethis problem, the pipe is introduced. A pipe is essentially a remote procedure call, except it doesnot block the caller nor does it return any results. Because the caller does not block, parallelismbetween the client and server results when a pipe is used. The granularity of parallelism iscoarse-grained and at the procedure call level. Concurrency control is the responsibility of theprogrammer and it is assumed that something like monitors [53] is used.In this system, remote procedure call is kept and optimized for low latency. Since noordering is implied between multiple requests not directed at the same pipe, the group andsequence constructs were introduced to allow a mechanism to specify the ordering of requestson a channel. (A channel is the connection between the source of the request and the destination. 
Both pipes and RPCs use channels.) By splitting the familiar remote procedure call modelinto two different classes it is possible to develop and optimize protocols for the function theyare to perform. RPCs can be optimized for quick sending and receiving of results, and pipescan be optimized for bulk data transfer.2.3.1.3 Promises and ClaimsArgus [66][67][68] incorporated both the futures notion from Multiisp (Section 2.3.3.1) andGifford’s work with pipes into promises and claims [69]. Like a future, a promise allows theresults of a call to be claimed at a later time. To use a promise, the notion of a call stream isCHAPTER 2: Background and Related Work 25introduced. A stream is like a pipe in that the sender does not wait for replies. Unlike Gifford’swork, however, a call on a stream may return a result. To capture this result each call returns apromise that can be used to claim a result in the future. A promise can be viewed as an objectlike data structure reserving space for the results of the method invocation, which will be arriving at a later time. To get the results of the promise, a claim call is made on the appropriatepromise and the process blocks until the promise is fulfilled. Other functions to check on thecompletion status of a promise, and functions to iterate and wait on collections of promises, arealso provided. The promise and the claim provide a way for RPCs to proceed in parallel whileat the same time allowing results to be returned. This overcomes the restriction of the previously described parallel RPCs being unable to execute different types of procedure calls, andthe inability of pipes to return results. Parallelism, therefore, results from the ability to overlapclient and server processing. The parallelism is at the procedure call level and is coarse-grained.Being written in Argus, the protection required in the presence of concurrent accesses to objectsis automatically provided through Argus’s atomic transaction facilities.2.3.1.4 MultiprocessorsThe previous examples have concentrated on some of the methods that have been used on distributed systems to provide parallelism. Similar incremental techniques have been in use toextract performance improvements in multiprocessor systems. This processing environmentcan be described as one in which more processors are added to an existing general purpose system while hiding the multiple processors from the user and continuing to present a uniprocessorenvironment. Parallelism is accomplished by having multiple coarse-grained processes runsimultaneously on the individual processors. Typical of the machines in this class are theSequent machines running a variant of the UNIX operating system [96] [89] and Helios [42]running on transputers.These machines typically consist of some number of processors on a common high speedbus, along with some amount of shared memory. When a process is ready to run, it is assignedto one of the processors. Parallelism is achieved by having multiple processors active at theCHAPTER 2: Background and Related Work 26same time. This leads to two major techniques for using parallelism on these machines. One issystem wide parallelism that can easily be achieved by assigning the various runnable processes in a time shared environment to different processors. 
Although this may increase the throughput of the system as a whole, the elapsed time for any one process will not be less than if the machine were idle and consisted of a single processor.

A second type of parallelism, one that a user is more likely to encounter, results from the UNIX structures of pipes and background processes. In this case the background processes or constituents of a pipe are assigned to separate processors, if possible, in an attempt to take advantage of the parallelism implied by the user. In these situations it is quite likely that a user will experience a speedup in his job mix attributable to this parallelism.

However, it can be argued that the above really is not what is meant by programming a parallel system. In the multiprogrammed environment of UNIX and other operating systems, the pseudo-parallelism, provided by the operating system in an attempt to make better use of a single processor, is simply extended to use more than one processor. The creation of multiple processes, in the background or in a pipeline, does not necessarily imply parallelism. Typically, the process model of programming is being applied and processes are being created to handle logically separate components of a task. These separate processes are part of a design methodology that makes it easier to construct a system by interconnecting smaller, easier to debug programs, and not a technique intent on improving system performance through parallelism. The pipeline form of parallelism represented by this approach is stylized and restricted in functionality.

To allow the user to take direct advantage of the shared memory and multiple processors, the operating system typically supports some form of threads package [89][70]. These packages provide ways for the program to create new threads of control and services to allow processes to synchronize within a shared address space.

A logical progression from the shared-memory approach is to make the processors self-contained with their own memory and then provide some form of interconnection. An example of this is the Intel Hypercube class of machines [16] [57] [58]. Each processing node is independent and requires programmers to explicitly write and assign code to the individual processors. Parallelism is then achieved by having the multiple processors work concurrently. Simple communications primitives like send and receive are used to exchange data between processors. To achieve parallelism great care must be taken to ensure that once a processor sends a message it is possible for both the sender and receiver to execute simultaneously.

In all of these systems, the granularity of parallelism would be described as relatively coarse-grained, since it tends to be at the process level. The creation of parallelism is done explicitly by the programmer and data protection is also the explicit responsibility of the programmer.
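In the shared-memory case, for example, a minimal program in this style creates its own threads and supplies its own locking. The sketch below uses POSIX threads purely as a representative interface; the thread packages cited above offer similar, though not identical, primitives.

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;                         /* shared data */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *work(void *arg)
    {
        int i;
        for (i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);               /* explicit concurrency control */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        int i;

        for (i = 0; i < 4; i++)                      /* parallelism created explicitly */
            pthread_create(&t[i], NULL, work, NULL);
        for (i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        printf("final count: %ld\n", counter);
        return 0;
    }

Both the decomposition into threads and the protection of the shared counter are visible in the application code; nothing in this example is supplied automatically by the system.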
2.3.2 Distributed Systems
Programming distributed memory systems is characterized by two notable problems. One problem is deciding where to place processes; the other is establishing a mechanism that simplifies identifying and communicating with the processes of the application. The systems examined in this section explore the issues associated with providing solutions to these problems.

2.3.2.1 SR
In a distributed system, RPCs are often too inflexible for efficient distributed computation, and pure message passing can be difficult to program. SR (Synchronizing Resources) [8] [9] [10] investigates these problems through processes, resources, and a full complement of message passing primitives. In SR, a resource possesses some of the abstract qualities of an object by providing encapsulation of data and specifying the operations that must be used to interact with this data. A resource is comprised of one or more processes, which represent the operations that a resource is willing to perform. All the processes associated with a resource must execute on the same processor, and this leads to the property that all the processes within a resource, and only those processes, may share data. This organization captures aspects of both the distributed programming environment and the shared-memory multiprocessor environment. A single resource could be viewed as a multiprocessor, and multiple resources can be viewed as a distributed environment.

In SR, communication is accomplished on the sender side using the call and send statements. These correspond to synchronous and asynchronous message sends, respectively. Each call, or send, action specifies the name of the request and its parameters. On the receiving side, the requests are accepted with the in statement, which specifies the name of the request to receive and the parameters to be retrieved. The in statement also has an optional guard clause that can be used to control or prioritize the reception of requests. The results of a request can be returned either through a result statement or a reply. When a reply is done, the result is returned to the sender and execution continues with the statement after the reply.

SR also has a co/oc delimited block to specify invocations to be done in parallel. Each invocation can also have a post-processing block associated with it. This allows the programmer to directly specify which activities are to be executed concurrently. Concurrency control on any shared resource is the responsibility of the programmer, and SR provides semaphores for this purpose. The granularity of this system is coarse, with parallelism being at the procedure call level. Although the user may create new processes, a substantial component of an SR application is static, with the processes being defined at start time.

2.3.2.2 Poker
Researchers at the University of Washington recognized that another difficulty with distributed memory systems was assigning processes to processors. To explore these problems, they developed the Poker system [81][94][95]. The Poker system consists of both hardware and software, with the hardware being a collection of processors with fixed interconnects to adjacent processors. One of the basic underlying assumptions of the research project was that the parallelism in a program is best handled explicitly by the programmer.

The designers of Poker identified five functions common to the specification of parallel programs for MIMD machines. These are:
1. The specification of a communication graph to indicate how the various processing elements are interconnected.
2. The specification of the program(s)/code that will ultimately be loaded onto a processing element.
3. The specification of the assignment of object code to processing elements.
4. The assignment of names to the edges (wires) connecting processing elements. This specification allows a mapping between the named edges used in the program and the physical wires to be carried out.
5.
A specification of where input is to be taken from and where output is to beplaced.Although these points are valid, they highlight the problems that prevent parallelismfrom moving into common usage. Programmers are not particularly interested in taking on thetask of specifying the communications graph, decomposing the problem into parts for assignment to processors, any more than they are in specifying where in main memory their programshould reside. The addition of these extra dimensions to the programming task detracts fromthe usefulness of these parallel environments for programming general purpose tasks.In addition to these types of problems, there were also restrictions on the how the processors could be organized. For example, the processing elements had to be placed on a lattice,thus restricting the interconnection patterns. Only one sequential process can be assigned toeach processing element: that is, it is not possible to have multiple process on the same processor. The Poker system is hardware-driven, forcing the user to be intimately involved in thespecification of a program’s parallelism. Parallelism in this system results by having multiplenodes active at one. Since there can be only one process active on a node, and since the processes must explicitly accept incoming messages, concurrency control is implied. The granularity of this system will vary depending upon the speed of the communication primitives. Thesystem would not be defined as fine-grained, however, the time between communications isCHAPTER 2: Background and Related Work 30much smaller than the systems examined to this point and is probably best classified as mediumgrained.2.3.2.3 MUPPETIn many respects MUPPET [77], the multi-processor programming environment of the WestGerman SUPERNUM computer project, is a lot like Poker; but, instead of facilitating thedevelopment of programs for a particular hardware configuration of a machine, MUPPETattempts to remove this restriction through the introduction of abstract machines that hide theunderlying processor interconnection patterns. With MUPPET, when a program is designed,the programmer is permitted to specify a collection of abstract machines and the interconnection pattern to be used for processing. These abstract processors, called LAMs (local memoryabstract machines), represent virtual disjoint processors that do not share memory. Each LAMprovides facilities for message passing, process creation and process destruction. Once a program has been specified for a given collection of LAMs the MUPPET software will attempt toconfigure that representation on the available set of processors. Depending upon the relationships between the abstract machine specified, and the actual machine, this mapping may be anon-trivial problem.The major difference between Poker and MUPPET is that in Poker one specifies theactual machine configuration to be worked with, and designs an algorithm for that configuration. In MUPPET, one designs an algorithm for an abstract machine and maps the abstractmachine to the physical processors. The effective parallelism will be a function of how manyLAMs can be assigned to different processors. Concurrency control remains the responsibilityof the programmer.2.3.2.4 PVMPVM [43] [44] stands for Parallel Virtual Machine, and has as its goal the desire to make a collection of machines on a network appear as a single concurrent computation resource. PVMdoes many of the same things as Poker, except it does not assume any special underlying hardware. 
PVM does assume:
• A collection of heterogeneous machines accessible through a network.
• Programming using an imperative language.
• Access to system services through procedure calls.
• Operating system support for limited inter-process communication within a machine.
• Support for the unreliable delivery of data between machines.

PVM can basically be described as a software system that tries to make a collection of machines appear like one. Support for this is provided through a library package that supports the creation of processes and the sending of messages within the PVM environment. This is partly accomplished through a daemon process run on each node in the PVM system. These daemons essentially cooperate to keep each other informed as to the state of their part of the PVM. Each PVM daemon can be viewed as a mini operating system that maps the PVM calls, for things like process creation and message passing, from the PVM form to the local machine. Parallelism in PVM is coarse-grained at the process level, and explicitly created externally by the PVM user or internally within a program. Being a message-based system, concurrency protection is implicit since a process encapsulates data within its address space. Shared resources outside of a process's address space must be managed as in any other system.
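The shape of a PVM program follows directly from this description. The fragment below sketches the master side of a simple program written against the PVM 3 library calls; the worker executable name, the number of tasks, and the message tags are invented for the example, and the matching worker program is not shown.

    #include <stdio.h>
    #include "pvm3.h"

    int main(void)
    {
        int tids[4];                      /* task ids of the spawned workers */
        int i, work, result;

        pvm_mytid();                      /* enroll this process in the PVM */

        /* ask the PVM daemons to start four copies of the worker executable */
        pvm_spawn("worker", (char **)0, PvmTaskDefault, "", 4, tids);

        for (i = 0; i < 4; i++) {         /* hand each worker a piece of work */
            work = i;
            pvm_initsend(PvmDataDefault);
            pvm_pkint(&work, 1, 1);
            pvm_send(tids[i], 1);
        }

        for (i = 0; i < 4; i++) {         /* gather the results as they arrive */
            pvm_recv(-1, 2);
            pvm_upkint(&result, 1, 1);
            printf("result: %d\n", result);
        }

        pvm_exit();                       /* leave the virtual machine */
        return 0;
    }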
2.3.3 Language Extensions or Adaptations
The previous sections have all examined providing support for parallel and distributed computation primarily from the system level. That is, the support for parallelism encompasses something about how the complete system is built. In this section some changes to programming languages to support parallelism will be looked at. The first two examples, Multilisp and Actors, look at language-level support for parallelism in a fine-grained environment. The third example, Linda, presents language-level parallelism from a coarser-grained perspective.

2.3.3.1 Multilisp
Multilisp [50][51][52] is a variant of Lisp that explores the possibility of evaluating multiple expressions in parallel. The initial parallel construct in Multilisp, pcall(), is used in the following manner:

    (pcall F A B C ...)

The expressions F, A, B, C and ... are evaluated in parallel, with control passing to the next expression when all the expressions have been evaluated. Another way to exploit expression parallelism is to separate the computation of a value from the point at which it is used. This type of concurrency is captured by futures. An example use of future is:

    (cons (future A) (future B))

In this situation the computation of A and B is started in parallel with the main flow of control, but the results are not needed immediately. At some later time, when the actual values are required, either they will be ready or they will still be being computed. If they have not finished, then the computation will stop and wait for the results to be returned; otherwise, the computation can continue immediately. Concurrency is explicit through the pcall and future functions and is relatively fine-grained. Concurrency control is not an issue in a language like this since only values are manipulated.

2.3.3.2 Actors
The Actor [2][3][4][40] languages also have a Lisp-like flavour and are designed to support large-scale concurrent symbolic computation. These systems are characterized as being highly dynamic, with constantly changing structure and computing requirements. One of the assumptions for parallelism in the later Actor languages is that all the expressions forming part of a function can be evaluated in parallel. Everything in an Actor system (this includes functions, data and messages) is an actor. Expressions and functions are evaluated by sending messages to the message queue associated with the actor specification for that function. These messages contain information indicating exactly what function is to be performed. To actually do something, each actor has a script which tells it what to do when a message is received. An actor may work on only one message at a time, and as soon as an actor is finished its computation it disappears.

As part of its execution an actor computes a replacement for itself. In many cases the replacement actor is a copy of the existing actor, but there is nothing to prevent the replacement from being a totally different actor with a totally different behavior. The message queue associated with an actor is inherited by its replacement. Some control over parallelism can be exercised by specifying when the replacement actor is created. If the replacement is specified immediately, then a message from the inherited queue can be retrieved for processing and computation between the original actor and its replacement can proceed in parallel. If instead the specification of the replacement is delayed until the computation is nearly completed, then there is little opportunity for parallelism.

Actor languages have the advantage that parallelism is automatic; however, no ordering can be imposed on the sending of messages, thus increasing problems resulting from nondeterminism. Also, under certain circumstances some parts of an expression may continue to be evaluated even after the value of the expression is known, thus wasting computing resources. (Example: two expressions are ORed together. One of the expressions evaluates to true while the other is still computing. At this point the value of the expression, true, is known and there is no need to continue to evaluate the other expression.) The actor approach is targeted towards functional languages which will run on large fine-grained machines. Unlike Multilisp, the parallelism is implicit in the Actor languages since every expression is viewed as running in parallel, and like Multilisp concurrency control is not an issue since computations are always performed on values.

2.3.3.3 Linda
The Linda programming language [64] [5] [28] [29] [45] has adopted the approach of adding new functions to an existing programming language, along with underlying software support, to provide parallelism. The goal of Linda is to achieve parallelism by allowing the user to concentrate on programming and not on the hardware. When programming, the user is expected to explicitly decide which parts of a program can be run in parallel. The Linda language then provides the mechanism to implement this parallelism.

In Linda the programmer is presented with a tuple space on which processes perform operations in an effort to achieve parallelism. The tuple space is common to all processors in the system. Linda provides operations to add, remove or read tuples from the tuple space. One justification for this computational model is that processes in a parallel environment should not concern themselves with how other processes are using data. They should essentially take some data, compute with it, and then, maybe, generate some new data for another process to work with. The parallelism is then achieved by having multiple processes perform requests of the tuple space. The programmer must either create these new processes within the application, or start them and then associate them with a running tuple space. Linda can also be viewed as a collection of global mailboxes that are emptied and filled with data. The fields comprising a tuple serve as the mailbox or tuple identifier.
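The flavour of these operations can be seen in the following C-Linda fragment, in which a master deposits task tuples and withdraws result tuples produced by a fixed set of workers. The tuple names, the worker function and the number of workers are chosen purely for illustration, and the code requires a Linda preprocessor rather than a plain C compiler.

    real_main()                            /* C-Linda master process */
    {
        int i, part, total = 0;

        for (i = 0; i < 4; i++)
            eval("worker", worker(i));     /* create a worker process */
        for (i = 0; i < 4; i++)
            out("task", i);                /* deposit a task tuple */
        for (i = 0; i < 4; i++) {
            in("result", ?part);           /* withdraw a result tuple, blocking if none exists */
            total += part;
        }
    }

    int worker(int id)
    {
        int t;

        in("task", ?t);                    /* remove any one task tuple */
        out("result", t * t);              /* deposit a result for the master */
        return 0;
    }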
Parallelism in Linda is coarse-grained and at the process level. Parallelism is achieved by using the tuple space as a buffered send area. The tuple space, which is the only shared data structure in Linda, is maintained in a consistent state by the system by ensuring that the operations to add, read, and remove tuples from the tuple space are atomic.

2.3.4 Smalltalk-Based Systems
Object-oriented systems, with their strong data encapsulation and well-defined method interface, seem like they should provide good environments to support parallelism. The classical object-oriented programming language, Smalltalk, has also been the target of modifications to support parallelism. This section will look at some of the ways that Smalltalk and Smalltalk-derived languages have been used to support parallelism. The first system examined is an implementation of Smalltalk for a distributed environment; the second system examined is an implementation of Smalltalk for a fine-grained environment; and the third system examined is an implementation of Smalltalk for a multiprocessor.

2.3.4.1 Distributed Smalltalk
Distributed Smalltalk [19][20] is a system that allows objects on different machines to exchange messages. It also permits objects to be shared amongst different users. The functionality of Distributed Smalltalk can be characterized as being similar to the situation of users sharing files or server processes in a more general purpose environment. It lacks the tight coupling and cohesiveness of an environment designed for parallel programming. The supported parallelism is extremely limited, is system wide, and is available only at the user process level. Individual users cannot have multiple threads of control active at once and the programmer is responsible for controlling concurrent access to objects.

2.3.4.2 CST
CST, which stands for Concurrent Smalltalk [36][37][104], is a proposed dialect of Smalltalk developed to run on the J-machine. The J-machine is a fine-grained concurrent computer with a large number of special purpose processing nodes. The processing nodes each have 4K words (36-bit words) of memory, each forming part of a global virtual address space, and a special purpose communications controller to perform message routing. Concurrency is achieved in this system through distributed concurrent objects, asynchronous message sends and multiple concurrent methods on an object.

This system is heavily reliant upon the global virtual address space to support a virtual object space. Invocation on objects can be supported by moving the objects to the requesting location or by sending the request to the object. The virtual object space allows for easy location or movement of objects.
Depending upon the nature of the request, the method will be transferred either to the object, or the object to the request.

One of the ways to achieve parallelism in CST is to use a distributed concurrent object. A distributed concurrent object is an object which has a number of constituent parts, which are distributed over the collection of processing nodes. Some Smalltalk objects that fit this classification are arrays, collections and bags. The creation of a distributed object requires a list of nodes at which constituent objects are to be placed. The distributed objects are constructed, by the system, in such a fashion that when a method is sent to a distributed object it may be delivered to any one of the constituent objects. A distributed object is viewed as being composed of many objects, each of which may be processing a method, thus achieving parallelism. In this situation it is the programmer's responsibility to control access to any of the shared data that the distributed objects may be trying to modify.

A second way to achieve parallelism is through asynchronous message sends to objects. This allows several methods to be sent without waiting for replies. Finally, multiple methods may access an object in parallel. Lock primitives are provided to allow the programmer to protect the integrity of data in this situation.

2.3.4.3 Actra
Actra [102] [103] [18] is an Actor-based extension to the Smalltalk programming language. The design goal is to provide a multi-user, multiprocessor, object-oriented program development environment. Unlike the Actor model proposed by Hewitt, these Actors are of a much coarser granularity. Actra is intended to run on a shared-memory multiprocessor. It is hoped that this uniform memory view will simplify referencing and communication between objects.

In Smalltalk, processes are not permitted to receive messages, thus limiting the way in which they can be used. To overcome this problem in Actra, a new class of objects, actors, is introduced. An actors object is a community of objects. Each new actors object is implemented as a separate task. Multiple actors residing on separate processors may execute in parallel. Communication between actors (tasks) is accomplished by adding the blocking send/receive/reply methods to the actors class. The blocking nature of these new constructs maps easily to the way methods are normally invoked in Smalltalk. This permits the interaction between objects and actors to have the same surface-level appearance.

Parallelism is achieved by permitting the receiving actors object to perform a reply and then continue processing. Only one method at a time may be active in an object. With only one object method active at a time, there is the possibility that long-lived requests could prevent other methods on the object from executing. As a result, a mechanism allowing a method to voluntarily relinquish control is provided. The system can be viewed as a collection of Smalltalk environments, each running on its own processor. These environments then communicate with each other through actor objects and are, perhaps, better thought of as distributed environments that occasionally exchange data or make requests of one another.

2.3.5 Other Object-Oriented Systems
Object-oriented systems and programming languages offer such attractions as code reuse and a structured programming environment to the programmer.
Smalitalk, however, is often viewedas being too inefficient for production programming. In an effort to overcome these problems,new languages and systems that provide many of the benefits of an object-oriented languagehave been developed. In this section four approaches, SINA, POOL-T, Eiffel and Choices areexamined.2.3.5.1 SINASINA [15] [105] is an object-oriented language being designed to study issues associated withconcurrent and distributed programming. In general terms, SINA would have to be regarded asbeing typical of large grained object-oriented languages. An object consists of its instance dataand methods. Method invocations are realized through the send, receive and reply messagepassing primitives. Each method in an object is associated with a queue of message invocationrequests. Even though an object may have multiple methods, only one method is active at atime, thus ensuring it has exclusive access to the object instance data. Within an object, thereare three system calls, hold, accept, and detach which can be used to control method processing. A method can be in either one of two states, held or accepting. If the method is in the heldstate, then it is not allowed to execute new requests, and must hold them in a queue for subsequent execution. In the accept state, the method is permitted to execute requests from its mesCHAPTER 2: Background and Related Work 38sage queue provided another method of this object is not active. Initially, all method interfacesare in the accept state, but as processing proceeds the object may instruct the runtime supportsystem to change the activity state of its method or methods of other objects of the same type.(An example when this is useful can be found in the implementation of a stack object. Whenthere is nothing on the stack one might decide to stop processing pop requests until some moredata is put on the stack.) With the facilities described so far, there is little opportunity to achieveparallelism. The detach command is an attempt to overcome this problem. If while executinga method, a detach call is made, the current execution thread detaches itself from the object. Todo this it makes copies of all the object and local data and retains sufficient information aboutthe object performing the invoke so that results may be returned. When the detach is performedthis releases the block on the method interface and a new method request can be removed fromthe queue and executed.The detach construct, which is similar to a fork ( ), allows limited concurrency withinan object. The ability to turn method interfaces off and on allows for control over the order inwhich methods are invoked. However, there are a number of problems with the approach takenin SINA. For example, since the detached thread only has copies of the object’s instance datait cannot make any lasting changes. The detached thread is a short-lived cloned version of theoriginal object, and once it has finished its computation, it ceases to exist. There is no abilityfor a detached object to synchronize with other detached objects, or even its parent.The detach statement is a heavyweight construct since it must create and initialize a newprocess before there is any parallelism. The parallelism in the system is therefore coarsegrained and programmer controlled. 
By permitting only one method at a time to execute, concurrency control is automatically provided.2.3.5.2 POOL-TPOOL-T [11] [12] [13] [14] is a language written in part to explore the possibilities for parallelism that may exist in larger-grained object-oriented systems. To accomplish parallelism amethod is broken into two parts:CHAPTER 2: Background and Related Work 391. The body, which constitutes the processing that must be performed to determinecompletely the result of the method.2. A post processing part, which is processing that is logically part of the methodbut does not affect the result to be returned. Once the body of the method hascompleted and returned the results to the invoker, the post-processing part of themethod is executed.The existence of the post processing part of a method allows the client to execute in parallel with the server, once the post processing part is entered. This form of parallelism is restrictive and dependent upon how much, if any, of the method’s processing may be done in the postprocessing part.The underlying implementation strategy uses message passing and a single messagequeue is associated with each object. To assure a degree of fairness, messages from the queueare serviced on a first in, first out basis. Only one method from an object can be executed at atime and the post processing part is considered part of the method. Concurrency control istherefore automatically provided by the system, but parallelism depends upon how many largegrained objects are created. Parallelism is at the procedure call level and is relatively coarsegrained.2.3.5.3 EiffelEiffel [75] [55] is an object-oriented class-based language supporting multiple and repeatedinheritance. It is a strong statically-typed language with support for parameterized classes anddynamic binding. One of the design goals that has had a significant effect on Eiffel is the concern with software correctness and robustness. This has lead to the ability to associate preconditions and postconditions with a feature’s (method’s) execution. The preconditions specify theset of conditions that must be satisfied when a routine is called, and the postconditions specifythe conditions that must be true when the method returns. If either a precondition or postcondition is violated, an exception is triggered. Eiffel was initially released as a sequential language, but since then, work has been undertaken to add support for parallelism [76].CHAPTER 2: Background and Related Work 40Support for parallelism is provided in two ways. One is through the addition of the keyword separate, that can be used to indicate whether or not a method can be run on a differentprocessor than the one the request comes from. The second modification is to the way preconditions are handled with respect to separate objects and methods. Preconditions are used, in theparallel case, to specify conditions that must be met before execution can proceed. 
Failure tomeet these conditions suspends the execution until they are met, and does not result in anexception.What is particularly interesting about this approach is that the parallelism is captured inthe declaration of objects and methods and not explicitly through an execution time statement.This means, that by examining a piece of code, it is not possible to determine the potential forparallelism without examining the object’s definition.The granularity of parallelism is at the procedure level and is therefore medium-grained.Concurrency control is automatically provided by letting only one method access an object ata time, and parallelism is created by marking objects and methods as separate.2.3.6 ChoicesThe approaches to parallelism that have been examined in this section have all concentrated onlanguage-level support for parallelism. Choices [26] [27] [881, an object-oriented operating system written in C++[80] uses a different object-oriented approach to providing parallelism.Instead of addressing the issues of parallelism from the language perspective, Choicesaddresses the problem by constructing operating system interfaces that can be tailored to theapplication. Briefly, Choices is designed for distributed and shared-memory multiprocessorsand has generic support for parallelism that can be customized for the application. This customization is termed problem-oriented concurrent programming.Choices is itself object-oriented and presents an object interface of the system. Fivegroups of functions (objects) needed to write a parallel application have been identified. Theseobjects supply services for:CHAPTER 2: Background and Related Work 411. Process creation. The process interface to create new processes for execution.2. Naming. The name server interface used to locate and share objects.3. Messaging. The message passing interface for the sending and receiving of messages.4. Memory sharing. The memory interface that permits the sharing of memorybetween processes.5. Synchronization. The synchronization interface provides a way for processes tosynchronize.Applications use these interfaces to create and manage the parallelism within the application. The advantage of this approach is that applications can be programmed in a particularway without regard for the underlying hardware implementation. The service classes can thenbe modified or subclassed to take advantage of a particular system, without need to modify theapplication. For example, a parallel application might be using message passing. On a shared-memory machine it might be possible to change these primitives to use shared memory to speedexecution. In this situation the interface remains the same, but the underlying support haschanged.Choices supports medium- to coarse-grained parallelism at the procedure or applicationlevel. Parallelism is explicitly created by the user and concurrency control is also the responsibility of the programmer. However, the object-oriented nature of the system services providesubstantial capability for some system support by including concurrency support directlywithin a class.2.4 Support RequirementsThe systems described in the previous section outline various approaches, across differentenvironments, to the problem of supplying support for parallel and distributed computations.By surveying these systems it was possible to identify the key system and language featuresneeded to provide comprehensive support for parallel and distributed programs in shared-memory and distributed-memory environments. 
A list of the identified features, and what each oneCHAPTER 2: Background and Related Work 42contributes to the programming environment, is given below. The classification of the systemsexamined in the previous section, with respect to the identified features, can be found inTable 2.1. The names used to identify the features in the columns of the table are highlightedat the end of each feature description. The identified features are:1. The model for achieving parallelism should be uniform across both shared-memory- multiprocessor and distributed-memory environments. The applicationshould not have to use one method to create and manage parallelism in a shared-memory environment and a different method in a distributed-memory environment. (Uniform interaction)2. The size of the entities being manipulated should be small enough to support andencourage code reuse. For example, functions and routines to manipulate ageneric list are more reusable than a complete database system. (Code reuse)3. It should be possible to create processes dynamically. This permits an applicationto adjust the number of parallel components to take advantage of different hardware environments or changing service demands. (Dynamic process creation)4. It should be possible to make a parallel request and collect the results at a latertime. This permits parallelism between a client and server, regardless of theserver implementation. (Delayed result acceptance)5. Conversely, the ability to generate parallelism on the server side is greatlyenhanced by the facility to return a result to the client while the server continuesto execute. This allows the supplier of a service to create parallelism independentof the manner the client is making the request. (Early reply)6. It should be possible to monitor the completion status of outstanding parallelrequests. This will allow the client to maximize parallelism with the server byperforming additional processing while waiting for the request to finish.(Request monitoring)7. It should be possible for a client to make a parallel request of a server, and thenignore the results. This recognizes that a client may not need a result and thataccepting the result introduces a needless synchronization point between the client and server that reduces parallelism. (Ignore results)8. It should be possible to serialize requests to a server. A major problem for certainapplications is nondeterminism with respect to requests. Given communicationsdelays and changing load patterns, requests can easily arrive out of order at theserver. Under certain circumstances, a collection of requests needs to be executed serially with respect to each other, but they can be executed in parallel withthe initiating thread of execution. An example of this is screen updates. TheCHAPTER 2: Background and Related Work 43updates may need to be performed in a specific order relative to each other, butthere is no need to execute the requests sequentially with respect to the initiatingthread. (Serialize requests)9. Concurrency control should be automatically supplied by the system. The application programmer should not have to explicitly manage concurrent accesses inthe system. (Automatic concurrency control)10. Support for parallelism should be well integrated into the system. That is, thesupport for parallelism should seem to be a natural part of the language used forprogramming. 
Typically, this means that support for parallelism is done at the language level to avoid presenting one programming model for the language and another for the external system responsible for the parallel and distributed processing support. (Language support)
11. Servers should be able to control the service order of requests. Without the capability to order service requests, some requests may be forced to fail even though a different processing order would have allowed them to succeed. Suspending a request and completing it later, or using guard conditions, allows the server to determine the order of processing and subsequent replies. (Selective request processing)
12. It should be possible to group requests together. Certain requests, such as requests to update a collection of terminal screens, can conceptually be viewed as one request. It should be possible to group these requests together and treat them as one. This facility allows groups of requests, which are logically related, to be treated as one request. (Request grouping)
13. The units of parallelism should be easy to identify and use. To effectively harness parallelism, the unit of parallelism should be naturally defined within the environment, and the mechanism for extracting the parallelism should act directly on the unit of parallelism. For example, a fork() call is not an easy way to create parallelism because it is difficult to identify the unit of parallelism and then start it executing. (Easy to identify units of parallelism)

As mentioned earlier, Table 2.1 provides a summary of the features supported by the systems that were surveyed. The rows of the table correspond to the language or programming system of interest, and the columns identify the feature of interest. In the table a "/" indicates a feature is supported; a blank is an unsupported feature, and a "-" indicates that a feature is not applicable. The entries in the table attempt to capture the "spirit" of the system, with a negative entry meaning that a feature might be supported, but normally one would not think of it as being supported. For example, a system like Parallel RPCs would work in a shared-memory environment, but that is not the environment it was designed for; therefore, it was given a negative entry for shared-memory support.

TABLE 2.1 Summary of language/system support for parallelism.
[The table rates each surveyed system against the thirteen features listed above, together with the granularity, shared-memory and distributed-memory columns described below. The granularity column classifies Parallel RPC, Streams and Pipes, Multiprocessors, Promises, SR, MUPPET, PVM, Linda, Distributed Smalltalk, Actra, SINA and POOL-T as coarse-grained (C); Poker, Eiffel and Choices as medium-grained (M); and Multilisp, Actors and CST as fine-grained (F).]

In addition to the features enumerated above, three other entries are also shown in the table. These are:
F is for fine-grained,and can be associated with parallelism at the statement or instruction level.2. Shared memory. The system is designed primarily for use in a shared-memoryenvironment3. Distributed memory. The system is designed primarily for used in a distributed-memory environment.In order to support programming in the types of parallel and distributed environmentsdescribed in Chapter 1, all the features identified in this section need to be present. If a featureis not present, then either the expressiveness of the system is lacking, and it is unable to handleparallel and distributed computation in a particular way, or else the programming and administrative burden of a commonly used feature is made the responsibility of the application programmer. The effect is that needless constraints are placed on the programming techniquesusable for programming in parallel or distributed environments. None of the systems surveyedprovide support for all the identified programming features. Most of the systems support abouthalf the features, which indicates a concentration on some subset of how to manage parallel anddistributed processing.The programming environment of interest in this work is comprised of both shared-memory and distributed-memory machines. To simplify programming in these types of environments, a single programming model must be supplied to the programmer. The programmerneeds to be able to develop applications that do not depend upon, or make assumptions about,the underlying target hardware environment.Based on a review of these features and the above observations, a broad outline of a system to support parallel and distributed programming can be developed. The requirements for ahigh-degree of code reuse, a uniform interaction environment for both distributed-memory andshared-memory environments, and easy-to-identify units of parallelism suggest using anobject-oriented language. Objects, with their strong data encapsulation and well defined inter-CHAPTER 2: Background and Related Work 46action points, are well suited to be the unit of parallelism and automatic concurrency control.Support for parallelism needs to be embedded in the language to ensure a consistent and uniform programming environment. This uniform environment applies to the programming environment presented for both shared-memory and distributed-memory machines. The systemshould also support the generation of parallelism from both the client and server side, and whenparallel requests are made the client should be able to ignore the results or monitor and collectthem later. For server objects, it is extremely important that some mechanism be provided todelay or suspend a request. Without such a facility, it is difficult to synchronize multiple threadsof control. Taken together, these features can be used to develop an easy-to-use, consistent anduniform environment for parallel and distributed programming on shared-memory and distributed memory machines. To support these identified features, the object-oriented system andlanguage Raven was developed. Raven has built-in support for parallel and distributed programming By using an object-oriented language, issues associated with reusing code, identifying the unit of parallelism, and providing a single uniform programming environment forshared-memory and distributed-memory machines are addressed. Some of the functionalitynaturally arises through the use of an object-oriented system, while others require explicit support from the underlying system. 
Adding system support does not affect the application visiblefacilities available to manage parallelism and distribution. Instead, it affects the programmingeffort by relieving the programmer of having to perform certain system administration relatedtasks. The system performs the administration automatically instead of making it the programmer’s responsibility.Raven addresses the issues associated with making parallelism accessible by incorporating support for a parallelism directly into the language. Raven identifies a single executionpoint within the language where parallelism can occur (Section 4.1). The result is class-basedparallelism (Section 4.2), which is parallelism implemented on the server side, and user-basedparallelism (Section 4.3), which is parallelism implemented on the client side. These two facilities provide support for early reply, dynamic process creation, and the ability to perform selective request processing. Creation of parallelism is supplemented with the introduction of theCHAPTER 2: Background and Related Work 47notion of Certificates (Section 4.3.1). Certificates provide the mechanism to monitor, track, andgroup the units of parallelism resulting from user based-parallelism. Properties (Section 4.4)are a way of associating system services with individual objects and are the mechanism usedto supply automatic concurrency control. InvokeStreams (Section 4.6) interact with Certificates to serialize certain types of parallel requests. By confining the creation of parallelism toa single point and providing ways to manage and monitor parallelism, the creation of parallelism at the client and user side can be addressed.If the Raven System were included in Table 2.1 it would be classed as a system that supports medium-grained parallelism and provides a uniform interface for programming in ashared memory and distributed memory environment. Its object-oriented nature encouragescode reuse while providing a well defined interaction point for creating parallelism. The welldefined interaction point simplifies the identification and management of the units of parallelism. Because the Raven language has built-in language and system support for parallelism, allthe other features for supporting parallel and distributed computation would be marked aspresent.2.5 SummaryThis chapter began with a discussion of the differences between distributed and multiprocessorenvironments and how they affect the programming of an application. This was supplementedwith material discussing granularity and how to construct parallel programs. A number of different parallel and distributed systems were then surveyed to provide a review of the work thathas been undertaken in this area. Each of these systems uses a different technique for managingand dealing with parallel and distributed computing. Some of the systems dealt specificallywith supporting parallelism, while other approaches were more general purpose in nature.From these systems, the most important features and useful features for supporting parallel anddistributed programming in shared-memory and distributed-memory environments were identified. None of the surveyed systems supported all the features, thereby indicating a biasCHAPTER 2: Background and Related Work 48towards supporting certain types of computations. 
To overcome this bias, the Raven languageand system with built-in support for parallelism was proposed as a way to provide this supportin a single, consistent programming environment.CHAPTER 3 The Raven System andLanguage OverviewRaven is an object-oriented system that runs on a number of different hardware platforms, ranging from a collection of workstations connected via a network to shared-memory multiprocessors. In the past, these types of environment have used entirely different programming modelsand primitives designed to take advantage of the particular environment. By using objects asthe primary interaction mechanism, Raven makes it possible to reduce greatly the distinctionbetween programming for a distributed-memory environment and programming for a shared-memory environment. This chapter introduces the Raven programming language and environment that form the basis of the work described in the remainder of this thesis.3.1 The Passive and Active Object ModelsIn object-oriented systems, two programming models [31] are predominant, the passivemodel [83] [381 and the active model [67j[68]. Each can be related to a particular way of implementing an object system. The distinction between the two models is based on how the meth49CHAPTER 3: The Raven System and Language Overview 50ods of an object are executed and the effect that has on the way objects are programmed.Methods, or behaviors, are the routines that manipulate the data an object encapsulates.Conceptually, the active model associates a process with each object. Invocation requestsare accepted at the object and processed sequentially. The programmer is, therefore, presentedwith a system that is seen as accepting the request, processing the request, returning a result,and waiting for the next request. The effect is that there is no concurrency within the object andthe programmer does not need to take explicit measures to provide concurrency control.In the passive model, there is no process associated with an object. Instead, a thread ofcontrol takes execution to the object, selects the method to execute and executes the code. Theresult is a model that has many similarities to the subroutine call in a local environment or aremote procedure call in a distributed environment. Since an execution thread goes to an object,it is possible for multiple threads of control to be active within an object at once. This has obvious programming ramifications. It is now the programmer’s responsibility to ensure the consistency of an object’s instance data by explicitly identifying critical code sections andprotecting them with mutual exclusion primitives like semaphores or monitors.The passive and active object models differ primarily in how concurrent access to anobject is handled. This directly affects the programmer by dictating whether or not explicitmanagement of concurrent access to an object is required. The models also affect the overalldesign of an application by placing constraints on the type and scope of objects, along with howthe objects can interact with each other. Consider Figure 3.1, which shows two objects invoking on each other: objects are ovals, the dotted line indicates the flow of execution, and the rectangles are particular methods. In this example Method_i of object A invokes Method_2 onobject B. Method_2 of object B then proceeds to invoke Method_3 of object A. If this scenariowere programmed in an active model, a deadlock would occur, since Method_3 could not complete until Method_2 had finished. 
In a passive model the execution sequence would be permitted, but it would be the application’s responsibility to manage the concurrent access to theobject. Depending upon the concurrency control implemented by the programmer, this scenarioCHAPTER 3: The Raven System and Language Overview 51might also deadlock in the passive model. It is clear that the programming model affects morethan just the implementation of a single object. The programming model affects the way inwhich a problem is decomposed into objects and how those objects are organized. Instead offocusing on how an object system is implemented, and the resulting programming model consequences, the Raven system has adopted an object model that concentrates directly on theissue of concurrency.In the Raven model, concurrency control is performed automatically at the method level.Whether a method can execute on an object depends upon the other methods executing in theobject and the execution source of a method (see Section 4.5). Basically, methods are classifiedas read methods or write methods, with Raven allowing multiple readers or a single writer tobe active in an object. However, when the read and write methods are associated with the samethread of control, both methods are permitted to be active in the object. In this case the read andwrite method interactions are viewed more on the level of a subroutine call than as a newrequest emanating from a different thread of control. Within the context of the scenariodepicted in Figure 3.1, the invocation of Method_3 on object A by object B would not block.In many respects, the Raven model of programming is a hybrid of the passive and active mod-Object AFIGURE 3.1 Objects invoking on each other.CHAPTER 3: The Raven System and Language Overview 52els. Like the passive model, Raven has the notion of a thread of control. The thread of controlestablishes relationships between objects at runtime. Within an individual object, however, thenotion of a thread of control does not force the programmer to explicitly take action to protectan object’s instance data with critical regions. Like the active model, the programmer canassume that only one method is active in an object at a time, since the system provides the concurrency control.The Raven object model is independent of the underlying object implementation modeland makes no assumptions about the implementation. Both active and passive implementationmodels were considered when Raven was being designed. Ultimately, a passive implementation model was selected because it was perceived to be easier to implement in the local case,would be better able to take advantage of a shared-memory multiprocessor environment, andwould have less execution overhead since context switching on a method invocation would notbe required. Although the selection of a passive implementation model affects how certain services are supplied, it does not manifest itself at the application program level.3.2 RavenOne way to explore the system and language issues associated with supporting and using parallelism is to take an existing system and modify it to provide the required services. Thisapproach has several disadvantages with respect to some of the goals of this work. Two majorrequirements of an environment to explore these goals are:1. Portability: To demonstrate the usefulness of the system, the system needs to beportable across architectures and machine types. In this context, the cost of porting the software, both in terms of money and time, must be considered. 
Commercial software with its cost, and free software which is often designed to run inonly certain environments, cannot meet these portability requirements.2. Changeability: To be able to support new concepts and to deal with performanceand implementation issues, the system needs to be changeable. With an integrated approach to supplying parallelism it is possible that new services orchanges to existing ones will require changes to the operating system, runtimesystem, or language. When problems arise with the implementation it should beCHAPTER 3: The Raven System and Language Overview 53possible to make the changes to the component(s) of the system most responsiblefor the problem. In a restricted software environment this is not possible.Combining the above requirements with the goal of developing a system to explore theissues associated with building parallel and distributed systems in an object-oriented environment, it was decided that these goals could best be met by developing a new system. The resultis the Raven System [1], a project in the Computer Science Department at the University ofBritish Columbia. To speed up the development of such a system it was decided that, wherepossible, existing software would be used. Besides the obvious implementation speedup advantage, this also permits individual software components to be evaluated independent of theRaven system. This evaluation will identify what software can be used and where design andprogramming effort needs to be concentrated to improve system performance. This approachmakes it possible to concentrate the development effort on support for the new features of interest and on improving performance in the existing system components.The Raven system consists of four main components: a compiler, runtime system, classlibrary, and virtual machine (threads). The separation of Raven into these different componentsrepresents a logical breakdown of the functionality of the various parts of the Raven system.The separation also reflects how likely a component of the system is to change. The virtualmachine component of the system is expected to be the most stable since it forms the foundation of the system, and changes to it can ripple through the rest of the system. This would befollowed by the runtime system, the class library and finally the user’s application.3.21 The Raven LanguageThe programming language, also called Raven, is heavily based in its syntax on the C programming language. The Raven class system is similar to that of Objective C [351 andSmailtalk [49], but unlike these languages it is statically typed like C++ [97]. Although the language is statically typed, the method binding is done at runtime, thus providing dynamicmethod binding.CHAPTER 3: The Raven System and Language Overview 54To illustrate some of the basic constructs in the Raven language and to provide a basis forintroducing the basic terminology and concepts associated with an object-oriented system, consider Figure 3.2. This is a complete Raven program that performs a small simulation of a pencilcup. Common stationery items can be added to and removed from the pencil cup and the weightof the pencil cup computed. Raven keywords and predefined methods have been highlighted inthe example.Object-oriented languages accomplish work by manipulating objects, and by relyingupon properties of encapsulation, inheritance and organization [78] to do this. Conceptually, anobject is an entity that has its own private data and a collection of functions to operate on thatdata. 
An object is in effect the live data structure that changes during program execution. Howan object is organized and behaves is determined from its class definition. A class definitionspecifies the exact organization of the data and the routines to manipulate the data. The classcan be viewed as a prototype or template for a particular data type. Classes are themselves further organized into a subclass/super class hierarchy. At any level within the hierarchy a classinherits all the attributes from the classes it is descendant from. Some languages support multiple inheritance, but Raven does not. In Raven each class is directly descendant from one class.At the top of the Raven class hierarchy is the class Object from which everything is descendant.Figure 3.3 provides a pictorial representation of the relationship between the programmer-defined classes in the pencil cup example. Object is at the top of the hierarchy and has as itsdirect descendents the classes Cup, Weighable and Main. The class Weighable has two otherdescendants, Pencil and Scissor. At any level the classes in the hierarchy inherit all the attributesof the higher level objects they descend from. This includes both the higher level objects’instance data and their methods.Although inheritance ensures that a subclass is a subtype of a super class, Raven supportssub-typing through contravariant type-checking (a conservative type equivalence policy) [1].Raven’s type checking rules are defined as follows:CHAPTER 3: The Raven System and Language Overview 55#include <Basic.r>#include <Array.h>class Weighable {wght: Int;behavior weight() : Int;constructor(obj_weight: Int);}constructor { wght = obj_weight; }behavior weight { return wght;}class Pencil <- Weighable {constructoro;}constructor{ super.constructor(5); }class Scissors <- Weighable {constructorO;}constructor{ super.constructor(20); }class Cup {items : Int;cup : Array[Weighable];constructorO;behavior add(item : Weighable) : Int;behavior remove() : Weighable;behavior weight() : Int;}constructor {cup = Array[Weighablej.new(1);items = 0;}behavior add (items = items + 1;cup.atPut(items, item);return 1;}behavior remove {if (items > 0) return cup.atGet(items--);else return nil; }behavior weight {var total_weight, i : Int;total_weight = 40; llweight of cup)for (i = 1; i <= items; i++)total_weight += cup.atGet(i) .weightO;return total_weight;class Main { constructoro; }constructor { super.constructorO;}behavior start {var cup : Cup;cup = Cup.newO;}cup.add(PenciLnew);cup.add(Pencil.newO);cup.add(Pencil.new);“The pencil cup weights”.print;cup.weightO.printO; “\n”.printO;cup.add(Scissors. newo);“The pencil cup weights “.printO;cup.weight.print; “\n”.printO;FIGURE 3.2 Raven pencil cup exampleCHAPTER 3: The Raven System and Language Overview 56A variable of type T can reference any object of class S, if S is a subtype ofT. S is a subtype of T if S is identical to T or if the following conditions hold:1. S provides at least the behaviors of T.2. For every behavior in S that has a corresponding behavior in T, the corresponding behavior has the same number of parameters and results. (This is exclusiveof the constructor () method.)3. The type of the result returned by S’s behavior is a subtype of the result returnedby T’s behavior.4. The parameters of T’s behaviors as subtypes of the parameters to the corresponding arguments of S’s behavior.5. S contains at least the public instance variables of T. (Public instance variables,are instance variables visible outside the class.)6. 
The corresponding public variables in S and T have identical types.This means that a class can be a subtype of another class even if it is not part of that class’sinheritance hierarchy. These typing rules are illustrated in Figure 3.4. In the example, the classEraser is defined, but it does not inherit from Weighable. It is a subtype of Weighable because allthe methods it has in common with Weighable (i.e., weight ()) have the same number andFIGURE 3.3 Pencil cup class hierarchyCHAPTER 3: The Raven System and Language Overview 57class Eraser {wght: Int;wght_KGs : Int;behavior weight() : Int;behavior weight_KG() : Int;constructoro;}constructor { wght_KGs = wght /1000; wght = 1 0000;}behavior weight { return wght; }behavior weight_KG { return wght_KGs; }cup.add(Eraser.new); II Valid usage of instance of EraserFIGURE 3.4 Eraser as a subtype of Weighable.type of parameters, and return the same result. Additionally, the common public instance variable (wght) has the same type. Since instances of Eraser are sub-types of Weighable, they canbe used anywhere that an instance of Weighable can. Implementation inheritance is, therefore,the main use of inheritance in Raven.In addition to explicitly defined classes, Raven supports a second type of object, the basicorprimitive object. These are basic Raven entities that cannot be decomposed further, are supported and manipulated by the compiler, and are manipulated directly by the hardware. Floating point numbers and integers are the current primitive objects. These are denoted as Int, andFloat, within Raven declarations. User-defined classes are complex objects and are composed of primitive objects and references to other complex objects.Class definitions in Raven start with the keyword class and terminate when a new classdefinition begins or when the end of the file is reached. The first class defined in the pencil cupexample (Figure 3.2) is the class Weighable, and its definition terminates when the definition forthe class Pencil begins. A class definition consists of two parts: one part describes the data itemsCHAPTER 3: The Raven System and Language Overview 58for the class, and the other provides the definitions of the routines that must be used to manipulate the data in the objects of a particular class. The data defined within a class is referred toas the object’s instance data. The data defined within a method is referred to as method data.Unlike a struct in C, a class also defines the routines that can be called to manipulate theinstance data. The routines to manipulate the instance data are called methods or behaviors. Itis only through methods that an object’s instance data can be modified. The class Cup definesthe methods add(), remove, weight() and constructor.An object, alsoreferred to as an instance, is a particular instantiation of a class. To provide a consistent objectview, classes themselves are instances of classes. When a class, like Cup, is defined, the Ravensystem automatically defines a meta-class for the defined class. The meta-class has its own predefined methods. One of the class’s predefined methods is new 0. The method new () isinvoked on a class object to create an instance of that class.All objects in Raven are identified by an object ID (OlD), and variables in Raven alwaysreference a valid object. Instance and method variables are initialized by the system to reference the predefined object, nil. An object reference (OlD) is an identifier, or capability, thatcan be mapped to the internal representation of an object. 
The internal representation containsinformation needed to locate the object’s instance data and to lookup methods.Objects support the is-a relationship in that each object is a particular type. Figure 3.5provides a pictorial representation of the is-a relationships between instances of objects, classdefinitions, and meta-classes along with the inheritance hierarchy and class organization of aRaven environment at runtime. In the pencil cup example the classes Pencil and Scissor inheritfrom the class Weighable.Each Raven class must define the method constructor ( ) . This method is automatically executed when an instance of a class is created. The code within the constructor is responsible for performing any object-specific initialization and for calling the parent object’sCHAPTER 3: The Raven System and Language Overview 59InstancesFIGURE 3.5 Class organizationconstructor. The parent object is referenced through the predefined object super; examples ofthis are shown in several of the constructor methods of the pencil cup example.Every Raven program must also define the class Main with the method start 0. Thisis analogous to main () in a C program in that execution starts with the start method, afterthe constructor for Main has been called. In the start () method of the pencil cup example,a local variable of type Cup is defined and a new instance of Cup assigned to it. Three pencilsare created, and added to the cup. The cup is weighed and the weight is printed. After that, an............#-. Is a relationshipsInheritance pathsClasses MetaClassesCHAPTER 3: The Raven System and Language Overview 60instance of Scissor is created and added to the cup. The cup is again weighed and the cup’sweight printed.In summary, the pencil cup example defines a pencil cup object which can have otherobjects added to or removed from it. That is, there are a number of operations that can be performed on this pencil cup. These operations are known as methods or behaviors. The actual cupdefinition itself is defined by the class Cup, which is a template for the canonical cup. In moresophisticated systems the new operator could be parameterized to impart different qualitiesonto the pencil cup. Cup contains two instance variables: items, an instance of the primitiveclass Int, and cup, an instance of the parameterized class Array. The parameter in this last caseindicates that the array contains items of the type Weighable.Although not shown in the pencil cup example, Raven methods can also support copysemantics. The arguments to methods are always references to objects. If the same reference ispassed to multiple methods the same object is updated. Under certain circumstances a methodmay want its own private copy of an object. To support copying, individual parameters can bemarked as copy parameters. Figure 3.6 shows how the add () method of the class Cup couldbehavior add(copy item : Weighable) : Int;FIGURE 3.6 Tagging a parameter as copyable.have the i tern parameter marked as being copyable. (A returned result can also be marked ascopyable in a similar fashion.) When a parameter is marked as copyable, a copy of the objectis made, and it is the reference to the copied object used in the method. When an object is copied like this, the copy is a deep or recursive copy that may cross machine boundaries. 
In a recursive copy all top level objects are copied, as are all the objects referenced by the copy and so on.CHAPTER 3: The Raven System and Language Overview 613.2.2 Raven Class LibraryPart of the power of an object-oriented system results from the reuse of code. To address thisissue, a Raven class library is provided. The class library is written in Raven and builds complex objects from primitive Raven objects and other existing classes. This provides a basis fora class system that the application programmer can work with. The Raven class library alsodefines a number of classes used by the compiler. For example, some of the support for parallelism is provided through Raven classes and the compiler generates code to make direct useof these classes. Raven classes in this category are Thread, Companion, Certificate, and System.Although these classes are accessible to the Raven programmer, and can be treated like any userdefined class, their primary purpose is to support the class environment expected by the compiler.Some of the supplied Raven classes also interact with the Raven runtime support system.Because the runtime system is written in C, the classes are a mixture of Raven code and specialescapes to make procedure calls, using C, into the runtime environment. To be able to do thisproperly, the programmer needs to have an intimate understanding of the underlying structureof Raven objects and the runtime environment. Unlike other languages, which often providesupport for inter-language procedure calls, Raven does not supply such support. The outputfrom the Raven compiler is C code and the Raven compiler provides a mechanism that allowsarbitrary code to be inserted into the output file. Raven classes that need to use the Raven run-time support system must use this mechanism to insert C code directly into the compiler output.To be successful, the programmer needs a thorough knowledge of the intricacies and representation of objects and how they are supported by the runtime system. Clearly, classes requiringthis level of understanding of the construction and implementation of the Raven system needto be provided as part of the class library and not as user-written code.Not all the classes in the Raven class library interact with the runtime system. Someclasses are included because they are expected to have considerable reuse by other classes. Theclasses Queue, List, and Stack are examples of such classes.CHAPTER 3: The Raven System and Language Overview 62The class library also makes a contribution to defining what a Raven environment is.Depending upon how the classes are defined and the behaviors the classes have, differentRaven environments can be constructed. To work, the class library has to supply only a fixedset of classes and behaviors expected by the compiler. Any classes and behaviors that the compiler expects to use must also provide the types of results and perform the expected actions. Forthe work described in this thesis, the term Raven refers to a particular environment defined bythe compiler, class library, runtime system and the virtual machine support. Since Raven is achanging language, changes across Raven releases are to be expected and have occurred.3.2.3 The Raven CompilerPrior to the existence of the Raven compiler, all the support for objects and their use was donein the C programming language, and required a preprocessor, runtime support library and considerable effort by the programmer. 
Although this combination of development tools made itpossible to develop and test many of the ideas and feature implementation approaches ultimately used in the Raven system, it was difficult to use. The preprocessor and macro packagelacked the sophistication to handle the manipulation of complex expressions without becominga compiler. Taken together, all these idiosyncracies made the existing Raven environment difficult for other people to use. To add to the confusion, this environment also presented two different programming models that can be difficult to reconcile within a program: the regular Clanguage, and the Raven object model with its own programming rules and regulations. Asmore people became involved with Raven, the desire for a compiler, and the improved language support it would provide, increased until a compiler was actually written [11. The introduction of a compiler helped to provide a unified programming model where all datamanipulation is done through objects. Stronger type checking and rule enforcement also provides more feedback to the programmer at compile time, thereby reducing the number of runtime errors. During compilation, compilers maintain more information about the currentenvironment than a preprocessor and macro package does; therefore, the compiler can passCHAPTER 3: The Raven System and Language Overview 63along this information to the runtime system to improve performance and increase functionality.The Raven compiler is based on an LALR(1) grammar, using the compiler developmenttools yacc and lex [60]. All the supporting software is written in ANSI C and is compiled usingthe Free Software Foundation’s C compiler, gcc. The output from the Raven compiler isANSI C with the final compilation and linking for a particular environment being done withgcc. Any platform that has gcc running on it should be able to build and run the Raven compiler.3.2.4 The Local Raven EnvironmentThe Raven runtime system is a layer of code between the Raven language and the underlyingthreads system. It is designed to be application-independent and provides a set of runtime services that the Raven compiler can use. In essence, it is a collection of functions that providesthe basic character and functionality of the Raven system. The runtime system is the heart ofRaven and taken together the services provided by the runtime system define the nature andcharacter of the Raven environment. Among other things, the runtime system:• Defines the bootstrap level objects;• Defines the routines used to specify class definitions;• Defines what an object looks like in the running system;• Provides the method dispatch code;• Provides support for remote objects (Section 3.2.6);• Defines the Raven environment.A short elaboration of each of the above points will now be undertaken. One of the majorfunctions of the runtime system is to provide the mechanism to be used in defining new classes.The bootstrap classes defined by the runtime system are Object, MetaObject, Class, MetaClass,and MetaMetaClass. These are the objects and classes (Figure 3.5) at the top of the class hierarchy, and all other objects descend from them. Being the base classes, these initial classes pro-CHAPTER 3: The Raven System and Language Overview 64vide the methods called to construct new classes and instantiate objects. 
Consequently, theseclasses must be created as soon as the Raven environment begins execution.The Raven compiler is responsible for emitting the code needed to construct the classhierarchy for the application during the initialization of the Raven system. From the runtimeperspective, the class hierarchy is a collection of data structures that provide the information toconstruct an instance of an object. As well as providing information about the basic data layoutof an object, a class also specifies the routines associated with the object.Raven supports dynamic method lookup and this is done by the Raven method lookuproutine that is called when an invocation is performed. Depending upon the type of object andthe way it is being used, different invocation scenarios are possible, and each invocation scenario has an invoke routine associated with it. To make use of this flexibility, the runtime system assigns an invocation handler to each object when it is created. The selection of the handleris based on the special processing needs of the object, how the object is created, and the currentruntime environment. A default generic invocation routine can be assigned to objects thatrequire special services not supported by the existing invocation routines. The default handleris general purpose and supports all types of invocations, but at an increased execution cost. Thisapproach has the advantage that objects which need complicated invocation handlers can havethem without introducing a performance penalty for the objects that do not. Figure 3.7 providesa pictorial representation of how invocation handlers are assigned to objects. In the figure thereare 5 different objects and 3 types of invocation handlers. Each object is assigned its own handler. In this case, objects A and C use a generic invocation scheme, object D uses a remote invocation handler, and objects D and B use a simple invocation handler.In its most generic implementation, an invocation consists of several parts. As soon as theinvocation routine is entered, the method that needs to be run is looked up. A method cache ismaintained to make this operation fast. Once the method is located, a series of pre-invoke handlers is called. Next, the actual code to execute the actions of the method is called. Upon themethod’s return, a series of post-invoke handlers is called. Each object has its own require-CHAPTER 3: The Raven System and Language Overview 65FIGURE 3.7Object Aments for pre- and post- invocation method handlers. At object creation time, the Raven run-time system selects the invocation handler best suited for dealing with the invocation needs ofthe object.3.2.5 The Virtual MachineThe threads portion of the Raven system, which is derived from the work done in [80], definesa virtual machine, or more accurately an operating system interface, that the other layers of theRaven system can use. Through this mechanism the threads package provides a set of systemcalls that insulate the other layers of the Raven system from the particular machine architecture,while at the same time providing the functionality of an operating system. 
The threads library,however, does more than map its calls to the matching calls of the native operating system: italso provides support for memory management, thread creation, scheduling of threads, semaphores, and send/receive/reply primitives.Object EObjects with different invoke routines.CHAPTER 3: The Raven System and Language Overview 66The threads library also contains all the operating system and hardware-specific code forthe Raven system. Porting Raven to a new environment consists primarily of porting thethreads package. Depending upon the underlying architecture and operating systems, theamount of effort to port the threads software will vary, but other parts of the Raven system areunaffected. As an example of this portability, Raven is implemented within a process onmachines running UNIX and as a collection of Mach threads under Mach 3.0.3.2.6 Supporting Distribution in RavenThe previous two sections have provided an overview of what a local Raven environmentappears like. In Raven, invocations on objects are not confined to the local environment, and itis possible to invoke methods on objects managed in other Raven environments. The approachto dispatching methods makes support for remote object invocation relatively straightforward.Within a Raven environment, each remote object used in the environment has aproxy. A proxyis a local object that acts as the agent, or representative, for the remote object. To the application, a proxy is identical to a local object. To the runtime system, however, the proxy is anobject with remote instance data; therefore, the method invocations need to be run remotely.Since a proxy is a real object, it has an invocation routine associated with it. To support remoteinvocation a proxy’s invocation handler is set to the invocation routine responsible for remoteexecution. Remote invocation requires additional system support to manage the network communication between the source and the destination sites. This includes managing of a methodinvocation protocol and support for the marshalling of data between hosts. Figure 3.8 showsthe relationships between a proxy object and a remote object. In the example, an initiatingobject in environment A performs an invocation on the proxy for object A. To the initiatingobject, the proxy appears like any other object. The proxy identifies the remote request handleras its method invocation routine. The remote request handler packages the request and has itcommunicated to the remote site. At the remote site the request is unpacked and the methodinvoked on object A by a remote worker. Results are returned by the reverse path. The low-CHAPTER 3: The Raven System and Language Overview 67FIGURE 3.8 Example of a remote invocation.level transferring ofdata between Raven environments is handled by the communication managers which exist in each local Raven environment.The terms Raven environment and Raven system are used throughout the thesis. Themeaning is usually apparent from the context. Figure 3.8 shows two Raven environments interacting with each other. The terms are used to convey information concerning the implementation. An environment is a single address space with a particular configuration. A Raven systemis a collection of Raven environments.3.3 SummaryIn this section the Raven language and system have been introduced. Raven consists of a compiler, runtime system, class library, and virtual machine environment. 
Each of these compoRaven environment A Raven environment BCHAPTER 3: The Raven System and Language Overview 68nents has been briefly described to provide an overview of the system and the type offunctionality assigned to each component. An example program is used to introduced the syntax of Raven and some of the underlying concepts of object-oriented systems and how theyrelate to Raven. Subsequent chapters describe the extensions and modifications made to theRaven system to support parallel and distributed progranming.Support for Paralleland DistributedComputation in RavenMost operating systems provide support for the creation of processes, and, in its most basicsense, that is all that is required to supply some form of parallelism. However, support for parallelism encompasses more than a one-dimensional way of creating new threads of control. Tobe useful, there needs to be cooperation, or common purpose, between the execution threads,and the programming environment needs to make that possible. To provide support for parallelism in an integrated fashion, a whole approach needs to be developed, extending from thecreation of the multiple threads of control to the mechanisms that exist for the cooperating execution threads to interact and synchronize with one another. Thus, to provide a cohesive integrated environment for supporting parallel and distributed computation, one needs to questionthe whole premise that the operating system and programming language are separate entities.Current environments require the programmer to maintain two different programming perspectives. At one level there is the programming model presented by the programming language; ata second level is the view provided by the operating system. Depending upon the aspect of theprogramming problem being examined, the programmer must interact with different program69CHAPTER 4: Support for Parallel and Distributed Computation in Raven 70ming models and then reconcile them within the programming language. This also means thatthe end programming environment varies from machine to machine, depending upon the operating system and functions it provides. This certainly does not facilitate the development ofportable code.Although parallelism can be introduced simply by providing a mechanism to create multiple threads of control, this does not make the task of programming in a parallel or distributedenvironment easier. To simplify the task of programming in a parallel or distributed environment we exploit the strong data encapsulation and well defined object-access interface provided by an object-oriented system to merge the facilities for parallelism with the generalprogranmiing techniques used in object-oriented systems. The motivation for this approachstems from the observation that, at least in the near term, compiler-generated parallelism suffers from the problem of making parallelism either too fine- or too coarse-grained. To combatthis problem the programmer needs to be actively involved in specifying and controlling parallelism. As a result, the decision was made to incorporate features for coarse-gramed parallelism (which assumes that any parallel activity will involve several thousand instructions)directly into the Raven programming language. 
This has resulted in a multipronged approachto supplying and supporting parallelism.The result is new syntax within Raven for specifying parallelism, the introduction ofproperties to specify concurrency control and to manage the class explosion, along with modifications to the runtime system to manage and control resource allocation. To support distributed computations, the system provides some for object location transparency.4.1 Specifying Parallelism in RavenExisting approaches to parallelism require the designer to decompose a program intofunctions and procedures as well as processes. As a result, the programmer must deal with theissues associated with interprocess communication and process synchronization. This additional step adds complexity to the task of progranuiiing and increases the design, programmingCHAPTER 4: Support for Parallel and Distributed Computation in Raven 71and debugging effort needed to get a program running. Raven supports the explicit creation ofprocesses through the supplied Thread class. But, such an approach does not adequatelyaddress the issues associated with providing parallelism by integrating it at the language level.The encapsulation of the low-level system thread entities into a Thread class ensures a uniformobject view for process creation at the application level. The actual act of creating the threadand associating an object and a method with it, however, involves considerable effort by theprogrammer. Under many circumstances it should be possible to identify parts of the code thatcan run in parallel and then delimit those regions, without having to manage the thread creationexplicitly. Ideally, it should be possible to take an existing program and make only minorchanges to achieve parallelism.Since methods provide the mechanism for object interaction, it is logical that a modification to the way methods are used or specified would provide an easy-to-understand way ofachieving parallelism. Two possibilities arise. First, parallelism can be specified directly withinthe methods of the class: this is referred to as class-based parallelism, since the parallelism isthe result of the way the class is implemented. A second possibility is that the invoker createsthe parallelism at invoke time: this is user-based parallelism, since the user of the class isresponsible for specifying the parallelism.4.2 Class-Based ParallelismClass-based parallelism refers to parallelism that results from the way an object is implemented rather than the way it is used. With class-based parallelism the class implementordesigns methods in such a manner that the method’s use results in parallelism between aninstance of the class and its invoker. This parallelism is accomplished through the use ofRaven’s early result and delayed result constructs.CHAPTER 4: Support for Parallel and Distributed Computation in Raven 724.2.1 Early ResultThe return statement in most programming languages serves two functions: it returns execution control to the caller, and it returns a result. In Raven these two functions can be separated. A result statement can be executed which returns a result to the invoker, thusunblocking the invoking object while continuing execution in the called method. This resultsin an additional thread of execution. A method’s execution completes when a return statement is executed, either explicitly or by completing the method. 
Once a result statement hasbeen executed, however, the value of any subsequent return statement is ignored.behavior postMail {return_code = message.validFromFieId;result return_code; II added result lineif (return_code == OK)mesage.sendMessageToRecipeints;return;FIGURE 4.1 Example code demonstrating early resultFigure 4.1 illustrates how, through the addition of a single result statement, a formerlysequential method is parallelized. For this approach to be effective, the code following theresult needs to be sufficiently coarse-grained to amortize the cost of thread creation.4.2.2 Delayed ResultWhen an object is being accessed concurrently, it is conceivable that the received method invocation order does not correspond to the order the invocations need to be performed in. Forexample, in the readers/writers problem a read request may be made before there is data available for reading. In these circumstances, it is desirable for the processing of a method to bedelayed or suspended. An approach used in other systems is to place guard conditions on theCHAPTER 4: Support for Parallel and Distributed Computation in Raven 73execution of a method. The guards specify the conditions permitting method execution. Oneproblem with the guard approach is that once the execution of a method has started it must becompleted. Therefore, all the conditions needed for a successful execution must be determinedin advance and only those methods satisfying the guard conditions can be permitted to execute.Depending upon the complexities of the execution restrictions, setting up guards can be anonerous task. Another potential problem with guards concerns their implementation and howoften the guard conditions must be rechecked to determine whether any of the pending methodscan now be executed.In Raven a different approach, that of a delayed result, is used. With this approach anobject accepts and processes methods in the order they are presented, but a method does nothave to run to completion. A method may choose to stop executing so that other methods mayexecute, or to wait for certain conditions to be met, such as resources becoming available. Thishas the advantage over guards in that the programmer does not need to specify all the conditions needed for a method to execute at method invocation time. This eliminates the need torepeat the guard logic inside the method if there are multiple condition sets permitting execution.When a delayed result is used, the method executes until it determines that it needs towait, at which point it executes a leave statement (see Figure 4.2). When the delayed resultis performed, the method terminates without returning a result, thus keeping the invokerblocked. This contributes to parallelism by permitting multiple methods to be active in anobject.’ The invoker remains blocked until another thread uses the result statement to supply the return result.Since there may be multiple threads of control waiting on a delayed result, each waitingthread must be identifiable; therefore, a thread has a me variable associated with it, which identifies the thread of execution. As Figure 4.2 illustrates, when a method wants to use a delayed1. 
The controlled keyword of the Resource class definition indicates that instances of this object are subject to systemadministered concurrency control.CHAPTER 4: Support for Parallel and Distributed Computation in Raven 74class Resource <- Object controlled {resource_state: lnt;waiting_threads: Queue;behavior getResourceO: Int;behavior freeResourceO: Int;,havior getResource { behavior freeResource {if (resource_state == FREE) { next_user: cap;resource_state = IN_USE; if (waiting_threads.state EMPTY) {return OK; next_user = waiting_threads. nextltemO} result [next_user] OK;waiting_threads.add(me); } elseleave; resource_state = FREE;FIGURE 4.2 Use of delayed result.result it saves the value of its me variable before doing the leave so that the suspended threadcan be identified. The freeResource () method of the example shows the use of theresult statement in returning a value for the method that did the leave. The me variable isthe object ID of the currently executing Thread object.Upon executing result, the specified waiting object (i.e. the object enclosed in brackets) receives the result for its method invocation, and its execution thread is resumed. Thisimplies that no statements in a method which follow leave are executed. In the invokedobject, control is not returned to the statement following the leave but to the caller, exactlyas if a return had been done. The thread of control responsible for completing the delayedresult also continues to execute, thus executing in parallel with the original thread.4.3 User-Based ParallelismFor some objects it is not possible to construct an operation that can respond with a early result.Additionally, relying upon one object to provide another object’s parallelism is not in keepingCHAPTER 4: Support for Parallel and Distributed Computation in Raven 75with the object-oriented design philosophy of hiding the implementation details of a methodfrom the user of the method. Furthermore, it is risky to trust parallelism to another object, sincea change in the implementation technique of the methods for that object could eliminate orreduce the amount of parallelism within the system. To combat this problem, it is essential thatthe invoker of an operation have the ability to specify whether or not the invocation of a methodshould be done in parallel with the object’s own execution. Certificates and companionsaddress this problem.4.3.1 Certificates and CompanionsA companion is a mechanism to generate one or more threads of execution running inparallel with each other and with the initial thread of execution. Specifying a companion consists of providing a set (block) of method invocations (requests) to be run in parallel with themselves and the initiating thread. Each time a companion is created, a corresponding Certificateobject is generated and is obtainable by the programmer. A Certificate is itself an object. A Certificate object identifies a companion and contains status information about the companion andresponds to methods that can be used to monitor the companion or to wait for its completion.The only statements that may appear in the companion are method invocations(Figure 4.3). For the purpose of determining the arguments to the method invocations, the initiator’s execution is suspended until all the parameters for the methods have been evaluated.This ensures that the variables used in the evaluation are not unexpectedly changed by the initiator. 
Like C, where the order of parameter evaluation for a function is undefined, the evaluation order of a method’s parameters is also undefined. To aid the understanding of howcompanions work and are organized, consider a part of an application that has as its goal thedisplaying of the current time on two workstations through the displayTime () method.From the application’s perspective, the displaying of the two times can be done in any order,or in parallel and the application does not need to wait for the completion of the di splay—Time () method. Figure 4.3 shows three possible ways to do this. The example starts byCHAPTER 4: Support for Parallel and Distributed Computation in Raven 76cert : Certificate; II Variable declarationresl = workstation 1 .d isplayTime;res2 = workstation2.displayTime;cert = !{ resl = workstationl.displayTimeO,res2 = workstation2.displayTime() }!.starto;cert = !{ resl = workstation 1 .displayTimeO,res2 = workstation2.displayTime() };cert.startO;FIGURE 4.3 Examples of certificates and companions.declaring an instance variable of type Certificate. The first technique performs a standard invokeon the workstation objects: there is no parallelism and the application waits for each method tocomplete before proceeding. The second approach encloses the method invocations within thecompanion delimiters of ! { and }!. Taken together, all the methods between the companiondelimiters form the companion. The methods in the companion run in parallel with one another,and in parallel with the thread of control that created the companion. Each method within acompanion is executed through an instance of the class CompanionThread. (A CompanionThreadis a subclass of Thread.) When a companion is created, an instance of Certificate is created andreturned to the application. Methods can then be invoked on Certificate objects to manage andmonitor the execution of the companion. The CompanionThreads of a created companion are allinitially suspended. To start the execution of the CompanionThreads, the start () methodmust be invoked on the Certificate returned as part of the companion creation. This can be doneas soon as the companion is created, as is shown in the second scenario, or the resulting Certificate can be saved and the CompanionThreads started later, as is shown in the final part of theexample. It should be noted that even if the methods within a companion are targeted at theCHAPTER 4: Support for Parallel and Distributed Computation in Raven 77same object, the methods still run in parallel. However, the target object may impose its ownordering on the method execution through locking or the delayed result facility.As indicated earlier, each Certficate contains status information and responds to methodsthat can provide information about the associated companion. A tag can also be associated witheach companion. The tag is more than an identifier for a companion, since it can be used toassociate any type of user data with a companion. It is easier to classify and associate data witha Certificate through the tag at invocation time than it is to build special data structures toreverse-engineer the information when the method completes.The application-level operations supported on a Certificate are:• setTag (value: cap) . This function sets the tag value.• getTag () : cap. This function returns the current tag value.• start () : Certificate. Causes the CompanionThreads associated withthis companion to begin execution.• wait 0. 
Causes the current thread of execution to block until the companioncompletes. A companion completes when all the CompanionThreads associatedwith the companion have completed execution.• status. This accesses Certificate’s public variable, status, and can be used todetermine the current status of a companion. A companion can be: idle and waiting for execution of the CompanionThreads to begin, running, or finished.Given a Certificate, it is possible to get and set the tag value, get the execution status ofthe companion, or wait for the companion to complete. A companion completes when all themethods forming part of the companion have completed. It is important to know when a companion is finished since only then can a user be assured that all the results returned by the methods in the companion are available.To further extend the usefulness of companions, and their associated Certificates, Certificates can be collected together in the class CertificateGroup. An instance of class CertficateGroup acts as a grouping or collection mechanism and is similar to a set. An important operationCHAPTER 4: Support for Parallel and Distributed Computation in Raven 78that can be performed on a CertificateGroup is iteration. Iteration on CertificateGroup objects isnot like conventional iteration on a set. When iterating, each time a companion forming part ofthe CertificateGroup completes, control is returned to the iterator.To use CertificateGroup an instance of CertficateGroup is created. Certificates from created companions are then registered into the CertificateGroup. Before an iteration is performeds e tWait () is invoked to setup the CertficateGroup for iteration. This initialization call addsflexibility to the use of CertificateGroups by making it possible to iterate over a CertificateGroupmultiple times. After the initializer has been invoked, the method waitForNextCertificate () can be used to collect individual certificates. The semantics of the iterator are:• Unless an explicit break is done out of the iterator loop, each certificate isselected exactly once.• A certificate can be selected only if the companion it is associated with has completed. When there is no completed certificate, the invocation of waitForNextCerti ficate () will block, waiting for a companion to complete.• The waitForNextCertificate () method will return immediately if anycompanions in the CertificateGroup being waited on are finished.• When no more Certificates can be returned, a reference to the predefined object,nil, is returned.In summary, the methods supported on instances of CertificateGroups are:• add(new_certificate: Certificate). AddthespecifiedCertificatetothe CertificateGroup.• delete (cert: Certificate). Delete the specified Certificate from theCertificateGroup.• waitForNextCertificate () : Certificate. Wait for the next companion in the CertificateGroup to complete and return the associated Certificate.• s e tWait 0. Set the CertificateGroup state such that, for iteration purposes, noCertificates have been returned yet.To develop a better understanding of how Certificates, CertificateGroups and companionswork, consider an application based on data collection. The example consists of a collection ofCHAPTER 4: Support for Parallel and Distributed Computation in Raven 79observation stations on the west coast of Canada. When an event occurs, the central site is notified, and it records the information and displays the time of the event on the screen of the monitoring station. 
To keep the example simple, assume that there are only three monitoringstations. Figure 4.4 shows the Raven code and usage of Certificates and CertzficateGroupscert_group : CertificateGroup;station 1 _cert, station2_cert, station3_cert, res: Certificate;cert_group = CertificateGroup.new;stationi _cert = !{ stationi .waitForEvent() }! .startO; cert_group.add(stationl _cert);station2_cert= !{ station2.waitForEvent() }! .startO; cert_group.add(station2_cert);station3_cert = !{ station3.waitForEvent() }! .startO; cert_group.add(station3_cert);cert_group.setWaitO;while((res = cert_group.WaitForNextCertificateO) nil) {if (res == stationi_cert) stationl .displayTimeO;if (res == station2_cert) station2.displayTimeO;if (res == station3_cert) station3.displayTimeO;}FIGURE 4.4 Example use of CertificateGroups.needed to accomplish this task. In this example, after the initial declarations and allocation ofvariables, three companions are created, started, and added to a CertificateGroup. A call is thenmade to setWait () to prepare for iteration. Iteration over the CertificateGroup is accomplished with calls to waitForNext 0. The iteration terminates when nil is returned.Within the body of the while loop, the returned Certificate is checked against the recorded Certificates so that the appropriate display can be updated. The proposed solution has the undesirable property that as more stations are added it does not scale well. This problem can beovercome by tagging Certificates. The use of tags is shown in Figure 4.5. In this new solutionthe ability to tag a Certificate has been used to simplify the body of the while loop and to makeit easier to scale the solution. In Figure 4.5, the Certificate for station 1 is assigned its tagCHAPTER 4: Support for Parallel and Distributed Computation in Raven 80cert_group : CertificateGroup;station_cert, res: Certificate;cert_group = CertificateGroup.newO;station_cert = !{ station 1 .waitForEvent() }!; station_cert.setTag(stationl);cert_group.add(station_cert.startO);station_cert = !{ station2.waitForEventO }! {station2}; cert_g roup.add(station_cert.start);station_cert = !{ station2.waitForEvent() }! {station3}; cert_g roup.add(station_cert.start);cert_group.setWait;while((res = cert_group.WaitForNextCertificate) != nil) {res.getTag.dispIayTime;}FIGURE 4.5 Use of CertificateGroups and tags.through the setTag () method. The tag value is the object ID of the station. The other Certficates acquire their tag by using the delimiters, { and }, to immediately tag the Certificate withthe station ID. Using the tag value makes it a straightforward matter to update the time on adisplay by retrieving the stored tag and invoking the appropriate method on the tag.Together, early result, delayed result, and companions and certificates provide a set oftools for generating, controlling and monitoring parallelism. These facilities, however, introduce minimal changes into the Raven language and isolate parallelism to the method interface.One shortcoming of companions and Certificates is their inability to impose some serial ordering on groups of requests that can be run in parallel. To address this problem a new class,InvokeStream, is introduced into Raven.4.3.2 InvokeStreamsIn some circumstances, an application that has both a parallel and a serial component is forcedto execute serially because there is no way to specify the serial relationship while allowing parallelism. The example code of Figure 4.6 illustrates such a situation. 
It consists of a series ofCHAPTER 4: Support for Parallel and Distributed Computation in Raven 81workstation. positionCursor(cursor_positionl);workstation.displayData(datal, amounti;workstation. positionCursor(cursor_position2);workstation.displayData(data2, amount2);FIGURE 4.6 Some potential parallel invocations with serial componentsrequests that update a workstation display in a particular order. More specifically, the pos -tionCursor () method is first used to position the cursor and then the data is written out.The positioning and displaying operations exhibit a tight coupling. For the display to look correct, the operations must be performed in the order they are coded. Although the positioningand updating operations cannot be allowed to proceed in parallel with one another, there is noreason why they cannot proceed in parallel with the main thread of control initiating therequests.Restrictions on the ability to use parallelism in this configuration or across companionsresult because Raven makes no statement about the execution order of methods in a companion. Two parallel invocations targeted at the same object are not guaranteed to execute in anyparticular order. If the target object is remote, the vagarities of network connections can easilyresult in out-of-order execution relative to the coded order. In a parallel machine, the actualscheduling order and competition for local resources could also result in out-of-order execution. In the initial Raven implementation, if order was important the application had to performthe operations sequentially and pay the penalty in the form of decreased parallelism. Some parallelism could be gained by encapsulating the serial component of the requests into a self-contamed method and then executing it. Although this would work for a single block of requests,it does not work for multiple blocks of requests executed at various times. To allow an application to extract some parallelism from these types of invocations, while maintaining therequired serial characteristics of the invocation requests, a new class, InvokeStream, is introduced.CHAPTER 4: Support for Parallel and Distributed Computation in Raven 82The functionality and behavior of an InvokeStream can be characterized by comparing aparallel invocation to a packet in a reliable datagram service. In such a system the requestseventually arrive at the destination object, but nothing can be said about the order. Continuingwith this communications analogy, what we would like to do is establish the invocation equivalent of a TCP/IP stream to an object. The effect would be that the invocation requests wouldbehave strictly sequentially and arrive for execution at the destination object in the order specified by the requesting site, as if they were packets forming a stream. Such a system allows forparallelism between the initiating thread and the invocation requests, while maintaining theinter-invocation sequential constraint.There are several ways that modifications could have been made to the Raven environment to support InvokeStreams. One approach would be to introduce additional syntax into thebase Raven language to differentiate between companions that operate in the way describedearlier, and companions that operate with a sequential relationship. This would result in Ravensupporting two almost identical language level constructs. 
To be useful, the constructs to support sequencing would have to support multiple invoke streams, for if only one central invocation stream existed, there would be the potential to introduce a sequential relationship betweeninvocations where none exists or was intended. Adding support for multiple streams wouldrequire additional changes to the Raven syntax and base system to provide a mechanism to create and manage the multiple alternate invocation streams.The approach of modifying the compiler to support invocation streams is extreme, andlacks flexibility, so alternate techniques were explored. Since Raven is an object-oriented system, the decision was made to explore ways to make additions to the Raven class system tosupport invocation streams. By introducing a new class to handle this problem we were able toleverage off existing Raven classes and provide a flexible and extendable mechanism toexplore and experiment with invocation streams.Ultimately it was decided that the desired level of support for invocation streams couldbe achieved by adding the new class InvokeStream. This class has a single application-visibleCHAPTER 4: Support for Parallel and Distributed Computation in Raven 83method. The method is push,which takes as its sole argument the Certificate of the companion to be streamed. When a Certificate is pushed, sequencing information is associated witheach CompanionThread forming part of the pushed companion. Within a companion the methods are sequenced based on the their syntactic order and the target object of the invocation.Requests targeted for the same object are sequenced relative to each other but not to requeststargeted at different objects. Once the sequencing information is attached, the CompanionThreadis started with the Raven system enforcing the sequencing (see Section 5.4.6).An example of InvokeStream usage is shown in Figure 4.7. In this example thevar stream: InvokeSteam; /* Declare an Invocation stream *1stream = lnvokeStream.newO; /* Create the actual stream *1bar = !{ workstation.positionCursor(cursor_positionl)}!; 1* CompanionThread *1stream.push(bar); /* Pass the Companion to the invoke stream *1stream.push(l{ workstation.displayData(datal, amounti) }!);FIGURE 4.7 Example stream creation and use.InvokeStream object is first created and a reference to the stream obtained. This causes a CompanionThread to be created and associated with the method to position a cursor on the workstation display. Next, the Certificate generated by this operation is pushed into the InvokeStream.The second push to stream pushes the Certificate for the companion responsible for displayingthe actual data into the InvokeStream. This second push illustrates that there is no need for theapplication to actually acquire the Certificate identifying a companion and that the Certificatecan be pushed directly into the stream. The requests to position the cursor and then displaysome data on the screen will be sequenced and both the requests will be run in parallel with theinitiating thread.CHAPTER 4: Support for Parallel and Distributed Computation in Raven 84A side effect of implementing InvokeStreams as a class results because an instance of anInvokeStream is itself an object. This means that it has all the properties of an object and can beused like any object. In particular, a reference to an instance of InvokeStream can be passed,through the normal Raven mechanisms, to another Raven object in potentially another threadand on a different processor. 
More concretely this means that one object can push some invocations into a stream and then pass the stream to another object. That object can then push itsown invocation requests into the stream. The result is that InvokeStream objects allow requestsemanating from different threads of control, and potentially different objects and separate processors, to be streamed and ordered regardless of where the invoking object resides.The example in Figure 4.7 could still make use of streams even if other threads of execution were involved. For example, the main thread would create the stream and then push thepositioning request into the stream. It might then do an invoke on another object, passing thestream ID and the workstation object II) as parameters. The target object can then perform acomputation and push the display request into the supplied stream, and the display request willbe sequenced.Since the stream can be passed, multiple objects and potentially multiple threads can havereferences to the stream at once. Consequently, two push () operations can be started simultaneously. Under these circumstances the push () operations are processed in strict time orderas received at the InvokeStream object. In this implementation the result is that the InvokeStreamwill ensure that requests coming from the same thread of control are sequenced for a particularobject, but these requests may be intermingled with invocation requests from other threads. Ifsome sort of additional sequencing is required between threads using the same InvokeStream,then the onus is on the application to perform the coordination.Using InvokeStreams in a role similar to an advisory file lock in other systems makes itpossible to provide a mutual exclusion service to an object. Such a facility is useful when multiple threads of control are trying to access an object that was not designed to support concurrent access. To use a programming convention to establish mutual exclusion, an InvokeStreamCHAPTER 4: Support for Parallel and Distributed Computation in Raven 85would be created and then all requests to an object would be issued by pushing a companioncontaining the request(s) into the stream. Multiple references to the stream can be passedaround, and the stream will ensure that the invocations targeted at the object will be executedserially regardless of what thread of execution pushes the companion into the stream. Eventhough the requests will be executed serially, the underlying system achieves some parallelismat the protocol level by allowing the transmission of requests to occur in parallel.InvokeStreams are not one-way operations that sequence only requests to objects; they alsosequence the return results. The individual methods within a companion can return results.When a companion is pushed into an InvokeStream the return results are sequenced. The programmer can be assured that if the result of a parallel execution request targeted at a particularobject has returned, so too have all the other sequenced requests targeted at the object that usedthe same stream.InvokeStreams are an ideal platform that can be used as the basis for implementing otherinvocation sequencing operations. For example, InvokeStream could be extended to provideordering amongst different target objects. Further functionality could be introduced by providing a lock method to guarantee absolute access to an InvokeStream when more than one reference to it exists. 
This latter service might be particularly useful when the stream is heavily usedand the thread requires exclusive access to the InvokeStream. This additional functionality canbe provided by creating new subclasses that inherit from InvokeStream or by adding methods tothe existing InvokeStream class. Flexibility like this would have been hard to achieve if supportfor InvokeStreams had been relegated to the compiler.4.4 PropertiesThe exercise of writing a program has several distinct aspects to it. The programmerdecides upon a set of algorithms to use in solving the problem and how they need to be integrated. Once the basic design decisions have been made they are implemented by selecting anddefining the data structures that will be used and the functions and procedures to operate on thatCHAPTER 4: Support for Parallel and Distributed Computation in Raven 86data. In the object-oriented domain, data structures and procedures translate into class definitions and their methods; however, defining classes is somewhat different from specifying datastructures since the programmer can often use existing class libraries or develop classes thatinherit from the predefined classes. The ability to use classes and subclasses allows programmers to benefit from the work of other programmers.In an object-oriented system, when the conventional notion of system services, like support for parallelism or atomic transactions, is moved into the language, some accommodationis required to make these features easier to use. To illustrate this point, consider an object-oriented language that supports parallelism and provides the class List as part of a class library.There are two distinct environments that the class List could be operating in--the sequentialenvironment, and the parallel environment. These two environments are different and place different constraints on objects and how they are used. In particular, in the concurrent environments, instances of List are expected to handle concurrent accesses in a reasonable manner.One approach to managing the concurrency problem is to write the List class for the worstcase concurrent access scenario. In doing this, system services would be used to restrict concurrency and supply mutual exclusion for critical sections. This, however, introduces unneededoverhead when the object is used in a sequential environment where concurrency control features are not required. Additionally, an object can be used in different ways and in differentenvironments than those anticipated by the designer. To fully support reuse, the programmermust anticipate all the ways that an object of the class will be used, and program the classaccordingly. To illustrate the difficulties this presents, suppose that in addition to supportingconcurrency control, the system also supplied support for transactions. With the model beingdescribed, the programmer would be required to program the object for usage in a concurrentenvironment, an environment that supports transactions, an environment that supports bothtransactions and concurrent access, and an environment that uses neither. Additionally, somemechanism would also be required to indicate which level of support to use. 
The result of thisCHAPTER 4: Support for Parallel and Distributed Computation in Raven 87is that as the number of system services available to the application increases so does the complexity of the programming task.Requiring the programmer to develop a class that can be used in all the different environments does not make programming a class easier. Code reuse is reduced since the class, if fullyprogrammed, introduces unneeded overhead and complexity. The increased complexity alsomakes it more difficult to ascertain whether or not a particular class will operate correctly in aselected environment.Another way to view the problem associated with providing system services to objects isto observe that services like concurrency control can be managed by the system instead of theprogrammer. To accomplish this, the information that a particular object is going to be usedconcurrently must be propagated to the system, so that the appropriate actions and controls canbe placed on the object.A way to incrementally build new classes is by starting with a basic implementation andthen using class inheritance through subclasses to add the required features [93]. If this weredone with the List class, the system might consist of a basic List class designed to operate in asequential environment with another class designed to provide concurrency control, ConcurrentControl. Using the class inheritance approach, if List were required to operate in a concurrent environment, a new class ConcurrentList, which inherits from List, would be created. Theimplementation of ConcurrentList would deal with all the issues associated with concurrentaccess to a list by making invocations on an instance of the concurrency control object. However, as other operating system services are supported they would require their own specialclasses, and as the number of available operating system services increases the permutationson how the services can be combined increases rapidly and a class explosion results. Multipleinheritance would not solve the class explosion problems. Specific classes still have to bedeclared for each of the various ways the system services can be combined. With the list example, a ConcurrentList class would still have to be declared. It would inherit from List and Con-CHAPTER 4: Support for Parallel and Distributed Computation in Raven 88currentControl. This would make the implementation easier than in systems with singleinheritance, but it does not solve the class explosion problem.Another way to address the class explosion problem is to use reflection [61]. In a reflective system an object is essentially split into two components. There is the object the user seesand an associated meta-object that governs the underlying, or system level, semantics of theobject. The meta-object is an object in its own right. Reflective systems have particular problems associated with efficient implementations [108] [109] since all the meta-level behavior ofan object is subject to change. These changes can take place at any time during an object’s execution. The result is that any execution of an object requires constant checking to determine ifany of the associated meta-objects have changed. There are also problems relating to determining the scope of any changes to meta-objects. For example, should the specified change applyto just one object, all objects in the system, or simply to objects of a given type? 
Another important consideration about reflection is the side affects that result when a meta-object is changed.If care is not taken changes to meta-objects could affect the behavior of other meta-objects,thereby making it difficult to reason about the behavior and relationships between objects.Class hierarchical systems and reflective systems do not adequately address the class explosionproblem; however, Raven does address the problem through the introduction of properties.Properties in Raven are designed to curb the class explosion and provide a mechanism toallow programmers to extend a class’s characteristics on an object-by-object basis. The properties added to an object do not affect the outward interface of the object and the way the objectis used. A property is a way of changing the runtime support for a particular object from thedefault support for that class. In essence, a property is a mechanism used to inform the systemabout the basic services that an object requires.To illustrate this point, consider the programming of the classes List and ConcurrentList.Except for the case when there is concurrent execution, these classes are identical. They havethe same method interface and method description, and the same expected effect during execution. The only difference is that when operating in a concurrent environment the concurrentCHAPTER 4: Support for Parallel and Distributed Computation in Raven 89version should protect the object instance data from concurrent accesses that could be harmful.By associating a property to an object, and having the property managed by the system, it ispossible to program one List class, yet have it operate correctly in both sequential and concurrent environments. This works because once the system knows that an object requires a certainservice, that service can be provided for the object automatically at method invocation time.In Raven, each object is assigned a set of properties that define how the object is managedby the system. A property can be specified as part of a class definition. When a property is specified as part of the class definition, the property is always associated with instances of that classand the property cannot be overridden by either attaching the property to a subclass or explicitly at object creation time. This feature is used in the case when the class has certain expectations about how it is going to be used. For example, a class may be marked as controlledbecause it is only meaningful to use the class in a multi-threaded environment. If the class defimtion of an object does not specify a particular property it can be assigned at object creationtime.A second way for an object to acquire a property is to assign the property at object creation time. An object is normally created by invoking new () on the desired class object. Whenan object is to be created with additional properties, pnew () method is invoked instead. Thefirst parameter of pnew () specifies the set of additional properties the created object is tohave. The remainder of the parameters are exactly the same as new ( ) . This feature means thatall instances of a class are not required to have the same set of properties.In deciding upon the original set of properties, the goal was to specify a set of mutuallyorthogonal properties that could provide a number of operating system services to the application. The initial set of properties dealt with object storage, concurrency control, object security,object recovery, and distribution. 
Based on this initial set of properties, others working on the Raven system explored the possibility of supporting such services as atomic transactions through properties and the dynamic runtime relationship between objects [41]. The resulting set of properties is:

• Controlled. This property protects an object against concurrent accesses by multiple threads which may modify its instance data. Multiple readers, but only a single writer, are supported. Threads that cannot be granted the access they require are blocked until the request can be satisfied. The Raven compiler tags each method as either a read or write method based on an analysis of the statements within the method.

• Immobile. Objects which implement machine-specific tasks, such as device drivers, or those which need to remain on the machine where they are started, are given the property immobile. This means the object remains on the machine where it was started and is not eligible for migration.

• Recoverable. The recoverable property is given to those objects that only want the changes made to an object's instance data to be maintained if the method terminates normally. If the method terminates abnormally, through the use of Raven's abort statement, the instance variables are reset to the values they had before execution of the method was started.

• Persistent. Normally objects only exist in RAM. An object with the persistent property has a copy of that object maintained on disk. Since persistent objects are only written to disk periodically, there are no assurances that the disk copy of an object will match the RAM copy at the time of a system failure, but the stored copy will be a complete, consistent copy of the object.

• Durable. The durable property is similar to the persistent property except that control is returned to the invoking object only after the object's disk copy has been updated.

• Replicated. Under certain conditions, for example as a means of improving efficiency, it may be desirable for the system to maintain more than one copy of an object. Objects given the replicated property are candidates for duplication and distribution to multiple machines. Replicated objects are only weakly consistent and the individual replicas may not always have the same state.

• Immutable. Objects identified as immutable cannot have their instance data changed. Attempts to invoke a method that modifies an immutable object's instance data will result in a runtime error.

Each of these properties is supported by the system independently of the others. Any combination of these properties can be selected and assigned to an object of any class. Note, however, that properties are always stated in the affirmative. Unless an object is given a specific property it does not have it, and a property once given cannot be taken away. For example, if a class is specified as having the controlled property, then its instances will always have the controlled property. There is no way that a subclass can revoke the controlled property, and similarly there is no way for pnew() to take the property away. This approach was taken based on the assumption that if the designer of a class assigned a set of properties to that class, then those properties were required for the class's correct operation. It should be noted that there is no requirement for objects to have properties assigned.
Objects without properties are said to be BASIC objects. BASIC objects are managed in the following way:

• The object is maintained only in RAM and therefore does not survive across system failures.

• The object is not controlled. This means that multiple threads of control can access an object simultaneously and that no concurrency control is exercised.

• An object is eligible for migration.

• All changes to an object's instance data happen immediately and the previous state of the object cannot be recovered.

For the majority of programming problems these basic characteristics are acceptable and there is no need to augment objects with additional properties. Several examples of creating objects with and without additional properties are shown in Figure 4.8.

    list = List.new();
    list = List.pnew(CONTROLLED);
    list = List.pnew(DURABLE);
    list = List.pnew(DURABLE&CONTROLLED&RECOVERABLE);

FIGURE 4.8 Creating objects with different properties.

In the example, an instance of List is first created with the basic properties. The next line shows the creation of a List with the controlled property, because the object will be accessed concurrently. In the third line a durable List is created, since a copy of the object needs to be stored across runs of the application. The final line demonstrates how an object with the durable, controlled, and recoverable properties can be created. In all these cases, the List class is the same and the user of the resulting instance will manipulate the list in the same way. No new methods result when objects are created with properties. The objects differ only in the types of services the system automatically supplies for each object.

Since most of the objects in a Raven system are basic objects, the system is designed to deal with this case as efficiently as possible. This is accomplished by associating different method lookup and dispatch routines with objects (see Section 3.2.4). Objects with the basic properties are associated with dispatch routines that take advantage of the knowledge that the object is basic. Objects with properties are assigned different dispatch routines. These dispatch routines support properties by executing pre- and post-invocation functions to supply the required service. Having different dispatch routines associated with objects means that only objects with properties pay the extra execution overhead of dealing with the properties.

4.5 Object Concurrency Control

With the Raven facilities described in the previous sections, it becomes readily apparent that a single object can have multiple method invocations targeted at it simultaneously, and that properties provide the mechanism to tell the system to supply concurrency control for an object. The method invocations that result in concurrent access can be derived from other threads of execution on the same machine or from threads of execution on other machines. In dealing with the issues of concurrency control, any solution must consider the following points:

• How can concurrency be maximized?

• How are calls across machine boundaries handled?

• What happens when a controlled object invokes on itself?

• How are deadlocks dealt with?

Figure 4.9 shows two objects concurrently accessing a shared object. In this example both objects A and B are executing in parallel and are performing method invocations on a third object, C.

FIGURE 4.9 Concurrent object access
Depending upon the method invocations and the nature of the object, there are a variety of ways in which the access to the object could be handled.

The simplest solution to avoiding the possibility of object A or B seeing object C in an inconsistent state is for the object to permit only one method to be active within it at a time. This is the approach taken in POOL-T [12][11] and ACTRA [103], two other object-oriented systems which support concurrency. Serializing requests through the object keeps the instance data consistent, but at the price of eliminating concurrency through the object. By forcing an object to be accessed serially, even when it is not required, a needless bottleneck could be introduced into the system. Concurrency through the object can be increased by permitting multiple readers of object instance data, but only a single writer, to be active in an object. This is the approach taken in Raven.

In Raven, methods which only read an object's instance data are classified as read methods, and those which modify an object's instance data are write methods. At compile time each method is classified as a read or write method based on whether or not it modifies the object's instance data. This information is then used by the compiler to generate code to acquire the appropriate lock at method invocation time and to release the lock when the method completes. By adopting this approach, Raven ensures that each method acquires and releases its locks at the appropriate time, thus relieving the programmer of performing this task. This simplifies the task of converting sequential programs to ones that may be used in a parallel or distributed environment.

4.5.1 Locking Costs

Although the relatively conservative locking scheme described above does help ensure the integrity of an object's instance data, it does impose a cost. For the ease of use of Raven's locking facilities the programmer pays an execution-time cost, both in the amount of time an object can have multiple methods running in it and in the extra code that must be run to acquire a lock.

In the current implementation of the Raven runtime system, the association of a lock with an object results in a significant overhead in the acquisition and freeing of locks. Most of the cost can be attributed to the effort required to ascertain whether or not the lock can actually be granted and to recording the fact that a lock is granted or must be waited for. When the lock is released, this information must also be recorded and a check done to see if any method invocations waiting for the lock can now be started.

Since locking substantially increases the fixed overhead of performing a method invocation, objects should use locks only when they are needed. For example, applications which are only single-threaded do not need to lock their objects. The majority of the objects in a Raven environment do not require locks. An object has concurrency control applied to it by associating the controlled property with the object. This is either done at class definition time or with pnew() when the object is created. Making an object controlled causes the Raven runtime system to automatically generate a lock to control the access to an object's instance data.
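As an illustration of how a controlled object might be created and then used from several threads at once, the fragment below combines the pnew() call of Figure 4.8 with companions started as in Figure 4.13. It is a hedged sketch only: the append() and length() methods are hypothetical stand-ins for any write and read methods of the List class.

    shared = List.pnew(CONTROLLED);        /* the system attaches a lock to this instance           */
    !{ shared.append(item1) }!.start();    /* write method: executes under the object's write lock  */
    !{ shared.append(item2) }!.start();    /* serialized with the other write method                */
    count = shared.length();               /* read method: may overlap with other readers           */

Because the lock is acquired and released by the compiler-generated code, the List class itself contains no concurrency control statements.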
When a method is invoked on an object, the object's associated lock is acquired, and upon method termination the lock is released.

An uncontrolled object has no lock associated with it. The expectation is that either concurrent access to the object is not an issue or that the programmer is going to use some other mechanism to maintain the required level of consistency for the object's instance data.

4.5.2 Locking Duration

A cost of using a lock can be characterized as any activity of the lock management system that increases the execution time. As the previous section described, there is some fixed system administrative overhead associated with using locks. Another cost, the length of time an object lock is held, greatly affects the amount of concurrency that can be extracted from that object. If other threads of control need access to a locked object, their execution is delayed. Additionally, methods hold their locks for the duration of their execution. It is likely that the locks are held longer than actually needed, and to maximize the concurrent accesses to an object, the amount of time locks are held should be minimized. The automatic locking of Raven minimizes the programming effort required to keep the instance data consistent at the expense of object concurrency.

One way to increase concurrency is to use an uncontrolled object and to have the programmer of the class manage the concurrency manually, using, for example, semaphores. This provides the opportunity for the class programmer to minimize the exclusive-use regions in the class. Although such a solution is always available to the programmer, it does not prevent the user of an instance of the class from creating a controlled object. If this happens, the running of the application is slowed because concurrency control is applied twice. Furthermore, there is the increased potential that the two concurrency control methods will conflict and cause a deadlock. To avoid this problem, the facilities for tailoring concurrency control must communicate with the lock managing subsystem. Two approaches are used to do this. One is to allow the class programmer to specify the exact lock type a method is to use, and the other is to provide locking statements to control the locking from within a method.

In some circumstances, the classification of a method as a read method or a write method may not be the best choice, or the method itself may not need locking. To override the compiler's classification of a method, the class programmer may tag a method as no lock, meaning that the method does not need a lock, or as write lock to indicate that the method should acquire a write lock. With these two method modifiers, the default system locking can be overridden.

All the locking control discussed to this point has been applied for the duration of a method. If finer locking control is desired, then the ability to manage locks within a method is required. In essence what is desired is the ability to manage the locking within a region. Accordingly, the statements lock and unlock, combined with their read and write modifiers, are available to allow the programmer to specify a locking region within a method. This affords the programmer the opportunity to exercise a fine degree of control over the locking in an effort to maximize parallelism within an object, as the sketch below illustrates.

In all cases, the lock and unlock statements work only on a controlled object. If the object is uncontrolled these statements have no effect.
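The following is a minimal sketch of what region locking inside the body of a single write method might look like, assuming the statements are written as ordinary semicolon-terminated Raven statements; the instance variables pending and total are hypothetical. The precise behavior of each statement is given in the list that follows.

    read lock;                 /* inspect the instance data under a shared lock          */
    snapshot = pending;
    write lock;                /* promote to exclusive access only for the actual update */
    total = total + snapshot;
    pending = 0;
    unlock;                    /* release before any remaining work that needs no lock   */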
The lock statements work in the following manner:

• read lock: If the object's lock is not held, then a read lock is acquired. If the lock is currently held as a write lock, then it is demoted to a read lock. A read lock will remain as a read lock.

• write lock: If the object's lock is not currently held, then a write lock is acquired. In the case when the currently held lock is a read lock, this statement results in the promotion of the read lock to a write lock.

• unlock: If a lock has been acquired by the programmer, then it is released. An implicit unlock of user-acquired locks is performed when a method completes.

These statements, combined with the ability to specify the locks on a method-by-method basis, provide the programmer with a set of tools that can be used to develop a large range of locking strategies. The programmer is free to manually manage the locking in one method while still letting the system manage the locking in the other methods. However, the implicit locking provided by the system is probably sufficient for the majority of applications. This locking flexibility is possible only because the lock management system is treated as a complete unit and integrated into the Raven system.

4.5.3 Locking Problems

Locking of instance data needs to be more than demand-based, since a purely demand-based system does not take into account the various sequences of events that could lead to a lock being requested. For example, it is common for a method to invoke another method on itself. With a strictly demand-based locking scheme this will create problems, since the resulting lock request will be viewed as another request for a lock. Suppose that the initial method was a write method and the self invocation was a read method: by the locking rules outlined previously, the self invocation would block, waiting for the write method to complete before gaining access to the object, and, since the write method isn't going to release the lock until it is completed, a deadlock occurs. The problem is also exacerbated by the fact that the second invocation on the object may actually come through an intermediate object, potentially resident on another machine. These types of invocations are referred to as recursive invocations, and Figure 4.10 shows such a legitimate Raven execution. In this example execution starts in object A, which then invokes on object B. Object B, which may be on a different machine, then invokes back on A. Assuming that the initial invocation on A was a write method, something must be done when B invokes back on A if a deadlock is to be prevented. One possibility is to require the programmer to explicitly relinquish the object's lock before invoking on B and then reacquire it when the method on B completes. This has two disadvantages: one is that the programmer must now deal explicitly with the lock management issues, and secondly the programmer must be aware of possible side effects or unexpected method invocations from other objects which could modify the instance data while executing in B.

FIGURE 4.10 Object invoking back on self
The programming burden could be relieved if the compiler automatically released and reacquired the locks, but the problem of dealing with unexpected changes in the instance data would remain.

Since multiple readers and only a single writer are supported in the Raven locking model, a problem with recursive object invocations exists only when a recursive invocation involves a write lock currently held or being requested. To permit this invocation sequence, Raven assigns a globally unique session ID to each thread of execution. The ID is used to track chains of invocations locally and across machine boundaries. An invocation chain is the Raven equivalent of the call stack in programming languages like C, with the addition that an invocation chain may also cross machine boundaries. The purpose of the session ID is to make it possible to determine when a method invocation attempts to acquire a lock on an object locked earlier in the invocation chain, even if that invocation chain crosses machine boundaries. In particular, a write lock will be granted if no other locks for the object are granted or if all the locks granted belong to the same session as the requestor. Similarly, a read lock will be granted with outstanding write locks on the object if all the locks belong to the same session as the read request.

Pictorially, Figure 4.11 shows a sequence of method invocations and the way that the locks are managed. In this figure, each column represents an object and the locks that the object has granted. Locks closest to the object are the most recently granted ones. When a method invocation is made, the method presents its session ID to the object. A check is then made in the list of granted locks to determine whether or not this lock can be granted. If the lock is granted, it is added to this object's list of granted locks; otherwise it is added to the list of requests blocked and waiting for a lock. In the example, Object 2 has blocked Session 2, since Session 1 currently holds a write lock on Object 2. Session 2, however, was able to obtain a read lock on Object 3, since that would not conflict with the read lock previously granted to Session 1.

FIGURE 4.11 Lock Sharing

In the scenario depicted in Figure 4.11, Session 1 starts by requesting a read lock on Object 1, which is granted. The session then requests read locks on Object 2 followed by Object 3, and both are granted. From Object 3, Session 1 requests a write lock on Object 2, and it is granted, since Session 1 is the only holder of locks on Object 2. If other sessions had read locks on Object 2, then Session 1 would have been blocked in its attempt to get the write lock until all other sessions had released their locks on Object 2. From Object 2, Session 1 makes a request for a read lock on Object 1 and it is granted. Session 1 makes a request for a write lock on Object 1, and it is granted, since no other sessions hold locks for the object.

It should also be noted that in this example the invocation requests, and resulting lock requests, span machine boundaries.
Using the session ID presented to an object at the time ofthe method request, it becomes a simple operation to check the object’s list of granted locks todetermine whether or not an already granted lock belongs to the session making the newrequest.The approach to locking described above provides a mechanism to deal with recursivemethod invocations and with the tracking of chains of invocations across machine boundaries.The locking scheme, however, needs further refinement to deal with the early and delayedresult primitives of Raven. Companions do not require any new support since they are new execution threads with their own unique session ID. This could be illustrated in Figure 4.11, whereit is conceivable that Session 2 is a new thread of execution resulting from the execution of acompanion in Session 1. Being a new thread, the session has no association with any previouslock acquisitions; therefore it presents no new challenges to the locking scheme.4.5.4 Locking and Early ResultControlled objects using early or delayed result present a different challenge to locking. Whena recursive invocation from a thread performs an early result, a problem occurs if some methodin the session chain has a lock on an object while the method doing the early result has a lockon the same object. When the early result is performed, a new thread of execution is started,CHAPTER 4: Support for Parallel and Distributed Computation in Raven 101and the new thread and the one it is derived from could now be simultaneously accessing thesame object, and they both could be expecting to have a certain level of access to an object (i.e.read or write access). These access expectations are potentially in conflict. For example, if theinvoking method is a write method, it will expect to be able to update the object with impunitywhen it gets control back, since it has a write lock. Similarly, the thread resulting from the earlyresult could expect the object instance data to be unchanging if it is a read method, or changeable if it is a write method. This would be in conflict with the access the invoker expects to havewhen it gets control back. Given these possible conflicts, an early result must reconcile thelocking desires of the locks held before the method invocation with those of the method doingthe early result.Fortunately, the semantics of an early result are such that the invoker of a method cannottell whether the method runs, returns a result, and continues executing, or runs and returns theresult upon method completion. (Implicit in this is the assumption that the method doing theearly result eventually returns and does not enter an infinite loop.) The Raven system exploitsthis property to reconcile the locking issues when an early result is done in a controlled object.As described previously, an early result could result in two threads of execution bothexpecting to modify the same instance data. To eliminate this possible conflict, a result isreturned and a new thread of execution created only if no other locks are held on the object bythe invocation chain, or if all the locks on the object are read locks. When a mixture of read andwrite locks are present on the object, no new thread of execution is created and the result of theearly result is returned when the method finishes executing. 
Although this eliminates the parallelism between invoker and invokee, it maintains the proper method invocation semantics.Raven has adopted the approach that the details associated with the management of parallelism should be as transparent as possible. As such, when an invocation is made, the invokershould not need to know anything about how the method is implemented. In using the abovetechnique for early result, it always appears to the invoker that the invoked method has comCHAPTER 4: Support for Parallel and Distributed Computation in Raven 102pleted, regardless of whether or not an early result was done and of the type of locks theinvoked method acquired.4.5.5 Locking and Delayed ResultDelayed results also present a challenge to Raven’s locking scheme. When the delayed resultis done the method terminates, and the instance data lock acquired to enter the method isreleased. The invoking method, however, remains blocked, holding locks and waiting foranother method to complete the delayed result. It is possible that the session could still holdlocks on the object the delayed result was done from. Delayed results are typically used withina single object. This means that the method that will ultimately complete the delayed result willneed to lock the object. If some other portion of the invocation chain holds a lock on the object,it is likely the lock will conflict with the lock required for completing the delayed result, resulting in deadlock. To avoid this problem, when a delayed result is done, all the locks in the invocation chain associated with the object are temporarily released. This permits invocations fromother sessions to acquire locks on the object, with the expectation that one of them will complete the delayed result. When the delayed result is completed, the suspended thread can berestarted, provided all the locks previously held on the object can be reacquired. If the lockscannot be granted the delayed thread will remain suspended until they are.4.5.6 Deadlock HandlingDeadlock detection, avoidance and prevention are harder to deal with in distributed systems than in single processor systems [1001, since resource usage information is scattered overmany processors. This scattering of data makes an already hard problem even harder to dealwith. The combined execution cost and network overhead associated with running deadlockdetection or avoidance algorithms is high. Additionally, if implemented, these schemes wouldadd to the complexity of the Raven system and impose extra overhead on all applications, eventhose which would not need the facilities to deal with deadlocks. Deadlock detection and avoid-CHAPTER 4: Support for Parallel and Distributed Computation in Raven 103ance are active research areas beyond the scope of this thesis work they present many interesting problems in their own right [1011 [211.It is, however, realized that dealing with deadlocks in a distributed systems is a seriousand difficult problem. The implementation of a general solution for dealing with deadlockswould seriously impact the performance of the Raven system. To deal with deadlocks, Ravenoffers a compromise by providing some tools to help avoid writing programs that have deadlocks in them and by providing a facility to specify how long an application is willing to waitto acquire a lock. 
Using this approach is not necessarily the best way to handle deadlocks, butit does acknowledge the problem and establish a framework so that Raven can be modified laterto use more sophisticated deadlock detection or avoidance algorithms.The tools Raven provides for dealing with deadlocks rely upon the facilities that make itpossible to override the default automatic locking provided by Raven. For example, one waythat a deadlock can occur is when method’s acquired is not strict enough. The ability to converta read lock to a write lock could fend off deadlock. Consider the scenario where a chain of multiple read type invocations on the same object end up doing a write invocation on this object.Since the initial invocations use only read locks, multiple simultaneous threads of executionthrough an object are possible. Ultimately, if these threads proceed through the object together,a deadlock will occur when the write invoke is done. By explicitly allowing a method to acquirea write lock, even if the current call chain doesn’t immediately need it, it is possible to fend offsome inter-thread deadlock scenarios. The other facilities to upgrade locks, release locks andto acquire a lock independent of the automatic locking provided by Raven provides the programmer with the ability to program around deadlock scenarios, although the burden of identifying such scenarios remains the programmer’s responsibility.Since Raven supports neither deadlock avoidance nor deadlock detection, it is a real possibility that an application using controlled objects could deadlock. From an intuitive perspective, a programmer would probably say that a system is deadlocked if the application has towait more than some specified time to obtain a resource (lock). How long to wait is application-CHAPTER 4: Support for Parallel and Distributed Computation in Raven 104specific. For some applications it may be a few seconds, while for others it may be acceptableto wait for several minutes, or even hours. Regardless of the time, the basic premise is that ifthe lock cannot be acquired in some specified time frame then it is probable that a deadlock hasoccurred.Raven, therefore, makes it possible for an application to specify how long it is willing towait for a lock. When the specified time has elapsed, without the acquisition of the lock, themethod lockAcquisitionTimeOut () is invoked by the system on the object attemptingto acquire the lock. The invoker then has the option of determining how the processing shouldproceed. The application’s action could include an attempt to invoke the method again, backingoff and releasing locks, or even aborting execution. Although this approach to dealing withdeadlocks is not ideal, it does provide some primitive support to allow the application to do itsown deadlock detection.One particular problem with this approach is the mechanism used to perform the time-outnotification. As it is, the object doing the lock acquisition is notified of the lock acquisition failure, and it is not clear whether or not this is the correct object to notify. Perhaps the notificationshould be delivered to the executing thread instead of the object. The proper way to deal withthis is to develop an exception-handling model and mechanism for Raven. Such work is beyondthe scope of this thesis.None of the facilities provided by Raven preclude programmers from implementing theirown deadlock avoidance or detection schemes. 
However, such a system would be operating atthe application level and as a result would not have access to the locking information maintained by the system. To avoid problems with the automatic locking provided the system, theuser would probably have to disable most of the automatic locking provided by Raven. Thelocking and deadlock handling facilities of Raven have not been optimized for performance.Therefore, it is expected that by examining the literature on these topics [21][87] substantialimprovements in the performance of the locking subsystem could be made.CHAPTER 4: Support for Parallel and Distributed Computation in Raven 1054.6 System Resource managementOne of the jobs of an operating system is to control access to and manage system resources. Ina parallel and distributed system, two of the resources that require special attention are memoryand processor usage.4.6.1 MemoryWhen an object is created in Raven there is an implicit allocation of memory made to hold theobject. Once the object is no longer required, this memory should be freed for reuse. Oneapproach to this problem is to equate the creation of an object to that of malloc () when programming in UNIX, and then using something akin to free () to release the memory. Theproblem, however, is how to determine when an object is no longer in use so that free () canbe called. In a system like Raven, that provides support for parallel and distributed computation, the problems are compounded further as there are now multiple threads of control thatcould be accessing an object, and it becomes difficult for the programmer to reason about whenan object can be freed. For example, when a method is invoked through a companion, wheredoes the responsibility for freeing the object lie? The invoker cannot free it because the invokeecould be still using it or could have passed it off to another object that is now using it. Likewise,the invokee cannot free the object because it does not know what the invoker did with theobject. To properly manage the freeing of an object, all the potential users of the object mustcoordinate their efforts, and this places an unneeded burden on the programmer. Two wellknown solutions to this problem are to use reference counting or to use garbage collection. Garbage collection was selected because it required the least amount of new system support. A discussion of garbage collection in Raven can be found in Section 5.2.4.6.2 Processor ManagementSequential programming languages and environments force the programmer to contortinherently parallel problem solutions into a sequential environment. However, making paralCHAPTER 4: Support for Parallel and Distributed Computation in Raven 106lelism easy to use also introduces problems not present in a sequential environment. One of theproblems concerns processor usage. In some circumstances the natural way to extract parallelism would result in unbridled parallelism far beyond the support capabilities of the hardware.To address this problem a throttling mechanism external to the application is introduced to theRaven system.During the early phases of Raven usage it soon became apparent that certain structureswithin the sequential environment could be easily parallelized, an example of which is shownin Figure 4.12. In this example, a method invocation inside a for loop is used to update andfor (line_no = 1; line_no < MAX_LINE; line_no++)workstation 1 .computeAndDisplayLine(line_no);FIGURE 4.12 Sequential code to update a display.compute a line on a workstation display. 
Suppose now that the function to compute and display a line takes a long time, that there is no need to wait for the updating of the display to complete, and that the computations to update and display a line are independent of one another. If these conditions are met then, as shown in Figure 4.13, the conversion of the code to a parallel implementation is a relatively straightforward and easy exercise.

    for (line_no = 1; line_no < MAX_LINE; line_no++)
        !{ workstation1.computeAndDisplayLine(line_no) }!.start();

FIGURE 4.13 Conversion of sequential code to parallel version.

The only problem with this scenario is that it does not take into account the number of processors that can work on a problem. In many respects that is the way it should be. The application programmer should be free to program in a manner that is, as much as possible, independent of the amount of parallelism that can be supported by the processing environment. Furthermore, in this example, which is a fairly common way of creating parallelism, making the programmer directly responsible for constraining the parallelism introduces its own problems. In particular the programmer would have to explicitly monitor the parallelism with respect to a maximum permissible amount. Having to monitor and control the parallelism with respect to this piece of code would greatly increase its complexity. Additionally, some mechanism would have to be introduced to convey to the application what the constraints on parallelism are to be. This would further increase the complexity of the application code.

The approach adopted in Raven is to consider the controlling of parallelism to be a system configuration issue. The Raven system itself, therefore, needs to have some mechanism to determine what the acceptable level of parallelism should be and to be able to restrict the parallelism to the required level. In essence, Raven needs to be externally tunable, so that various system resources and limits can be set at execution time when the exact environment is known, and not when the program is being written and little is known about the execution environment. Additionally, there could be multiple disparate environments that a single application is going to run in, and a different version of the application should not be required for each one. For example, initial testing and debugging of an application might be performed in one environment before final runs are made in another.

The solution that has been implemented in Raven consists of classifying the various forms of parallelism possible in a Raven application and then placing limits on them. To accomplish this, four categories of processes are recognized by the underlying Raven support system. These categories are:

• System processes (threads) that provide services to the application.

• User level processes (threads) that form the application.

• Companion threads spawned by a user process.

• Companion threads spawned by a system process.

The categorization of processes into execution classes is transparent to the programmer. A combination of compiler support and runtime system support automatically performs this task. Although it is conceptually possible to place restrictions on any one of these classes of processes, the initial experiences with Raven indicated that only companion threads spawned by user processes are likely to cause problems.
The number of system processes is relativelysmall and only under rare circumstances do they spawn threads. Likewise, there is usually onlyone user thread per Raven environment and that is the initial starting thread. If there is morethan one user thread, the threads tend to be operating in a client/server type relationship wheretheir number is relatively constant and is not a function of the inputs to the problem beingsolved. With these observations and experiences, it was decided that, at least initially, limits onthe number of execution threads would be applied only to companion threads spawned from auser thread.In the current implementation, the Raven system reads a value from its environment specifying the maximum number of user companion threads that can exist at once. Just what thevalue should be is a function of both the application and the execution environment. Amongother things, the execution environment includes the number of physical processors and theirorganization. As an example, in a single processor environment an application might restrictthe number of user companion threads to 1, while for the same problem on a 20-processorshared-memory machine it might be 20. Although this example has a one-to-one correspondence between the number of processors and the maximum number of user companion threads,it is not a requirement.Once the Raven system has determined the maximum number of user companion threads,it uses that value to control the creation of user companion threads. As long as the current number of companion threads stays below or equal to the maximum, everything proceeds normally.If, however, the creation of the thread results in the maximum number of companion threadsCHAPTER 4: Support for Parallel and Distributed Computation in Raven 109being exceeded, then the creation operation is suspended. As user companion threads terminate, the suspended creation operations are allowed to continue. As long as all the running companions do not depend, either directly or indirectly, upon any suspend or unstarted companionsfor their completion, progress in the computation will be made. Although this scheme is simple,to date it has been a successful approach for applications that derive their parallelism from constructs similar to the approach used in Figure 4.13.With the introduction of this sort of parallelism control into the system, there is a potentialfor deadlock that is not present in a sequential system. If one assumes that the application iscorrectly written, then if there are no restrictions on the number of companion threads the application will not deadlock. A restriction on the amount of parallelism, however, could result inthe application deadlocking.A delayed result is the only way an application thread can suspend its execution indefinitely. For a deadlock to occur because of the restriction on the number of companion threadspermitted to start, all the companion threads must be doing a delayed result or suspended tryingto create a companion thread. As long as at least one companion thread is executing there is thepotential for the delayed results to be completed, or for the executing thread to terminate. If thethread terminates then a new companion thread can be started, and it might be the one to complete the delayed results. 
Since the restriction on companion threads can only cause a deadlockunder these limited circumstances, Raven allows a new companion thread to be started if allthe companion threads are suspended doing a delayed result or suspended creating a new companion thread. This companion thread will either allow progress to be made by completingsome delayed results, or it will end up suspended too. It is possible that all subsequent companion threads will end up suspended until all Raven’s system resources are consumed. In this casethe application prematurely terminates. Prior to this, Raven would have printed warnings indicating that the specified degree of parallelism was being exceeded. It should be noted that thelimit is effectively removed when companion threads are created in the situation when all othercompanion threads are suspended and the limit on the number of companion threads has beenCHAPTER 4: Support for Parallel and Distributed Computation in Raven 110reached. Therefore, if the system deadlocks, it would also have deadlocked if no restrictionshad been placed on the number of companion threads that could be started.4.7 Object TransparencyFrom a programmer’s perspective, the difficulties associated with progranmiing in a sharedmemory environment are considerably less than those of programming in a distributed memoryenvironment. In a distributed memory environment the programmer must manage many of thetasks done by the memory management system of a shared memory environment. With sharedmemory the programmer does not have to be concerned about locating data or whether or notparticular functions are available at a remote site. In a distributed memory environment, a programmer needs to know how to locate remote data and how to access the functions to manipulate the data. In a distributed environment it is not the case that a simple subroutine call willaccess the desired information. The problems, however, do not end with locating the data or thefunctions. In a distributed environment the application has to address the problems related tothe differing data representations on different machines, communications failures, the passingof parameters, and the returning of results. Compared to calling a subroutine or accessing datathrough a pointer, these types of activities are much more complex. To simplify programmingin a distributed environment, as many of these impediments as possible should be eliminated,with the goal being transparency.Transparency is defined by Coulouris [34] as:The concealment of separation from the user and the application programmer, so that the system is perceived as a whole rather than as a collection of independent components.When trying to decide if system is transparent, there are multiple forms of transparencythat can be considered. Coulouris identifies eight categories of transparency:1. Access transparency. The procedure for accessing remote and local objects is thesame.CHAPTER 4: Support for Parallel and Distributed Computation in Raven ill2. Location transparency. There is no need to know the location of an object toaccess the object.3. Concurrency transparency. Multiple users can operate on shared data withoutinterfering with each other.4. Failure transparency. The failure of hardware or software is hidden from theuser.5. Performance transparency. The system can be reconfigured to adjust to varyingloads and if this is done can the user detect it?6. Replication transparency. 
6. Replication transparency. Multiple copies of objects are maintained to improve performance; this should be transparent to the users.

7. Migration transparency. Objects can be moved and relocated within the system without affecting the execution of applications.

8. Scaling transparency. The system and applications can change in size without the need to change the system structure and without affecting user applications.

Of these categories of transparency, the first five are the most applicable to Raven, since the remaining three issues are not addressed in the current Raven system. In general terms, if an environment supports transparency, then it should not be possible to determine whether a particular function is being performed remotely or locally. Transparency should extend to both the user and the provider of the service. Some of the issues [101] to consider are:

1. Does the programmer have to know which library procedures are local and which are remote? (This can be either the application programmer or the library programmer.)

2. Can procedures be written without regard to whether they will be executed by a local application or by a remote application?

3. Are there constructs that can be used for a local call but are not allowed for a remote interaction? (For example, can pointers or arrays be passed around freely?)

4. Do remote interactions require special constructs that are not used, or are optional, in the local case?

These tests are basically ones of appearance. If a distributed application looks different from the local application then, by implication, there is an acknowledgment that the local and remote cases are not transparent. However, other tests or events at runtime might also remove the cloak of transparency in a distributed system. At runtime, a distributed system may fail to provide performance and failure transparency, thereby revealing the fact that the system is distributed.

Performance transparency can be tested for by having an application probe and monitor its environment in an attempt to distinguish a remote interaction from a local one of the same kind. One test that can be used is the elapsed time for the request. Typically there will be some measurable difference between the elapsed times. Given the additional processing costs of preparing to execute a remote call, the delays and latencies associated with operating over a network, and the extra dispatching overhead at the server site, it is expected that on similarly configured machines these extra overheads would manifest themselves through longer execution times and greater variance in those times. Certainly the goal of performance transparency is a difficult, if not impossible, one to achieve. As a result, the issue of performance transparency is addressed only in a superficial fashion, based on whether the remote operations can be perceived by the end user. If remote requests form only a small part of an application, the fact that they are remote may not significantly affect the overall performance of the application. Raven takes no special measures to supply performance transparency.

Failure transparency is present in a system when system software or hardware components can fail but the application can still successfully complete its execution. Classic examples that illustrate the problem are network partitioning and the failure of the remote site.
In Raven, if a remote site becomes inaccessible, either through a machine or network failure, the system detects this and initiates error handling to print a message. The application is returned a value of nil to indicate a problem with the request; however, it does not know whether or not the request was executed. This behavior is something not expected in the local case and allows a remote access to be distinguished from a local one. The proper way to deal with failures of this type is to use an exception handling mechanism designed for the system. Raven does not have such a mechanism, and as a result the handling of errors in Raven tends to be ad hoc.

Before examining how Raven implements transparency at the programming level, consider how the well-known notion of remote procedure call (RPC) operates. With RPC the approach is to take the well-known and understood programming paradigm of the procedure call and extend it to the distributed environment. RPC's strength lies in its ability to provide a straightforward way for an application programmer to make use of well-defined remote services. Based on the criteria given previously, RPC is not overly transparent. The service provider, or library writer, is all too familiar with the fact that the service is going to be provided in a remote environment, as the various data structures to be used for parameter and result passing must be specified and external data representations for them provided. Stub routines for the various functions have to be provided for use by the client, and the server must be structured to handle the service requests directed at it. Care must also be taken to ensure that pointers are not used as parameters, or, if they are, that special processing is performed.

Raven uses global object identifiers to provide invocation transparency. In Raven each object has two identifiers associated with it. One is a global object identifier (GID) and the other is the object identifier (OID). The GID is a system-maintained and assigned identifier that uniquely identifies each object in the Raven object space. This identifier is never seen or used by the application programmer. Instead, the Raven programmer sees and uses only OIDs, which uniquely identify objects within a single Raven environment. An object ID is, therefore, only meaningful within a given Raven environment. It is through the OID and the mechanisms that it uses to map OIDs to GIDs that Raven achieves its location transparency. Figure 4.14 provides a pictorial representation of how OIDs and GIDs interact to provide location transparency. Within a Raven system all objects are referenced through an OID, and these object IDs are either part of the instance data of an object or the working variables within a method. An OID references either a real local Raven object or a proxy object [91]. A Raven proxy, which is similar to the proxies of SOS [92], is the local agent for an object resident in some other Raven environment. A proxy object knows the GID of the object it is representing. To the programmer, a proxy object appears identical to a local object. Proxy objects distinguish themselves from local objects by the way they respond to a method invocation.

FIGURE 4.14 Object interactions in a distributed Raven environment.

When a method is sent to a local object, that object looks up the method and runs it.
An invocation request on a proxy object results in the proxy object doing some checking to see if the method can be run locally, and if it can the method is run; otherwise the request is packaged up and sent to the machine where the real object resides. As part of the request packaging actions, all OIDs are converted into their GIDs. Only GIDs are passed between Raven environments. The GID for a remote object can be obtained directly from its proxy object, and GIDs for local objects are assigned on an as-needed basis. When the request arrives at the remote machine, the remote site accepts the request and maps each GID to a local OID. For objects resident on the target machine, the OID references the real object, and for remote objects it references the proxy. When a GID for a remote object is received, a proxy object may have to be created if a proxy for that object does not currently exist in the environment. After the conversion of all the GIDs to OIDs, the remote method can be invoked as if all the objects were local, thereby achieving location transparency for objects.
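The forwarding step can be summarized with a short sketch. The C fragment below is purely illustrative: the helper routines and names are invented here and merely stand in for the corresponding Raven runtime machinery.

    /* Illustrative proxy dispatch (invented names, not the Raven sources). */
    typedef unsigned long OID;
    typedef struct GlobalID GlobalID;         /* opaque here; see Section 5.1.2 */

    extern int       runs_locally(OID proxy, unsigned long method_id);
    extern OID       run_local_method(OID obj, unsigned long method_id,
                                      OID *args, int nargs);
    extern GlobalID *oid_to_gid(OID oid);     /* assigns a GID on first use     */
    extern OID       send_remote_request(GlobalID *target, unsigned long method_id,
                                         GlobalID *args[], int nargs);

    OID proxy_invoke(OID proxy, unsigned long method_id, OID *args, int nargs)
    {
        GlobalID *g_args[16];   /* assumes a small maximum parameter count */
        int i;

        /* Some requests can be satisfied by the proxy itself. */
        if (runs_locally(proxy, method_id))
            return run_local_method(proxy, method_id, args, nargs);

        /* Otherwise only GIDs cross the environment boundary; the receiving
           site maps each one back to a local OID, building proxies as needed. */
        for (i = 0; i < nargs; i++)
            g_args[i] = oid_to_gid(args[i]);
        return send_remote_request(oid_to_gid(proxy), method_id, g_args, nargs);
    }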
A concrete example of the way in which remote objects are handled in Raven is provided by Figure 4.14. In this figure there are two Raven environments, A and B. Environment A has a proxy to object A, which resides in environment B. Environment B has a proxy to object B, which resides in environment A. Initially some method has a reference to object A, but because object A is remote the reference identifies the local proxy for object A. In environment A, methods can be invoked on object A as if A were local. An attempt will be made by A's proxy to execute the method locally, and if it cannot be executed locally the request is forwarded to environment B, where A resides. As part of the forwarding, all OIDs are converted to their GIDs. Once in environment B, all GIDs are mapped to their local OIDs for that machine. In environment B that means, with respect to the servicing of this remote request, that object A can now be accessed using its OID. In this example object A has as part of its instance data a reference to object B, but because object B is remote to object A, object B is represented by a proxy object. Any references to B are treated in the same way as are references to A in environment A.

Providing a mechanism like GIDs and OIDs goes only part way to addressing the problem of location transparency. The other fundamental data structures available in the language are also a contributing factor. For example, if pointers were allowed then the Raven system would be unable to provide location transparency. Raven in fact does not have pointers: it conceptually has only objects; therefore a proxy can always be constructed. Some objects, like integers and floating point numbers, are basic machine-level data types. Since these objects are immutable (i.e., the integer object 5 will always be the object 5 and it cannot be changed to the object 17), they can be freely replicated and distributed, thereby providing location transparency for these basic types of objects. The object model itself also contributes to location transparency by providing a strict way to change instance data. Instance data can be accessed only through the invocation of a method; therefore, there is no way for an application to take an OID and attempt to modify or read the object's instance data directly. It is through the combination of OIDs, GIDs, a pure object model, and instance data access only through methods, that Raven attempts to provide location transparency.

Raven's success at providing location transparency can be evaluated by using the criteria established earlier in this section. The first criterion listed for location transparency addressed the issue of whether or not an application programmer needed to know which library procedures were local and which were remote. The equivalent question in Raven revolves around whether or not the application (client) programmer has to be aware of whether an invocation on a server object is local or remote. In Raven the client routines to execute for either a remote object or a local object are exactly the same, and no distinction is made.

The second criterion deals with whether or not the library (server) programmer must program differently if the routines are to be used in a local environment versus a remote one. In many respects this issue does not arise in Raven because all method invocations are local. If an invocation is targeted at a remote object, the information about the invocation is bundled up and sent to the machine where the object resides, effectively turning the remote request into a local one. The only issue for the server writer is whether or not any special actions need to be taken for parameters passed in that might reference remote objects. Again, the use of proxies, combined with the strict method interface to access instance data, means that the remote objects can be operated on as if they were local.

The third criterion concerns constructs that can be used for a local call but not for a remote call, with the prime example being pointers. In fact, there are no constructs in Raven that can be used for a local call but not for a remote call. However, the notion of constructs can be extended in the object-oriented environment to ask if the objects themselves place restrictions on their usage based on being local or remote. Care has been taken in the Raven environment to ensure that all special purpose or system type objects are location transparent. A good example of this is the location transparency that has been achieved with instances of the class Thread. A Thread object is the runnable entity associated with the thread of execution. It is the Raven analog to a process ID. Unlike process IDs, however, the OID of a Thread is location transparent. If a Thread's OID is passed to another Raven environment, an object in that environment can control the remote thread as if it were on the local machine.

The final criterion concerns whether or not there are any constructs or features required in the remote case that are not needed, or are optional, in the local case. It is on this point that Raven does not meet all the criteria for location transparency. In a distributed environment one of the problems concerns making the initial contact with the other machines and determining what sort of services these other sites are providing. For example, in Figure 4.14, environment A somehow had to acquire a reference to object A. In a local environment there is no need to establish the outside connection since everything is local. This problem, however, is not solely confined to the distributed processing environment but is more general in nature and arises any time information must be exchanged across protection boundaries.
These boundaries can be the result of such things as the separation of functionality between the system and the application (i.e. file systems), virtual memory protection schemes on a single processor, or protection via physical separation as provided in the distributed-memory case. There are a number of techniques available for passing information across a protection boundary, but a typical approach, especially in the distributed-memory case, is to create a name server at a well-known location (address). The name server then becomes a repository for named items in the system. Objects that want to be known to the rest of the world register themselves with the server, and objects that want to locate other objects query the server to get back information on how to make that contact. The effect of this in Raven is that in the distributed case some setup to register, retrieve or construct the initial contact OIDs is required.

The significance of requiring some mechanism to establish contacts between the protection domains is at least partly related to how often the application must perform special operations to cross these boundaries and how it is done. In a UNIX environment, information about the other domains can be obtained through keyboard or file input, command line arguments, calls to the operating system, or environment variables. Based on the information obtained through these various mechanisms, the application might make a system call to get access to a service or establish a network connection to a process on another machine. The approach consists of three major steps: first, some user-level information is used to name a service; then that name is converted into an access procedure for the desired service; and finally the desired service is accessed. Similar procedures are required in Raven, where first some Raven-environment-independent name is obtained to identify a service. This name is then converted to an OID, and finally method invocations are made on the OID to actually access the service or information managed by the object. Raven provides several functions that simplify these actions and allow a name server to be constructed. More details on the operation of the name server can be found in Section 5.4.5.

Based on the criteria established for location transparency, Raven has met three of the four criteria. The only one it has not fully met concerns the use of special calls to allow different Raven environments to discover information about the objects that each of them has. This discovery or registration-type requirement is not unique to Raven but exists in any system where information has to be exchanged across a protection boundary. Being object-oriented, Raven has the opportunity to consolidate the various approaches to dealing with cross-boundary communication in a single cohesive fashion. However, Raven does not perform as well when tests for transparency based on runtime information are considered. In fact, Raven makes no attempt to address issues associated with performance transparency, and failure transparency is also poorly supported. The current Raven environment expects to run on a local area network which is highly available. It is expected that as Raven becomes more fully integrated into a system environment (i.e. the equivalent of a file system is added and support for transactions is added) the distinction between the local and remote execution cases will become less noticeable.
Additionally, Raven needs an exception handling mechanism to deal with failures.

4.8 The System Object

The operating configuration of the Raven system is tailorable by both the end application user and the programmer. The configuration is stored in the predefined object, system, which is a reserved word in the Raven language. There is only one system object per Raven environment, and it is an instance of class System. The system object is instantiated during the start-up phase of Raven.

During the running of the constructor for the system object, environment variables and the host system are checked for configuration parameters. Based on these values, system's constructor sets its instance variables to reflect the resulting configuration. If necessary, the constructor will also make the appropriate calls to communicate this information to the runtime system. Example specifications that can be set in the constructor are the amount of system-permissible parallelism and the amount of memory the system can use.

The system configuration parameters determined during the instantiation of the system are application independent and reflect the operating environment the application is to run in. The application program can also reconfigure parts of the system configuration. The configuration parameters directly settable by the application control the usage of services dependent upon the nature of the application. Currently, the primary purpose of this capability is to allow the application program to enable special system services, for example the enabling of network-based services and the time-out checking for potential deadlocks.

4.9 Failures and Exceptions

Within a system, failures and exceptions will occur. An application might attempt to divide a number by zero, the network might fail during a remote invocation, or a requested service might not be available. Just how to inform the application of these problems, and how to specify exception and failure handling within the Raven system, is still an open research issue. Part of the problem is that it is not clear how an error notification should be delivered. Furthermore, in some situations the invoking object should deal with the problem, whereas in other situations the thread of control, which is itself an object, might be the appropriate entity to inform.

To illustrate the nature of this problem, consider a method that encounters a divide-by-zero exception. In this case the method itself may want to deal with the problem. It is easy to imagine that within a method the same sort of error processing of a divide-by-zero fault would be performed regardless of the application. In other situations, for example a deadlock time-out, such a failure, and the desired corrective action, is more closely associated with the thread of control. It is unreasonable to expect every method to be prepared to deal with a deadlock error when such an event is more likely a function of the total system configuration than of an individual method.

Failures associated with the network, or involving inter-environment interactions, are detected by the Raven runtime system. Upon detecting the error, special error handling code is run to log this exception. In the Raven Configuration Management System [32], this information is used to construct a response to the failure.
In addition to the special error handling, the nil value is returned to the application.

The last example illustrates the general problem of how the runtime system should report failures or exceptions. Two approaches have been tried within Raven. One approach is to have methods, particularly those that provide services, return nil as a failure indication. This presents a problem when nil is a legitimate return value. It also requires an application to be constantly checking return values for error indications. A second approach is to specify the name of a method to be invoked when a failure is encountered. For example, the class Object defines the doesNotUnderstand() method. This method is invoked when an attempt to invoke a method is made on an object and the method cannot be found. Since all objects inherit from Object, they automatically have some limited error handling for this case. An object can change its error handling by overriding the doesNotUnderstand() method. This approach has problems because it is not clear which object the error notification should be delivered to (i.e. the invoking object or the thread object). Additionally, the corrective actions, and how they are accomplished, need special support. At the moment Raven uses a mixture of these two approaches, but there is no clear consensus on the best way to handle these problems, and they have been left for future work. Before developing a proper error and exception handling mechanism for Raven, the large body of literature in this area [39][86][88] needs to be consulted.

4.10 Unique Raven Language Features

Unlike many systems, the support for parallel and distributed computation forms an integral part of the Raven programming language and was not added once the language was completed. The result is a smooth integration of the support for parallel and distributed computation into the language. Raven's support for parallelism is unique in that it recognizes a single logical program location where parallelism and synchronization between threads can occur: all parallelism and thread synchronization in Raven occurs at the point of method invocation. Although the parallelism occurs at the method boundary, the parallelism is functionally extracted on either side of the boundary. If the method invoker wants parallelism, a companion is used; if the method itself wants parallelism, an early result is used; and, if thread synchronization is needed, a delayed result is used. No other system combines these three related activities to make them a logical part or extension of the sequential method invocation.

Raven's companions extract parallelism at a higher programming level than other systems by eliminating the notion of explicit process creation. For example, in Distributed Smalltalk [20] and Actra [103] the programmer must explicitly create the process and then program the parallel actions within the process. This approach to creating parallelism is significantly different from Raven's, where any object, and any of the object's methods, can be used to generate parallelism at any time. Although CST [37] also uses the method invocation point to extract parallelism, its use is restricted to objects which are aggregates, or groups, of other objects. In much the same way that vector processing is performed, the invoked method is, at least conceptually, simultaneously run on all the objects in the aggregate.
Unlike Raven, objects that cannot be represented as aggregates cannot be the subject of parallel invocations.

When a companion is specified, it encapsulates a block of requests that are to be run in parallel. In that context the companion is very much like the parBegin/End style delimiters of parallel requests found in SR [9] and Argus [69]. However, it differs from these in that the requests comprising the companion run in parallel with each other and with the thread of control starting the companion. In these other systems the requests run in parallel only with each other. The main thread of control, in Argus's case, waits for all the requests to complete before proceeding. With SR, the main thread of control can wait, or it can continue processing. If it continues processing, the main thread cannot later synchronize with the requests. Raven overcomes this synchronization problem by creating a Certificate for each companion. The Certificate permits the application to, optionally, wait for the companion to complete.

Raven's Certificates are similar to the futures of Multilisp [50] and the promises and claims extensions to Argus. However, all these constructs wait only for a single outstanding request to complete. Raven is the only system that extends the futures idea, through CertificateGroups, to allow an application to build collections of futures or promises and then to wait for any one of the requests within the CertificateGroup to complete. By being able to wait for any one of the outstanding activities in the CertificateGroup to complete, Raven applications can process the asynchronous completion of outstanding requests. The other systems offer no mechanism to perform this type of processing on requests.
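The wait-any behaviour that CertificateGroups provide can be pictured with a small sketch written against a conventional threads library. The C fragment below is an illustration only; the structures and names are invented for this example and do not correspond to the Raven implementation.

    #include <pthread.h>

    /* Minimal future/certificate sketch: every certificate in a group shares
       the group's condition variable, so a waiter wakes when any one completes. */
    typedef struct CertGroup {
        pthread_mutex_t lock;
        pthread_cond_t  done;
    } CertGroup;

    typedef struct Certificate {
        CertGroup *group;
        int        completed;
        void      *result;
    } Certificate;

    void certgroup_init(CertGroup *g)
    {
        pthread_mutex_init(&g->lock, 0);
        pthread_cond_init(&g->done, 0);
    }

    /* Called by the companion thread when its result is ready. */
    void certificate_complete(Certificate *c, void *result)
    {
        pthread_mutex_lock(&c->group->lock);
        c->result = result;
        c->completed = 1;
        pthread_cond_broadcast(&c->group->done);
        pthread_mutex_unlock(&c->group->lock);
    }

    /* Blocks until any certificate in the group has completed and returns it. */
    Certificate *certgroup_wait_any(CertGroup *g, Certificate *certs[], int n)
    {
        Certificate *ready = 0;
        int i;

        pthread_mutex_lock(&g->lock);
        while (!ready) {
            for (i = 0; i < n; i++)
                if (certs[i]->completed) { ready = certs[i]; break; }
            if (!ready)
                pthread_cond_wait(&g->done, &g->lock);
        }
        pthread_mutex_unlock(&g->lock);
        return ready;
    }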
Closely associated with companions and Certificates are InvokeStreams, which provide an ordering service for a sequence of companions. The Raven InvokeStream differs from the stream implementations of Gifford [48], and the corresponding extensions to Argus, by allowing one stream to provide ordering to multiple target objects. This approach eliminates the need to create and manage multiple streams at the application level where one stream would suffice. Raven's InvokeStream is also unique in that it can be shared and passed between objects in potentially different threads of control. For example, one object can push a sequence of requests onto an InvokeStream and then pass it to another object in another thread of control. That object can then push its own requests into the stream, and the sets of requests emanating from the different objects will be appropriately sequenced.

Synchronization between independent threads of control in Raven is accomplished through the delayed result construct. The Raven approach differs from other approaches to synchronization by basing synchronization on the notion of returning results, instead of on direct control of the process. For example, in representative systems like SR [10], with guards, or Choices [27], with direct message passing, there are application-visible processes, and synchronization is accomplished by suspending the process and then restarting it. When the suspended process is restarted, it is the process's responsibility to decide how to proceed with processing, and in particular the object must be prepared to deal with any changes in object state that occurred while the object was suspended. The delayed result of Raven uses a different perspective and places the programming emphasis on returning a result. Raven's underlying assumption is that a method should run until it can return the result. If the method cannot run that long, it does a leave, and the method suspends itself and waits for the returned result to be determined. The result to return is computed by another method, and the result statement is used to direct the result to the suspended method for returning. Since the suspended method is only returning the result, the problem of object state changing while the method is suspended is eliminated.

Of the higher-level object-oriented languages or systems, of which Distributed Smalltalk and POOL-T [11] are representative, POOL-T is the only system that supports the notion of an early result. In POOL-T each method is specified as two distinct parts. One part computes the value to return, while the other part specifies the code to execute after returning a result. Raven's early result offers much more flexibility than the POOL-T approach because the processing to be performed after the early result immediately follows the early result. This allows the method to perform different types of post-processing depending upon where in the code the result is returned.

Underlying all of Raven's support for specifying parallelism is the need to control concurrent access to an object's instance data. Existing systems use two approaches to protect instance data. One technique, as exemplified by Distributed Smalltalk and Choices, is to do nothing and make concurrency control the explicit responsibility of the programmer. The other technique, as illustrated by POOL-T and Actra, is to allow only one method of an object to be executed at a time. Some of these approaches also limit which methods can be invoked from outside the execution scope of the object. Locking in Raven differs significantly from all these systems by permitting multiple method invocations to be active in an object and by supplying the locking support automatically. Raven allows multiple read methods or a single write method to be active in an object, thereby increasing the potential for parallelism on an object while relieving the programmer of the responsibility of explicitly managing the instance data locking. The removal of explicit lock management from the programmer's responsibility simplifies programming and reduces the chances of errors due to improper use of concurrency control features. Raven's locking scheme, unlike the other systems, also permits a locked object to recursively invoke back on itself without causing a deadlock. The execution pattern where an object invokes, either directly or indirectly, back on itself is very common in object-oriented programming.

One of the problems with object-oriented systems is finding ways to deliver system services, like concurrency control, to individual objects. One approach, as illustrated by Arjuna [93], is the class-hierarchical approach, where the object inherits from the classes that provide the services of interest. This has two problems. One is that the object still has to invoke the appropriate methods to internally manage the desired services, and the second is the class explosion problem. Reflective systems [108] attempt to solve this problem by allowing entities in the system to redefine system-level behaviors.
There are, however, problems associated with implementation, with determining the scope of any changes made to the meta-objects providing the system behavior, and with the introduction of side effects with respect to other system services. Raven's property mechanism eliminates all these problems by providing a way to attach operating system services to objects on an instance-by-instance basis. The orthogonality of properties guarantees that multiple properties can be associated with an object without interfering with each other, or with other objects, and it also eliminates the class explosion problem.

4.11 Summary

Parallelism is supported through companions and Certificates, along with the constructs for early and delayed result. Together these features allow an application programmer to introduce parallelism from within a method (class-based parallelism), and at the point where the method is invoked (user-based parallelism).

The introduction of parallelism into the Raven system also introduces extra dimensions into the programming problem with respect to concurrency control. Raven introduces properties to provide concurrency control. Properties allow several orthogonal system-supplied services to be associated with an object. As part of concurrency control, automatic locking is performed on instance data. To support locking in a distributed environment, session IDs are introduced.

In any system there are issues associated with resource management. Raven uses garbage collection to address issues of memory management. Processor control is provided by limiting the number of companion threads that can be started. InvokeStreams can also be used to extract protocol parallelism from sequential requests.

The final issue addressed by this chapter is location transparency. A scheme using global object IDs, local object IDs and proxy objects is described that allows Raven to achieve a substantial degree of location transparency.

CHAPTER 5 Implementation Details and Issues

This chapter deals with the underlying implementation issues associated with providing the facilities for parallel and distributed computation described in Chapter 4. The chapter starts with a discussion of how objects, and in particular class objects, are implemented in Raven, along with their role in supporting parallel and remote invocation. During the implementation, problems associated with method dispatch and memory management were encountered. These problems severely affected the performance of Raven. The problems are discussed in detail and the resulting solutions described.

5.1 Object Implementation

At the application level, everything in Raven is an object and all objects are semantically identical. An object is accessed through an object identifier (OID), which is also known as a capability. At the implementation level, however, some objects are handled in a special way. These are objects that have a machine-level representation and can be manipulated directly by the hardware. These objects are called primitive objects. Examples of primitive objects are integers and floating point numbers.

The second type of object identified by an OID is the complex object. Complex objects are object types that are programmer or system defined, and their component parts consist of any combination of primitive objects and complex object references.
A complex object has all the mechanism and organization needed to support the full range of actions expected of an object. All primitive objects have a complex object representation. It is the compiler's responsibility to determine when a primitive object needs to be converted to a complex object or when a complex representation of a primitive object needs to be converted to its primitive form.

The Raven system is organized around the principle that everything is an object. Within an application, objects can serve two functions. The first function, filled by class objects, is to provide the basic template or definition for a class so that new instances of the class can be created. The second function, filled by instances of classes, is to provide the actual data structures manipulated to solve a problem.

Since new objects can come into existence only through invocations on a class, special consideration must be given to initializing the basic Raven class objects. During initialization, the runtime system creates the basic classes Object, MetaObject, MetaClass, and MetaMetaClass (see Figure 3.5). Once these classes are created, the UndefinedObject class and its sole instance, nil, are created. The remainder of the classes are created and initialized in a breadth-first manner based on the inheritance hierarchy constructed by the compiler. Only those classes needed by a particular application are loaded. This class loading is accomplished by having each class record in well-known data structures the name of its parent class and the names of all the classes that it uses. Each class has a well-known system initialization routine that gets called during system start-up. When the initialization phase starts, and the classes Object, MetaObject, MetaClass, MetaMetaClass and UndefinedObject have been initialized, the class Main, which descends directly from Object, is initialized. Class initialization consists of building all the data structures needed to support the class in a running system (see Section 5.1.3). As each class is initialized, it must ensure that all the classes it uses or inherits from are also scheduled for initialization.

5.1.1 Object Organization

There are three broad categories of objects in the Raven system: instance objects, class objects, and meta-class objects. All objects have the same structural organization. An object's uniqueness is established by the contents of its instance data and its methods. Figure 5.1 provides a pictorial representation of the component parts of an object as they appear in main memory.

FIGURE 5.1 Object organization in memory.

Each object ID in the system maps to one object in main memory. In the current implementation of Raven, an object ID is simply the 32-bit address of an object block. An object's representation in main memory is called an object block. An object block has the following fields:

1. Invoke function pointer. This field identifies the function to perform the method lookup when an invocation is performed on this object. This allows different lookup functions to be attached to individual objects.

2. Object ID. This field contains the object ID of this object and allows an object to specify its identity to the rest of the Raven system.

3. Is-a pointer. This field points to the object block of the class this object is a member of.
4. Method type field. This field is used to select the category of methods this object is to use. The category of methods selected depends upon whether this object is a local object or a proxy object. Since proxy objects are agents for a remote object and do not have any instance data, any request to execute a method must be forwarded to the remote site. This field is used to make that distinction.

5. Global ID pointer. This field points to a region of memory that contains the specification for the global ID assigned to the object. Global IDs are assigned on an as-needed basis, and if an object does not have a global ID this pointer is set to NULL. Proxies always have a global ID.

6. Lock pointer. This field points to the lock control data for the object. The lock is used to provide concurrency control on the object.

7. Object properties. This field specifies the properties this object has. Properties are set on an object-by-object basis and assigned at object creation time.

8. Instance data pointer. This field points to the region in main memory where the instance data for the object actually resides. The interpretation of the instance data is a function of the class that the object is created from. Proxies do not have instance data.

With the data provided by an object block, it is possible to discover all needed runtime information about an object. The class hierarchy can be traced, the methods to execute can be obtained, the instance data can be located, and the object properties can be determined.

5.1.2 Global IDs

A global ID (GID) is the mechanism used by the Raven system to provide a unique identifier for an object over all Raven environments. For the vast majority of objects, a local object ID is all that is required. Only those objects used in a remote invocation need a GID; therefore, GIDs are assigned on an as-needed basis.

Global IDs are used to identify objects in one Raven environment to other Raven environments. Global IDs are used only at the Raven system level and not by the application. An application always references an object through the local object ID. When an object ID needs to be passed between Raven environments, the object ID is converted to its GID, and it is the GID that is passed. At the receiving site the GID will be converted to a local OID, either through a proxy or, if the object exists on that site, the actual object. Figure 5.2 illustrates the global ID information transmitted between Raven environments and stored as part of an object.

FIGURE 5.2 Format of global ID information.

The fields in a GID provide the following information:

• Host ID. This field provides the host ID of the machine where the object resides. In the current implementation, this is simply the IP address of the host.

• Local Raven Environment ID. Each Raven environment is implemented using a process. Since multiple Raven environments can be running on a single machine, an identifier is required to select the appropriate Raven environment. In the current implementation this is the environment's UDP port number.

• Local Object ID. This field identifies a particular object within a Raven environment.

• Class name. This field is the textual representation of the class name that the object represented by this GID belongs to. The class name is required so that the receiving site can construct a proxy for the object identified by this GID, if a proxy is needed.
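Pulling the object block fields of Section 5.1.1 and the GID fields above together, C declarations along the following lines capture the layouts being described. The declarations are purely illustrative; the field names and types are invented here and are not taken from the Raven sources.

    typedef unsigned long OID;

    /* Sketch of the global ID record exchanged between environments. */
    typedef struct GlobalID {
        unsigned long  host;            /* IP address of the owning host        */
        unsigned short env;             /* UDP port of the Raven environment    */
        OID            local_oid;       /* object ID within that environment    */
        char           class_name[64];  /* lets the receiver construct a proxy  */
    } GlobalID;

    /* Sketch of the in-memory object block from Section 5.1.1. */
    typedef struct ObjectBlock {
        void               *invoke_fn;      /* dispatch routine for this object   */
        OID                 oid;            /* this object's own identifier       */
        struct ObjectBlock *is_a;           /* class the object is an instance of */
        int                 method_type;    /* local methods or remote (proxy)    */
        GlobalID           *gid;            /* NULL until a GID is assigned       */
        void               *lock;           /* concurrency-control state          */
        unsigned long       properties;     /* property bits set at creation      */
        void               *instance_data;  /* NULL for proxies                   */
    } ObjectBlock;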
When an object ID is passed as a parameter in a remote invocation, a GID needs to be sent. If the object ID does not identify a proxy, then a global ID is assigned to the object and passed to the remote site. Once the global ID is assigned to the object, that association is fixed, and the global ID can be obtained from the data structure pointed to from the object block. The information that constitutes a GID is sufficient to allow a proxy to be constructed at the remote site and to allow the remote site to locate the object.

On the remote site, when a global ID is received a proxy object may need to be created to represent the remote object and to provide an object ID for use in the local environment. As part of constructing the proxy, space for a global ID is allocated and the global ID that was passed in is stored; the class name passed as part of the GID is used to set the proxy's type (the is-a field in the object block). A proxy, like all objects, is represented in main memory with an object block, except there is no instance data. A proxy object will have the data fields in an object block (Figure 5.1) set in the following manner:

1. Invoke function pointer. This field will point to a function tailored to perform remote method invocations.

2. Object ID. This is set to be the object ID of the proxy.

3. Is-a pointer. This points to the object block of the class the remote object is a member of. The class name passed in the GID is used to look up the class.

4. Method type field. This field is set to indicate that the remote version of a method should be invoked.

5. Global ID pointer. This field points to the global ID of the remote object.

6. Lock pointer. This field is NULL. Locking is handled by the site where the object resides.

7. Object properties. This is set to BASIC (Section 4.4). Any special handling of an object because of its properties is performed at the site where the object resides.

8. Instance data pointer. This pointer is NULL.

The global ID information stored with the proxy will be used either when the proxy is passed as a parameter or when an invocation is performed on the proxy. When a global ID arrives at a site, a proxy is constructed only if a proxy for the object does not already exist and if the object identified by the GID does not reside at that site.

5.1.3 Class Objects

As mentioned earlier, classes in the Raven system are objects. They obtain their uniqueness through their instance data and behaviors. It is through a class object that methods are looked up and a description of an object's instance data maintained. Each object block in Raven has an is-a field which references a class. A class object describes the common properties associated with all instances of the class. Figure 5.3 shows the layout of the instance data for a class object. For all but the basic classes, the description of the instance data is taken from the class declaration. The format of the class instance data is needed before the system starts so that the basic classes can be created. The layout of the class instance data and methods is, therefore, specified in the runtime system. The fields and the functions of the class instance data are as follows:

1. Parent class object ID. This is the object ID of the parent class and is used to implement inheritance. When a method cannot be found at this level, the parent class is consulted to determine if the method resides there.

2. Class data size. This field indicates how large the class data is. Raven supports the notion of class data that can be associated with a class. This data is set when the class is created and cannot be changed after that.
3. Class data pointer. This field points to the class variables accessible to the programmer.

4. Class ID. This is a unique identifier for the class on this machine.

5. Instance data size. This field indicates how much instance data is associated with instances of this class.

6. Max parameters. This field records the maximum number of parameters that any method in this class uses. This is used to decide the default invoke function to assign to instances of this class.

7. Default invoke pointer. This field points to the invoke function assigned to basic instances of this class. The invoke function assigned to an object can be changed at object creation time depending upon the program-assigned properties.

8. Method count. This field indicates how many methods are in the method table.

9. Method table pointer. This field points to a table listing all the methods implemented by this class. Each entry in the table contains: the method ID field to identify the method; the parameter count and types; the lock type (i.e. read, write or nolock); the return type; a list of pre-invoke handlers to run before the method; a list of post-invoke handlers to run when the method terminates; the method to run for a local invocation; and the method to run for a remote invocation.

10. Instance variable count. This is the count of the number of instance variables this class has.

11. Types of instance data. This field, which is variable in length, specifies the type of each piece of instance data.

FIGURE 5.3 Instance data of an object of type class.

A class object plays two pivotal roles in the Raven system. First, when a new object is created, the class is consulted to construct an instance of the object. This involves allocating space for the instance data and making the appropriate class relationships. Secondly, when a method is invoked, class objects are consulted to determine the method code to execute.

5.1.4 Properties

Each object in Raven can have properties attached to it, and the assigned properties are specified in the property field of the object block. Properties associate particular operating system services with an object, and the object property field indicates what properties are currently in effect for this object. Based on that information, the invoke routine will perform specific processing. When the invoke function is called it checks to see what properties are associated with the object and executes the pre-invoke functions for those properties. The method is executed, and then any post-invoke functions required by the properties are executed. As an example of the processing, consider the actions taken when an object has the controlled property. In this situation the pre-invoke function will acquire the appropriate lock for this method; the method will be executed and, when it returns, the lock is released using the post-invoke function. The data structures used to manage the lock are pointed to by the lock pointer in the object block.
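The pre- and post-invoke pair for the controlled property could be pictured roughly as follows. This is only an illustrative sketch using POSIX read/write locks with invented names; it is not the Raven handler code, and it omits the recursive self-invocation case that Raven's locking supports.

    #include <pthread.h>

    /* Illustrative lock state reached through the object block's lock pointer. */
    typedef struct { pthread_rwlock_t rwlock; } ObjectLock;

    /* Pre-invoke handler: a read method takes the lock shared, a write
       method takes it exclusively, matching the method table's lock type. */
    void controlled_pre_invoke(ObjectLock *lock, int is_write_method)
    {
        if (is_write_method)
            pthread_rwlock_wrlock(&lock->rwlock);
        else
            pthread_rwlock_rdlock(&lock->rwlock);
    }

    /* Post-invoke handler: release the lock once the method has returned. */
    void controlled_post_invoke(ObjectLock *lock)
    {
        pthread_rwlock_unlock(&lock->rwlock);
    }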
5.1.5 Method Invocation and Lookup

To invoke a method, a method invocation function is called. The function takes as parameters the object, some ancillary information useful for looking up methods, and the parameters for the method. The invoke function is responsible for looking up the method to run, making copies of any parameters marked as copy, performing object-specific pre-invoke processing, determining if the method is local or remote, running the method, performing object-specific post-invoke processing, and returning the result, making a copy of it if necessary. Figure 5.4 provides a pseudo-code description of the steps that need to be performed by the invoke function.

    - lookup method to be run
    - if (parameters need to be copied) then copy required parameters;
    - if (property1 in effect) then execute pre-invoke function for property1;
    - if (property2 in effect) then execute pre-invoke function for property2;
    - if (property3 in effect) then execute pre-invoke function for property3;
    - execute other required pre-invoke functions
    - if (local method) then return_value = execute_lookup_method;
      else return_value = execute_remote_requests;
    - execute post-invoke functions
    - if (property3 in effect) then execute post-invoke function;
    - if (property2 in effect) then execute post-invoke function;
    - if (property1 in effect) then execute post-invoke function;
    - if (return result needs to be copied) return_value = copy of return value;
    - return return_value;

FIGURE 5.4 Pseudo code illustrating the steps involved in performing an invoke.

The pre-invoke and post-invoke processing consists of checking the object to see which properties are in effect. However, each one of these checks takes time, and in the vast majority of cases there are no properties in effect, so the tests are wasted. A similar observation was made about remote invocations and the copying of parameters and returned results: again, most objects do not need special processing to handle these cases. Given the large number of conditions the invoke routine must check, and the fact that every method must run this code, the invoke code is a potential performance bottleneck. It is therefore imperative that the invoke routine perform the minimum amount of work required to properly execute a method.

One way to reduce the invoke processing overhead is to have multiple invoke routines designed to handle different invocation scenarios. For example, one routine could handle the basic case when no special processing is required, while another could be a generic routine capable of handling all the special cases. Then only those objects needing the extra processing would use the generic routine and pay the execution cost. The difficulty with this approach is determining which invoke routine to run for a particular object. Partly because of properties, each object is different; therefore, the invoke routine needs to be dynamically determined on an object-by-object basis. In Raven the invoke field of an object's object block specifies the routine to use when processing method requests, and each object can be assigned the invoke routine best suited for handling its invocations. As part of the object creation process, the runtime system examines the properties of the object, checks to see if parameters or results are copied, and checks to see if the object is remote (this is done if a proxy object is created).
Basedon these checks, the runtime system selects one of several invoke routines for use by this object.An object that has no special processing needs will be assigned the invoke routine which simpiy looks-up the method and executes it, whereas an object with multiple properties and copyparameters might be assigned the generic invocation routine. The generic routine is basicallythe one described by Figure 5.4.By using this approach, the runtime system can provide a large collection of invocationroutines specially tailored to the needs of the objects, as opposed to one generic routine that canhandle all possible types of objects. The result is that the invoke routines need to execute onlythe code relevant to the object, so only those objects that require special processing pay theincreased execution costs. The invoke routine does not need to be concerned about servicerequirements that do not apply to the object.For a method to be executed on an object, the class and object must be co-resident, sincethe class is the only entity that identifies the methods (code) that can be used to manipulate anobject. The association between an object and its class is accomplished through an object’s is-a pointer, which points to the class the object is a member of. Once the object’s class is estabCHAPTER 5: Implementation Details and Issues 137lished, the methods runnable on the object can be determined. If the object being invoked on isitself a class then the is-a pointer identifies the meta-class for this object. Meta-classes are automatically constructed by the runtime system when a new class is created and are not user specified. All meta-classes have their is-a pointer directed at the system defined classMetaMetaClass, whose is-a pointer points at itself. Again Figure 3.5 can be consulted to see theis-a relationship between instances, object classes, meta-classes and MetaMetaClass.In Raven, method lookup is dynamic; therefore, for Raven to work quickly and efficientlyit is imperative that as little time as possible is spent determining the method to run. Conceptually, when a method is invoked, the object’s is-a pointer is followed to get to the class thisobject is a member of. The method is then looked for in the method table of the class using themethod ID as a key. If the method is not found, then the search continues for the method bylooking in the parent’s methods, and so on until either the method is found or the top of theinheritance hierarchy is reached and a method lookup error signalled.In the first implementation of Raven, methods were identified and dynamically looked-up at runtime using the method name as the key. Profiling was performed on the code and itwas discovered that a substantial amount of time was spent performing the string comparisonsin the method lookup. It was obvious that string comparisons needed to be avoided. After someexperimenting, a technique using hashing was selected. The approach relies upon replacing, formethod lookup purposes, the string representation of the method name with a four byte hashvalue. For this to be an effective solution there needs to be few hash collisions, and the hashvalue, for performance reasons, should be computed during compilation.The problem of hash collisions was solved by selecting a good hash function. A numberof hash functions were evaluated and a public domain hash function was taken fromsdbm [1071, a hashed database system. This function was found to be very good at generatingunique hash values. 
In one test 84,165 strings taken from a list of English words and the symbol tables from various programs were hashed and no collisions occurred. The number of methods in a Raven application is typically an order of magnitude less than this, so the chances of a hash collision are expected to be small. Of course, hash collisions are of consequence only if the collision occurs for methods in the same class or in the inheritance hierarchy as traced from a leaf class to the root. Still, a hash collision, although highly unlikely, could have devastating effects and could cause a correct program to fail. This problem is addressed by checking hash values for collisions as the methods are registered with a class during the system initialization phase. If a collision is detected, the Raven program stops and reports the collision. It is then the programmer's responsibility to change one of the method names so that a collision does not take place. Using this load-time check, the hash is assumed to be perfect. To date, no Raven program has ever reported a method name collision.

To avoid the expense of having to compute the hash value at every invocation, the key for method lookup is the hash value. Within Raven this hash value is called the method ID, since a one-to-one mapping between method IDs and method names is assumed. When an invocation is performed, the method lookup function takes the method ID to be looked for and checks it against the method IDs stored in the appropriate method table. Since the methods being invoked are known at compile time, the compiler can compute the method ID immediately and use a constant for the method ID instead of calling a runtime routine to compute the ID.

Although the use of method IDs significantly reduced the execution overhead associated with method lookup, a substantial amount of time was still spent looking for the method. It was therefore decided to use a method cache to speed up method lookup. However, it was also important to minimize the overhead associated with determining whether or not a method was in the cache. This requirement effectively ruled out any caching strategy that involved an organized search of the cache. The solution selected was to use a fixed-size hash table to store information about the cached methods. The initial hash function used an exclusive-or of the object ID and the method ID. The method ID could not be used by itself, since methods with the same name, but belonging to different classes, would have the same method ID. It was observed that, although this worked, the hit rate on the cache was low, and the same method was often used by different objects. These observations resulted in changing the hashing scheme so that the class ID of the object was exclusive-ored with the method ID. Implementing these changes resulted in the cache hit rate with a 1024-slot hash table exceeding 95% for the test applications. Of course, this result was not unexpected, as programs typically exhibit a large amount of execution locality. Once a computation enters a particular phase, the methods currently in use will fit in the cache and a large number of hits will be registered before the computation moves on to another phase and reloads the cache.
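Before turning to how the improvement varies with the application, the cache probe itself can be sketched. The C fragment below is illustrative only (the names are invented); it also behaves as a single shared table, which is exactly the arrangement whose locking becomes a problem on a multiprocessor, as discussed next.

    #define CACHE_SLOTS 1024          /* power of two, as in the 1024-slot table */

    typedef struct {
        unsigned long class_id;
        unsigned long method_id;
        void         *method_code;    /* NULL when the slot is empty */
    } MethodCacheEntry;

    static MethodCacheEntry method_cache[CACHE_SLOTS];

    void *method_cache_lookup(unsigned long class_id, unsigned long method_id)
    {
        MethodCacheEntry *e =
            &method_cache[(class_id ^ method_id) & (CACHE_SLOTS - 1)];

        if (e->method_code && e->class_id == class_id && e->method_id == method_id)
            return e->method_code;    /* hit */
        return 0;                     /* miss: fall back to the full table search */
    }

    void method_cache_insert(unsigned long class_id, unsigned long method_id,
                             void *code)
    {
        MethodCacheEntry *e =
            &method_cache[(class_id ^ method_id) & (CACHE_SLOTS - 1)];

        e->class_id    = class_id;
        e->method_id   = method_id;
        e->method_code = code;
    }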
In particular, the access patterns to the cache, and how far up the inheritance hierarchy a method is,will have a significant effect on the observed improvement. For code that had a high cache hitrate and methods that were found in the method tables at the leaves of the class hierarchy, themethod invocation time went from over 10 procedure call times to about 2.7 procedure calltimes. A procedure call time is the amount of time required to execute the null C procedure.What this means is that if the C call takes 1 time unit to execute then the equivalent Raven invocation takes 2.7 time units to execute.This particular configuration served Raven well until Raven was ported to run on a sharedmemory multiprocessor. On the first run of an application executing multiple disjoint threadsof control and using two processors, there was only a slight performance increase. As the number of processors working on the problem was allowed to increase there was actual performance degradation when compared to the single processor case. The problem was traced to themethod hash table. Since there was the potential for the hash table to be modified if there wasa method collision, or if a new entry was added, the table had to be locked for exclusive accessby the method lookup routine. It was originally assumed that the amount of time the lock onthe table would be held would be much shorter than the amount of time between lock requests;therefore, there would not be a contention problem. This assumption was incorrect. There was,CHAPTER 5: Implementation Details and Issues 140in fact, extremely heavy contention for the hash table, and this accounted for the observed performance degradation.For Raven to run in a multiprocessor environment, some way of reducing the contentionfor the hash table during method lookup was required. One solution considered was to changethe granularity of the locking on the table so that each entry would be independently lockedinstead of the whole table. It was not clear if this solution would solve the problem. It is easyto imagine an application with a number of parallel threads all executing the same method in aloop. With this solution, each one of these requests would still block and there would not beany noticeable performance improvement. A modification of this approach, such that read andwrite locks could be used on entries, was also considered. This approach seemed overly complex and it would certainly require considerably more CPU time to run than the much simplersingle lock approach.Another way to reduce contention is to have multiple caches and use some sort of simplescheme to decide which hash table to use next. Part of the difficulty with this solution is deciding upon how many hash tables to have, and there is still the possibility of hash table contentionif applications get in step with one another. The likelihood of this happening is reasonably high,given how often methods are looked up in Raven, and the experience with a single lookup table.The three solutions proposed so far all had the advantage that they could be easily and relativelyquickly added to the existing Raven system. Their primary disadvantage was that there wassome question as to whether or not the solutions would actually deliver improved performance,and if they did, how well the solutions would scale.Since it appeared that the proposed solutions were not going to be able to provide thedesired level of support, it was decided to use a more complex strategy, based on multiple hashtables. 
The approach taken is to associate a hash table with each Raven thread. By doing this itis possible to dispense with locks on the hash table, since only one thread of control can everaccess a table. To be effective, there must be quick access to the hash table. With the existingorganization that was not the case. The hash table exists and is managed completely by the run-CHAPTER 5: Implementation Details and Issues 141time system; therefore any access to process specific data structures, which might be useful forkeeping information about the hash table, are one layer removed from the runtime system andnot accessible without a call to the threads package. If possible, function calls to determine thelocation of the hash table were to be avoided.With the hash table being so closely bound to a thread, it seemed appropriate to allocatethe hash table at the same time the process stack was created. The threads package was modifled so that as part of the basic starting of a thread a hash table for the thread would also beallocated. To get quick access to the hash table, it was decided to pass the address of the hashtable as a parameter to the invoke routines. Since an invoked method could itself invoke amethod, the hash table information needed to be passed into the actual Raven method. Thisrequirement necessitated a change to the compiler so that all Raven methods would be expecting an additional parameter with the hash table address. Additionally, the compiler had to bechanged to ensure that all calls to the invoke routine that it emitted also included this newparameter.By passing in the hash table, a mechanism was being developed that allowed thread-specific information to be made easily and quickiy available to the runtime system. Once this wasrealized, it became apparent that other parts of the runtime system and compiler could takeadvantage of this organization. As such, a new entity, the Raven environment, was specified. Itis actually this new Raven environment structure that is passed around. Currently, the per-thread environment regions specify the process ID of the system process running this thread,the Raven object ID for this thread, the hash table, and the end-of-stack location. Some of thisinformation is accessed frequently, and having it in the environment offers performanceimprovements over function calls to the threads subsystem. Some of the other information isdifficult to obtain and modify, so having the information available to the runtime system provides greater flexibility in the type of code generated by the compiler and ease of use by theruntime system.CHAPTER 5: Implementation Details and Issues 142Once all of these changes were made, the original application program that highlightedthe deficiencies of the hash table system was rerun. This time the performance improvementswere as expected, and the solution scaled with the number of processors. The hash table wasno longer a system bottleneck.5.2 Memory Allocation and Garbage CollectionWithin Raven, the unit of memory allocation is the object. There is no notion of directly allocated memory through a function like mal bc ()that is common to C programs. Instead, theprogrammer explicitly creates new objects. The creation of an object results in the Raven run-time system acquiring some memory to hold the object and in the initialization of the object. Areference (OlD) to the object is returned to the application program. In the C/Unix environment, once the allocated memory is no longer needed the program frees it. 
With multiplethreads of control, it becomes increasingly difficult to determine when it is feasible to free allocated memory.The problem is compounded in Raven since capabilities can be freely distributed andduplicated. When an invocation is performed, the parameters to the method reference otherobjects. When control is returned to the invoker, it is not possible, in general, to determine whatwas done to object references passed as parameters. The invoked method can make a copy ofthe object reference, pass the reference on through a another invocation, or be finished usingthe object. If copies were made, then the object cannot be freed. Likewise, the method thatmakes a copy of a reference cannot, in general, know what other objects already have a reference to an object. Many systems deal with this problem by establishing strict memory allocating and freeing protocols between program modules. The very nature of an object-orientedsystem and its approach to data hiding and encapsulation discourages such an approach. Onepossible solution would be to adopt a compiler-based reference counting technique. However,this technique becomes increasingly difficult to implement as local method variables go in andout of scope, capabilities get passed as parameters, returned as results, stored in instance van-CHAPTER 5: Implementation Details and Issues 143ables, assigned to method variables, and lost as method and instance variables have new dataassigned to them. Instead, the approach adopted in Raven was to use a conservative garbagecollector.The memory allocation and garbage collection system being used is the implementationby Boehm and Demers of the Xerox Corporation and described in Boehm and Weiser 1988[23]. Modifications to operate in the Raven parallel processing environment were made as partof this thesis work. This collector has the advantage that it can be grafted onto an existing system through the storage management functions of free () and malloc ( ) . To use the collector, the standard C library calls for memory allocation and freeing are replaced with theequivalent ones from the memory management package. The design of the Raven system issuch that there are only a few locations in the runtime support code where memory allocationis performed, thereby making the task of changing the memory allocation techniques a trivialone. Raven application programs are not affected by any changes to the interface to the memoryallocation routines because Raven applications cannot directly allocate memory. Raven applications acquire new memory to work with by creating objects, and this is done through an interface to the Raven runtime system.The collector being used is described as a conservative collector, designed to operate inan uncooperative environment. The garbage collector uses a mark and sweep approach consisting of two passes. The first pass marks all the allocated data objects accessible by a program,and the second pass returns the unmarked memory objects to a free memory list. The markphase treats all data accessible by the program as potential pointers. This includes all the datain registers, on the stack and in static program data areas, as well as the data within an allocatedmemory object. Data alignment within each of these regions is assumed to be on boundariesdictated by the particular hardware. For the machines Raven currently runs on, 4 byte boundaries are used. 
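A sketch of the mark-phase scan over one such region is shown below; find_object() and mark_object() stand in for the collector's internal routines and are named here only for illustration.

/* Conservatively scan a region (stack, saved registers, static data, or an
 * allocated object) and mark anything that looks like a pointer. */
extern void *find_object(void *candidate);   /* allocated object containing this address, or NULL */
extern void  mark_object(void *obj);         /* mark it and, recursively, what it references */

void scan_region(void **start, void **end)
{
    void **p;

    for (p = start; p < end; p++) {           /* step by the 4-byte alignment */
        void *obj = find_object(*p);          /* could this word be a pointer? */
        if (obj != NULL)
            mark_object(obj);                 /* an integer that looks like a pointer causes
                                               * excess retention, never incorrectness */
    }
}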
Each piece of data in the region is checked to determine whether or not it could be a reference to memory managed by the allocation system. If it could, then the assumption is made that this data value is indeed a pointer, and the memory object it refers to is marked as accessible. It is possible that an integer value may map to a valid object, and this will result in unused memory being marked as in use. This will not result in incorrect program execution, only in excess memory consumption. Further details on the implementation of this storage allocation system can be found in Boehm and Weiser [23].

5.2.1 Modifications to the memory allocation system

The original collector offers some support for a parallel environment by providing hooks to serialize requests for memory. Serialized memory requests severely degrade a parallel application's performance when the application places high demands for memory on the storage allocation system. The initial efforts at using the storage allocator in a parallel environment used the provided facilities. As long as Raven was operated in a single-processor environment there were no problems. However, a problem was encountered as soon as the test applications were run on a shared-memory multiprocessor. It was apparent that the memory management system was experiencing problems when high demands were placed on it.

During parallel operation the Raven system placed extremely high demands on the allocator, and the allocator was essentially unable to keep up with requests. The rate of memory requests was so high that the memory allocator became a bottleneck and forced the application to become sequential. Although each application has its own memory use profile, generally, as the number of processes and processors in use increases, the rate of memory requests increases. Once the interval between requests becomes less than the service time for a memory request, the application is effectively serialized while processes wait for memory. As the queues for memory requests grow, the overhead in maintaining the queues also increases, and the system gets progressively slower while less and less useful work is done.

Figure 5.5 is a graph of the processor efficiency versus the number of processors, and of the proportion of memory allocation delays versus processors. (Efficiency is speedup divided by the number of processors; more formal definitions are provided in Section 6.1.1.) The application being run is a Raven program consisting of a loop that creates new objects and then discards them. In this example, the main program creates 200,000 new objects, which translates directly into that number of calls to the memory allocation system.

FIGURE 5.5 Processor efficiency and memory allocation delays when no heaps are used.

The graph shows that with two processors there is contention for the memory allocator only 5% of the time. As soon as three processors are used, over 99% of the memory requests result in contention and there is a corresponding dramatic drop in efficiency. By this point the application has been effectively serialized, and, as processors are added, there is an increasing amount of overhead associated with managing access to the memory system.
In fact, the elapsed times for this application increase as processors are added. Although this example represents a worst-case scenario for memory allocation, it illustrates how sensitive the memory allocator is to the number of simultaneous requests being made of it. In an effort to reduce the contention on the memory allocation system, modifications were made to the allocator.

The modifications consisted of dividing the memory allocation pool into multiple parts. Instead of having a single pool of memory to allocate from, the single pool is divided into several pools. The number of pools to use is a configuration parameter of the Raven system, but typically the number of memory pools equals the number of processors in use.

Figure 5.6 shows the changes in processor efficiency and the proportion of memory allocation delays when memory is divided into pools and allocated from the pools using a round-robin approach.

FIGURE 5.6 Processor efficiency and memory allocation delays when multiple heaps are used.

In this example there is a one-to-one correspondence between the number of memory pools and the number of processors allowed to work on the problem. This significantly reduced the amount of contention in the system for the memory allocator. With 3 processors and 3 memory pools, there is memory allocator contention less than 10% of the time. Over the range of processors tested, the maximum proportion of memory requests that were delayed was around 25%, which is significantly less than the near 100% delay rate initially experienced.

It is expected that contention can be reduced further by increasing the number of memory pools. This hypothesis was substantiated by increasing the available memory pools to 20 and using 5 processors. In this test, memory request delays were reduced to 2.86%. Although more memory pools decrease contention, they do not come without a cost. Each memory pool has a number of data structures associated with it, and these must be initialized. The result is that as the number of memory pools increases so does the administrative overhead associated with the initial start-up and continued management of the memory allocation system. The initialization sequence is single-threaded, thereby reducing the overall efficiency of the application. For long-running jobs this is not an issue, as the start-up time will be dwarfed by the total running time. With shorter jobs, such as the examples used to check for memory contention, the additional initialization costs may more than offset the potential performance gains. Again, this is a situation where a memory allocation system designed for use in a parallel environment should be able to offer an important performance improvement.

In Figure 5.6, the proportion of memory allocation requests that are delayed peaks at 4 processors, then drops and never reaches the same level again. Further instrumenting of this application provided additional insight into what was happening. It was determined that the memory allocation system was sensitive to the size of the memory pools. A contributing factor is that the pool size can affect how often garbage collections are performed.
For example,if the same application is working with a one-megabyte pool one time and two-megabyte poolanother, the amount of memory that can be allocated between collections is greater for the twomegabyte pool. The memory allocation system attempts to limit this problem by dynamicallyadjusting the size of the memory pools. However, it is not successful under all circumstances.To illustrate the problem, compare the memory pool from which memory is allocated toa barrel. As memory is allocated, objects, representing the allocated memory, can be droppedinto the barrel. When the objects in the barrel reach a certain height, a process comes along tocheck all the objects to see if they are being used (referenced). Objects which are not referencedare removed, thereby reducing the level of the objects in the barrel. The difference between theCHAPTER 5: Implementation Details and Issues 148height when the checking is finished and the level where checking is initiated will affect howlong the application runs before a collection takes place. If the height after checking does notdrop below a specified level, a new larger barrel is acquired and the objects transferred to it.This corresponds to the memory management system expanding the memory allocation pooi.Under certain circumstances this approach results in the system entering a thrashingphase where few memory objects are allocated between requests, and the increased number ofgarbage collections causes performance problems. To get into this state, the following conditions need to be met:1. The rate of memory requests needs to be high.2. Allocated memory soon becomes garbage.3. The base memory usage is just below the level where a heap expansion wouldtake place.When these conditions are met the system will spend a great deal of time garbage collecting.When a memory request is made, a pool is selected to allocate the memory from using around robin strategy. The pool is then locked while the allocation is made. If a selected pool iscurrently in use, the requesting thread will block, waiting for the pool to be freed. When a garbage collection takes place, the allocation pool is locked much longer than for a straight allocation, and requests will start to block waiting for this heap to be freed. The problem is mostpronounced when the memory request rate is high. Under these conditions, requests tend tobunch up with a resulting drop in parallelism. It is this phenomena that was occurring aroundprocessor 4 in Figure 5.6. When an additional processor is used, an extra heap is also added.The extra heap and processor combine to change the allocation patterns and heap sizes so thatthere are fewer garbage collections. Fewer collections result in fewer memory allocation delaysand the amortization of the administrative costs of doing a collection, with a resulting improvement in system performance.CHAPTER 5: Implementation Details and Issues 149Ideally, heaps should be large so that no garbage collection has to be performed. For mostapplications that is not possible, and a compromise between the frequency of garbage collections and the duration of the collections needs to be established. For example, large pauses fora huge garbage collection may not be acceptable when the application involves a communications protocol, whereas a large number of smaller collections may be fine. The best pool sizeto work with varies on an application-by-application basis.Further work was not done on improving the memory system due to time constraints. 
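The overall shape of the multi-pool allocation path described in this section is summarized by the sketch below. The pool structure, the lock primitives, and pool_alloc() are placeholders for the collector's internal machinery; only the round-robin selection and the per-pool locking reflect the modifications actually made.

#include <stddef.h>

typedef int lock_t;                     /* placeholder for the real lock type */
extern void lock_acquire(lock_t *l);
extern void lock_release(lock_t *l);

#define NUM_POOLS 4                     /* configuration parameter; usually the number
                                         * of processors in use */
struct pool {
    lock_t lock;                        /* held while allocating from or collecting this pool */
    /* ... per-pool allocator and collector state ... */
};

extern void *pool_alloc(struct pool *p, size_t size);  /* may trigger a collection of this pool */

static struct pool pools[NUM_POOLS];
static unsigned int next_pool;          /* round-robin cursor */

void *raven_alloc(size_t size)
{
    struct pool *p = &pools[next_pool++ % NUM_POOLS];
    void *mem;

    lock_acquire(&p->lock);             /* contention is spread across NUM_POOLS locks */
    mem = pool_alloc(p, size);
    lock_release(&p->lock);
    return mem;
}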
By making the modifications outlined here, the parallel constructs of the Raven system can be properly demonstrated. In general, performance issues and quirks associated with garbage collection have been left for future work.

5.3 The Thread Environment

The Raven virtual machine defines the low-level operating system interface that the runtime system uses. The runtime system is based on a threads package developed at UBC [80]. The threads package supplies a basic set of services for process creation and management, memory management, and message passing. The message passing functions are Send/Receive/Reply. This section describes the modifications and extensions to the Raven runtime system to support parallel and distributed processing. How Raven uses these services to implement companions, early result, and delayed result is also described.

In its simplest form, parallelism can be supplied by providing access to system functions that explicitly create processes. The Raven language, instead, provides direct syntactic support for creating parallelism through companions and early result, thereby hiding the system-specific way of creating processes from the application. The Raven runtime system, however, needs to have access to the process creation services to support Raven's facilities for parallel programming. Since several of Raven's system classes (e.g. Certificate, CertificateGroup and InvokeStream) need to work with threads of control, and to have as much of the Raven system written in Raven as possible, it was decided to create a Thread class to provide an object interface to the underlying thread creation and management functions. Additionally, an executing thread needs a language-level interface to be able to perform control operations, like sleep(), on itself.

5.3.1 The Thread Class

The Raven system is implemented using a layered approach. Each layer of software provides a certain functionality and typically makes use of only the layer directly below it. Figure 5.7 shows the layers of the Raven system from the hardware up: applications, class library, runtime support, threads subsystem, and hardware.

FIGURE 5.7 Layers of the Raven system.

The lowest layer depicted is the hardware. On top of the hardware is the threads subsystem. This subsystem provides the lowest level of process (thread) creation, management, and control. Unlike other layers in the Raven system, the runtime layer actually knows something about the Raven class library, which is the layer directly above it. The runtime layer knows about the threads subsystem and how to interact with it, and it knows about system-specific objects. As a result, the runtime system, in conjunction with the Raven compiler, provides a "bridge" between the threads subsystem and the higher-level Raven abstractions. To accomplish this bridging, the runtime system makes assumptions about the objects that exist in the layers above it and about the methods these objects support. The layers above the runtime system are the class library, which supplies a set of predefined Raven classes, and the actual Raven applications. In addition to providing predefined classes, the class library also implements system classes. System classes are those classes which provide some sort of basic service or require the runtime system to provide a service being mapped to the Raven object world.

The Thread class provides an object interface to the process creation and management services of the threads subsystem.
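The services this class wraps are ordinary C entry points in the threads package. The declarations below sketch the kind of interface involved; the actual names and signatures in the UBC threads package differ, so these are illustrative assumptions only.

#include <stddef.h>

typedef int thread_id_t;                       /* thread (process) identifier */

/* Process creation and management. */
thread_id_t thread_create(void (*entry)(void *arg), void *arg, int start_suspended);
void        thread_suspend(thread_id_t tid);
void        thread_resume(thread_id_t tid);
void        thread_kill(thread_id_t tid);
void        thread_sleep(unsigned int ticks);  /* a thread may only sleep itself */

/* Send/Receive/Reply style message passing between threads. */
int Send(thread_id_t to, const void *msg, size_t len, void *reply, size_t reply_len);
int Receive(thread_id_t *from, void *msg, size_t len);
int Reply(thread_id_t to, const void *reply, size_t len);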
Within the threads subsystem there are four categories ofthreads:1. System processes (threads). These threads typically provide services to theRaven system; the communications manager is an example of a system process.2. User level processes (threads). These are the processes that form the application.3. User companion threads (Section 4.6.2). These are threads spawned from a userprocess using Raven’s certificate and companion support.4. System companion threads. These are the same as user companion threads,except they are started from a system process.Any threads associated with managing or monitoring the Raven system are system threads.User threads, including companion threads, are threads created by an application. When aRaven environment starts, there is exactly one user thread, the one executing the application.Other user threads are created on demand by the Raven system to handle early reply, companions, or to service remote invocation requests.Each thread is managed by the underlying threads package. This means that if a threadneeds to use the process management services (e.g. sleep ( ) ) this request must be communicated to the threads package, which has a C interface and not a Raven one. This problem isovercome by providing a Raven interface through the system class Thread. Each applicationlevel thread has associated with it an instance of the class Thread. From an application, the following steps need to be performed to create a new thread of control:1. Instantiate an instance of Thread.2. Start the Thread executing by invoking the start () method. The start ()method takes as parameters an object ID, the name of the method to invoke onthe object, and the parameters for the method.CHAPTER 5: Implementation Details and Issues 152It is only when the start () method is called that the underlying threads package is called tocreate a new process and start it executing. The most commonly used methods of the Threadclass are:• start.Start execution of the Thread on the provided object with the givenmethods and arguments.• sleep () . Suspend execution of the Thread for the supplied time. One thread ofcontrol may not issue the sleep () method on another thread of control.• suspend (). Suspend execution of the target Thread.• resume () . Resume execution of the target Thread.• kill ().Terminate the target Thread.Other methods are supported by the Thread class but they are used internally and not generallyavailable to the programmer.In Raven, it is the language-level features of companions, early result, and delayed resultthat use the process-creation facilities of the threads package. Processes created by the threadspackage as part of a Raven application are mapped to Thread objects. To perform this mapping,the threads package was modified to allow application-level data to be associated with a thread.The runtime system uses this feature to map system threads to a corresponding Raven Thread.5.3.2 CompanionsIn Raven, a companion is a collection of method invocations to be run in parallel. A companionis identified by a Certificate. Each invocation within a companion (Figure 5.8) is allowed to runin parallel with each other and the initiating thread. In Raven this is implemented through aCompanionThread. A CompanionThread is a subclass of Thread and differs from Thread only inits added support for InvokeStreams (Section 5.4.6). In the threads subsystem there is a corresponding companion thread process type. 
It is like a regular thread except that there are limits to how many companion threads may be active at once (Section 4.6.2). Figure 5.8 shows a sample Raven companion containing two invocations, along with a pseudo-code description of the actions undertaken by the Raven system to support the companion.

// Raven code
!{ res1 = workstation1.displayTime, res2 = workstation2.displayTime() }.start();

// Pseudo-code illustrating the underlying use of the Raven system
- create a Certificate
- create a CompanionThread
- add the companion thread to the certificate
- indicate to the companion thread the object, methods, and arguments for the invocation on workstation1
- create a CompanionThread
- add the companion thread to the certificate
- indicate to the companion thread the object, methods, and arguments for the invocation on workstation2
- invoke the start method on the certificate

FIGURE 5.8 Actions undertaken upon companion creation.

Since all companions are identified by a Certificate, the first action in setting up a companion's execution is the creation of its Certificate. If the Certificate had a tag value assigned to it, it would be set at this point. For each invocation within the companion the following steps are performed:
1. Create a CompanionThread.
2. Add the CompanionThread to the Certificate.
3. Invoke methods on the CompanionThread to indicate the object, methods, and method arguments to perform the invocation with. The location of where to put the returned result, if needed, is also specified.

Once all the statements comprising the companion have been processed, the CompanionThreads can be started. This is done by invoking start() on the Certificate or pushing the Certificate into an InvokeStream, where the start is performed implicitly.

Companions, in Raven, are new threads of control. They are created through the thread package's process-creation services. Simple process creation can provide most of the functionality needed for companions, but not for companion threads. To support companion threads, a finer degree of control over process creation is required. This control is provided through modifications to the process-creation routines. Companion threads are supported by providing the ability to classify the processes being created, and by creating the process in the initially suspended state. The classification type is needed so that limits on the number of companion threads can be enforced. New threads need to be created in the suspended state to permit Raven-level Threads to be attached to the new processes. Once the attachment is made, the controlling of a thread's execution is managed through Thread objects. When a companion thread is created in Raven, it is initially suspended and the companion thread object ID is returned to the application. The application then issues the start() method on the OID to initiate the thread's execution.

5.3.3 Early Result

An early result differs from a companion in two ways:
1. The new thread needs a copy of the execution state of the method. In particular this means that copies of all local variables, including parameter values, are required.
2. It may need a lock on the object instance data.

When an early result is performed, a new thread of control is introduced where the early result is performed. The initial thread of control returns to the calling object, and a new thread continues execution in the current object.
Since the new thread is executing instructions afterthe result statement, it is expecting to access all the parameters and local variables that wereactive prior to the early result. To supply the new thread of control with a copy of the executionstate, a limited Fork () function was added the threads subsystem. This Fork () function wasdesigned specifically to support early result and takes advantage of the fact that stack framesdo not contain pointers to other stack frames, and that only the stack frame of the method doingCHAPTER 5: Implementation Details and Issues 155the early result needs to be preserved. This makes the restricted Fork () simpler to implementand quicker to execute, since it has less work to do than a full-featured Fork ()Figure 5.9 shows a Raven code fragment taken from Figure 4.1, along with some pseudoII Example Raven code showing early resultreturn_code = message.validFromField;result return_code;II Pseudo-code to describe actions to perform for early resultif (early result hasn’t been done before) then {-indicate that early result has been done-attempt to acquire any needed locks for the new threadif (could not get lock) then-compute return result and save itelse { II Got the needed locks so can do early result-perform Fork()if (in parent process) then {-compute return result of methodreturn result) else { II In the child process-create a new Raven Thread object-in user setable data of this process record thethe object ID of the new Thread object-change the session ID in any granted locks to bethis thread’s session ID.}FIGURE 5.9 Pseudo-code illustrating early resultcode to outline the steps the system takes when an early result is done. Only one early resultcan be done during each invocation, so the first thing the code for an early result does is checkto see if an early result has already been done. If an early result has already been executed, thenexecution continues as if the result statement was not executed. If an early result has notCHAPTER 5: Implementation Details and Issues 156been performed, then processing continues in an attempt to start a new thread of control. Thenext step is to acquire any locks the new thread of control might require. It may be that boththe new thread and the existing thread may need locks on the object, and these locks could bein conflict, so a call is made to the lock management routines to determine whether or not locksrequired by the invoker and the new thread can be held simultaneously. If there is a lock conflict, the return result is computed and saved for the actual return and execution continues to besingle threaded. If, however, the locks can be granted, then a new thread of control can bestarted and a call to Fork () made. When Fork () returns, the parent computes the returnresult and returns. The child process then needs to create a Raven Thread to associate with itsthread in the thread subsystem. After creating the Thread object, a method is invoked to attachthe system-level thread to the Raven Thread object. Within this method the process ID of thethread is recorded in the object’s instance data and the user setable instance data in the systemthread descriptor is set to the object ID of the Raven Thread object. Once this is done, the newthread proceeds to execute the code following the result statement.5.34 Delayed ResultA delayed result is different from a companion or early result in that it does not create a newthread of control. 
Instead, it assumes that there are already multiple threads of control within the Raven system, and that if one thread of control needs to wait for certain conditions to be met then it can do a delayed result, thereby suspending itself. Another thread of control will return a result for it at a later time.

A delayed result has two parts to it. One is the method doing the leave and the other is the method that will ultimately complete the delayed result with the result statement. The result statement takes an argument in square brackets that indicates which waiting thread to complete, along with the return result. (Without a square-bracket argument the thread to reply to is considered to be the calling thread and the result is an early reply.) Using Figure 4.2 as a base, Figure 5.10 excerpts the relevant Raven code and provides a pseudo-code description of the actions the system performs for a delayed result.

behavior getResource {
    waiting_threads.add(me);
    leave;
}
// Pseudo-code
- store OID of waiting thread in a shared data structure
- release any locks on object
- wait for another process to send a result
- receive result
- reacquire locks
- return result

behavior freeResource {
    next_user = waiting_threads.nextItem();
    result [next_user] OK;
}
// Pseudo-code
- get OID of thread to send result to
- use blocking send to send result to thread

FIGURE 5.10 Delayed result example and pseudo-code describing actions taken by the system.

If multiple threads have outstanding leaves within the same object, the method(s) performing the completions need to be able to determine the thread of execution to complete. The way an application accesses the underlying threads is through objects of the Thread class. Each system-level thread has associated with it an instance of the Raven class Thread. A method can get the OID of the Thread it is running in through the reserved word me. Once a method has its thread OID, the OID can be stored and a leave performed. The first action undertaken as part of the leave statement is the releasing of any locks this Thread is holding on the object. This ensures that another thread can get any locks it needs on this object so that the completion of the delayed result can be accomplished. Once the locks are released, a call to Receive() is performed to wait for the result. Receive() is a blocking operation, and when it completes the receiver will have the result to return as the delayed result.

The method doing the completion uses the stored Thread OID to specify the underlying thread to send the result to. There is the slight possibility that the thread being replied to has not performed its receive yet. This is not a problem, as the sender will block, waiting for the receiver to become ready.

The threads package is a realization of a virtual machine description that the Raven runtime system is targeted to. When Raven is ported between different architectures, only the threads package needs to be modified, since it is the only piece of Raven software that interacts directly with the host system. The only part of the threads package important to the runtime system is the interface; therefore, considerable flexibility is available when a port is done. This flexibility was demonstrated in the port of the threads package to Mach 3.0 [70]. Mach's support for thread creation and management is very similar in functionality to the services provided by Raven's thread package.
Consequently, instead of porting the threads packagedirectly, the interface of the threads package was mapped to the Mach system calls throughwrapper functions.5.4 Remote InvocationIn the previous chapter the general design approach used to achieve transparent method invocation was described. This section describes Raven’s support for remote invocations that helpsrealize this transparency. To actually accomplish a remote method invocation, considerationwas given to the following issues:1. What relationship has to exist between class hierarchies on different machines?2. How are problems associated with heterogeneous machine environmentsaddressed?3. How are the remote objects located (found)?4. What mechanism needs to be used to exchange data between Raven environments?Before elaborating on these points consider Figure 5.11, which details the steps of aremote invocation. In this figure rectangles represent objects and circles represent processes.The remote invocation is started by an initiating object in environment A performing an invocation on a proxy. The method lookup routine for the proxy then calls a remote request handlerroutine. This routine marshals the request, contacts the local communications manager process,and instructs the communications manager to send the data comprising the request to the enviCHAPTER 5: Implementation Details and Issues 159ronment where the target object resides. The communications manager contacts the remotecommunications manager process and together they ensure the reliable delivery of the remoterequest handler’s data. Once the remote communications manager begins receiving the data, itstarts the remote worker process.The remote worker is responsible for examining the incoming data stream to determinewhat should be done with the request. In the current implementation, the only request expectedis a remote invocation request. Any other request type is considered an error and ignored.The remote worker continues to accept data until a completed request is received. Arequest is completed when the target object has been identified, the method to execute specified, and all the arguments received. The de-marshalling of arguments might require proxies tobe created or objects to be copied. The remote worker, which is operating as part of the runtimeRaven environment A Raven environment BFIGURE 5.11 Remote invocation sequenceCHAPTER 5: Implementation Details and Issues 160system, assembles the incoming data into the format of an invocation request and calls amethod invocation routine to lookup and execute the method. When the method completes, aresult is returned to the remote worker and the information concerning the returned result ispropagated back to the initiating object.5.4.1 Class relationshipsEach Raven environment has within it a class hierarchy. The class hierarchy is accessed for allthe local actions associated with objects, and it plays a special role when a remote request isreceived at a Raven environment. As part of a method invocation, an invoker may pass asparameters references to other objects in the system, or an actual copy of the object. A copy ispassed if the copy option was specified for a parameter during the method declaration. 
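A minimal sketch of how one parameter might be marshalled under this rule is given below; the encoder handle, tag values, and encode_* routines are illustrative assumptions rather than the runtime's actual interfaces.

typedef unsigned long oid_t;               /* local object reference (OID) */
struct encode_buf;                         /* opaque encoder state (assumed) */

extern void encode_tag(struct encode_buf *b, int tag);
extern void encode_gid(struct encode_buf *b, oid_t obj);         /* writes the object's GID */
extern void encode_object_copy(struct encode_buf *b, oid_t obj); /* class name, properties, instance data */

#define PARAM_CAPABILITY 1                 /* illustrative tag values */
#define PARAM_COPY       2

void marshal_parameter(struct encode_buf *b, oid_t param, int copy_option)
{
    if (copy_option) {
        encode_tag(b, PARAM_COPY);
        encode_object_copy(b, param);      /* may recurse through the instance data */
    } else {
        encode_tag(b, PARAM_CAPABILITY);
        encode_gid(b, param);              /* the receiver matches it locally or builds a proxy */
    }
}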
Whentwo Raven environments interact there is the potential for the class hierarchies to be in conflict.A conflict results when a class does not exist, classes have the same name but do not describethe same type of object, or if the method signatures for a class are not identical.The problem can be highlighted by considering what happens when a global ID isreceived in a Raven environment. This GID needs to have a proxy constructed for it, and theproxy must be associated with a class so that its is-a relationship can be established. The classname of the object is sent as part of the GID, so that the is-a relationship can be established.Copied objects also rely on the remote site class information to provide a proper description ofthe object’s instance for reconstruction purposes and to provide the actual method code that canbe run on the copied objects. In constructing either a copy or a proxy, the class is first lookedup. If the class is not present on this machine then a problem exists.One solution to a missing class is to ask the class hierarchy where the original objectresides for the needed information. It might be possible to splice this information into the classhierarchy if only proxies existed; however, in the case of a copy, the method code would alsohave to be transferred. This presents numerous problems with respect to locating and transferring the actual object code for the methods and then splicing it into the target system. AddiCHAPTER 5: Implementation Details and Issues 161tional problems would also be expected if this were done in an heterogenous environment withdifferent machine architectures. Simply passing enough of the class description to allow the siteto perform method dispatch also would present a problem. An object could be asked for itsclass and then a new () operation performed on the class. In this case, the method’s code wouldbe needed, so passing a method signature is not an acceptable solution. Numerous variationson this approach to solving the problem are possible. For example, one could have a centralclass hierarchy that all environments refer to. This solution is best suited to a homogenous environment connected via a local area network. The solution becomes less feasible when themachines are heterogenous and possibly physically separated, such that updates to the classhierarchy from a central repository would be impractical. Additionally, the simple grafting ofa class into an existing class hierarchy will not ensure that the parent classes are compatible.One way to deal with this problem would be to have all environments have the same classhierarchies. This would ensure that copies could always be made and proxies constructed, butit means that the resulting environment is the union of all the Raven environments involved inthe distributed computation. The problem with this approach is that it places the onus on theprogrammer to make sure that the Raven environment being constructed has all the neededclasses. It also makes it difficult for environments not originally intended to be involved in thecomputation to successfully become part of the distributed computation.A slight modification to this last idea is the one used in Raven. Raven encourages typing,both for instance data and for parameters. This greatly reduces the probability that a GD foran object will enter an environment where its class is not known. This means that it is not necessary for every environment to know about all the classes, only the ones that it will usedirectly. 
It is also assumed that classes with the same name have the same organization ofinstance data and the same method signatures.This does introduce the small possibility that a GID will be delivered to an environmentthat does not know how to construct a proxy for it. Since the assumption being used with thisapproach is that each environment is self contained with respect to its class hierarchy, then it isCHAPTER 5: Implementation Details and Issues 162not an option to obtain a class description from elsewhere. One way to deal with this problemis to create the proxy as an instance of class Object with the expectation that no methods thatthe class Object does not know will be used on the proxy. Although this may be acceptable forobject references, it is not an acceptable approach for copy parameters. In the copy scenario,information about the copied object is lost. In particular, the information about what class theobject is a member of is missing. This missing information could be detrimental to the correctoperating of the object, especially if the object copies itself or is passed as a parameter toanother Raven environment. Given the nature of the problems associated with making a proxyor copy be a member of Object, the decision was made to return an error indication to theinvoker. The fact that a class does not exist will be discovered while the remote worker processis decoding the request and before an attempt is made to execute the request. Consequently, nochanges to the normal method dispatching software are required. With Raven’s strong typing,the arrival of an unknown object type at a Raven environment probably indicates some sort ofprogramming error. In an environment where distributed applications are directly derived fromprograms that run in a single environment and have as one of their aims the attempt to harnessthe processing power of distributed idle machines, this is a reasonable assumption.However, as applications become less tightly coupled, and larger both in scale and physical dimensions, the assumption that the receiving Raven environment should know all aboutthe classes associated with the objects delivered to it is not reasonable. In a client/server model,for example, the server could be providing an OlD storage service for clients on the networks.A mail service is a prime example of this. Both the sender of a message and the receiver of amessage may know how to deal with objects forming the body of the message. There is no needfor the intermediate message transfer agents to know anything about the objects forming thebody. Addressing the problems associated with this scenario have been left for future work.The three remaining issues that have to be addressed in order to support distributed computations are: dealing with heterogenous machines, locating objects, and transferring data. Inbroad terms these issues are all related and each forms a small part in the mechanism used toCHAPTER 5: Implementation Details and Issues 163establish a communications path between Raven environments. Within the Raven environmentthe communications managers (CMs) are responsible for managing network access. The CMaccepts packets for transmission and sends them to the destination machine. On the remotemachine the CM accepts the packets from the network and then, based on a protocol type fieldin the packet header, passes it to a routine that knows how to deal with packets of this type. 
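A sketch of that dispatch step follows; the header layout, the protocol-type values, and the handler table are assumptions made for illustration and are not the actual packet format used by the communications managers.

#define MAX_PROTO_TYPES 16

struct packet_header {
    unsigned short proto_type;             /* selects the protocol handler */
    unsigned short payload_len;            /* bytes of payload following the header */
};

typedef void (*proto_handler_fn)(const struct packet_header *hdr,
                                 const unsigned char *payload);

/* Filled in at start-up; the remote invocation protocol registers its handler here. */
static proto_handler_fn proto_handlers[MAX_PROTO_TYPES];

static void cm_dispatch(const unsigned char *pkt)
{
    const struct packet_header *hdr = (const struct packet_header *)pkt;

    if (hdr->proto_type < MAX_PROTO_TYPES && proto_handlers[hdr->proto_type] != NULL)
        proto_handlers[hdr->proto_type](hdr, pkt + sizeof(*hdr));
    /* packets with an unknown protocol type are dropped */
}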
The protocol handler examines other fields in the packet, determines which Raven process the packet is destined for, and arranges for delivery of the packet to that process. In Figure 5.11, the remote worker and the remote request handler are responsible for managing the stream of messages between the two Raven environments. These two processes manage a reliable streamed request/response protocol between each other by breaking a stream of data into packets that the communications managers transfer. The result is an invocation protocol with at-most-once semantics. The resulting protocol is functionally similar to other remote procedure call protocols [98][56]. If subsequent performance measurements of the Raven system identify problems with remote interaction, a high-performance remote procedure call facility could be adapted for use in Raven.

Figure 5.12 shows the format and fields transferred as part of an invocation request. The data can be broken into three parts: information associated with invoke streams, information about the actual request, including the target object, and information associated with the parameters. All the fields of this request are encoded and decoded using XDR [98]. The fields, in the order they appear in the request, are:

Substream Object ID | Substream Sequence Number | InvokeStream GID
Target Object GID | Method Name (variable-length text field)
Session ID | Return Type | Parameter Count
Parameter Classification | Parameter-classification-specific data (variable length)
(the last two fields are repeated once for each parameter)

FIGURE 5.12 Example of the data fields for an invocation request.

The first part of the request is used to support the InvokeStream class. The invoke stream data is always present and associates this request with a particular invocation stream. Special values are assigned to the invoke stream fields if this request is not part of an invoke stream. Further details on the implementation of invoke streams can be found in Section 5.4.6.

After the invoke stream information, data about the actual remote request is presented. The fields associated with managing the request are:

• Target Object GID. This is the GID of the object the method is to be invoked on. Since the target object is on the recipient machine, the GID will specify the local object that is the target of this request.
• Method Name. This text field contains the name of the method to be invoked. In the local case method names are represented by a hash value, but in the remote case the actual method name is sent. This allows different Raven environments to use different hash functions to compute the method ID, or to use an entirely different technique for dealing with method dispatching.
• Session ID. Session IDs are used in the support of recursive method invocations that cross machine boundaries (Section 4.5).
• Return Type. This is the return type of this method invocation. The type is not a type as defined in the Raven language, but a low-level type that indicates the format of the data being returned. The possible return types are integer, copy, and capability. If the type is integer, then a primitive integer is being returned. If it is copy, then a copy of the object is to be returned, and if it is capability the GID of the object is returned.
• Parameter Count. This is the number of parameters being sent. This allows the application to determine when all the data for this request has arrived.
• Parameter Classification. This field is similar to the return type field.
Each passedparameter is classified as to format, using the same classifications as ReturnType, with some minor extensions. (These extensions are described later.) Thisfield is always paired with the next field, and together they provide all the information needed to decode a parameter. These two fields are repeated in theremainder of the request until Parameter-Count parameters have been decoded.• Parameter classification specific data. This is the actual parameter data todecode.Some of the information passed in the request could be derived by looking up the methodand examining the appropriate data fields, but that is not done. It was decided to encode theinformation directly into the request to avoid duplicating the method lookup steps. Additionally, not all Raven method invocations have a fixed number of parameters. When a method witha variable number of parameters is invoked, the receiving site must be prepared to decode allof the parameters and make them ready for use. In this case the required information cannot beobtained from the method description and has to be contained in the remote request. In theinterests of keeping the remote request processing as uniform as possible, parameter typinginformation is always passed in the request.Each passed parameter is sent in two parts with the first part being the parameter classification field described previously. The first part of the parameter specification is the type ofparameter being passed. This indicates whether the parameter is an integer, a floating pointnumber, an object ID in the form of a GD, or a copy. Floating point numbers and integers aredecoded appropriately, and global IDs are converted to a local reference either by matching theobject directly or through a proxy. When an object is passed as a copy parameter, the object andits instance data are recursively copied until all the instance data for the object has been sent.The copy operation is a deep copy that will cross machine boundaries if needed. Copying anobject consists primarily of identifying the class the object is from, specifying the properties ofthe object, and encoding the object’s instance data. With this information, an object of theappropriate class can be constructed and the passed instance data to placed into the new object.CHAPTER 5: Implementation Details and Issues 166Once all the parameters have been decoded the remote site has the following information:• The local object ID of the target.• The hash value of the method to invoke as derived from the method name thatwas passed.• The local object IDs for all the parameters. (Some of the local IDs may referenceproxies.)• The session ID.• The format of the return type.To properly handle locking, the session ID of the thread that has decoded this request, and willultimately execute the request, is set to the passed session ID. Once the session ID has beenchanged, an invocation request is constructed and a method invocation routine called to executethe request. When the request completes, the return type is used to format the response and theresult is returned to the initiating site.5.4.2 Avoiding Copying LoopsWhen an object is being copied, special attention must be given to proxy objects and to dealingwith circular references. An example of a circular reference occurs when an object’s instancedata contains a reference to itself. Attempting to perform a recursive copy operation in this situation would have disastrous results, since the copy operation would not terminate. 
As part ofthe copy operation, the instance data of the object must be marshalled. For a proxy there is noinstance data held locally so the data cannot be marshalled. To get the data for marshalling, thesite where the proxy resides needs to be consulted. This will result in a copy from the real objectsite to the proxy site, with the proxy site then copying this information to the site of the methodinvocation. Such an approach incurs needless copying. Instead, a proxy targeted for copying istagged for remote copying within the request, and the GID for the proxy is transmitted. Thereceiving site detects this, and if the object is resident at the receiving machine, the copy canbe done locally. In the worst case the receiving site simply asks for a copy from the site whereCHAPTER 5: Implementation Details and Issues 167the real object exists, thereby saving a copy and network access compared to having the initiating site perform the copy.When an object is being copied, it may contain references to either itself or to a sequenceof objects that eventually reference itself, thereby resulting in a cycle of object references.Blindly recursively encoding these references will result in an infinite loop. To avoid this problem, some mechanism is required to detect the cycle and then communicate this information tothe remote site.A two-level approach is used to avoid cycles. Each copied object is transmitted in potentially two parts. The first part is the object’s global ID, followed, if needed, by the informationnecessary to duplicate the object. On the sending side, before the copying of the top-levelobject is started, a data structure keyed on the object 1D is built. Before an object is encoded, acheck of the data structure is made to see if this object has already been sent. If the object hasnot been sent then the encoder places the object’s GD in the data stream and registers the objectin the data structure. The GID is then followed by the information needed to copy the object.If the object has been seen before then only the GID is placed in the stream.On the receiving side, a data structure similar to the one used by the sending side is created at the start of a top-level decoding. This structure is keyed on the sent GID and has associated with it the object ID that resulted when the object was reconstructed. When an object isbeing decoded, the GID is the first item encountered. This value is looked up in the data structure to see if the object has been encountered before. If the object has been seen, then the localobject DJ is retrieved and used as the object ID. If the search reveals nothing, then the GID isadded to the lookup structure, and a new object is constructed based on the data in the streamfollowing the object ID. After the object has been fully constructed, the new local ID is associated with the GID used in the lookup.By using this approach, cycles are avoided even if they cross machine boundaries. For acycle to cross a machine boundary, a piece of copied instance data must be marked as a remoteCHAPTER 5: Implementation Details and Issues 168copy. Upon receiving this data the decoder first checks, using the GID, to see if the object hasbeen copied. If not, it will ask the remote site for a copy, which the remote site will provide. Inthat copy there could also be a piece of data marked as a remote copy. If that object has alreadybeen copied, then when the decoder encounters this item it will find it in its list of copied objectsand not make a request of the remote site for a copy. 
By using this approach, cycles are avoided even if they cross machine boundaries. For a cycle to cross a machine boundary, a piece of copied instance data must be marked as a remote copy. Upon receiving this data, the decoder first checks, using the GID, to see if the object has been copied. If not, it will ask the remote site for a copy, which the remote site will provide. In that copy there could also be a piece of data marked as a remote copy. If that object has already been copied, then when the decoder encounters this item it will find it in its list of copied objects and will not make a request of the remote site for a copy. If that object was part of a cycle, then the cycle will be broken. In addition, any object referenced multiple times throughout one encoding operation will only be reconstructed once at the destination site. In this way, the same relationship between the original objects and the copied objects is maintained.

5.4.3 Object Encoding

The discussion of the copying of objects has carried the implied assumption that the two sides engaged in processing the remote invocation agree upon the storage format of the data being exchanged. In a homogeneous environment the possibility of a data format mismatch is often overlooked. However, being designed to operate in a heterogeneous environment, Raven has, from the outset, taken these issues into consideration. Some form of external data representation has always been considered. The primary candidates for the external data representation were ASN.1 [59], Sun Microsystems' XDR [98], and a customized exchange format.

Using a customized format was quickly ruled out as unacceptable from the perspective of development time and completeness. The ASN.1 and XDR systems available provided support to ease the job of specifying the encoding rules to use at the application level. An important consideration in the selection of the encoding format was being able to encode into a fixed region (a transmission buffer) and then, when the buffer was filled, notify the application. The application could then take the buffer and arrange for it to be transmitted to the remote site. Once arrangements for the sending are made, the encoder is allowed to continue its encoding into a new buffer. With this approach, the encoder essentially produces a stream of packets filled with the encoded data. The stream of packets is then turned into real network packets for transmission. On the receiving side, the decoder needs to be able to consume packets in a similar manner. XDR, with some minor modifications, was able to provide this service, while the ASN.1 packages available at the time could not. XDR routines are also more generally available than ASN.1 routines, so issues of portability also helped in selecting XDR as the external data representation system for Raven.
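A minimal sketch of this buffer-at-a-time encoding style, using the standard Sun XDR memory-stream routines, is shown below. The flush_buffer() routine and the buffer size are assumptions standing in for Raven's actual transmission code.

    #include <rpc/rpc.h>

    #define BUF_SIZE 1400              /* roughly one UDP payload; illustrative value */

    static char buf[BUF_SIZE];
    static XDR  xdrs;

    /* Hypothetical hand-off to the transport layer. */
    extern void flush_buffer(char *data, u_int len);

    static void start_buffer(void)
    {
        xdrmem_create(&xdrs, buf, BUF_SIZE, XDR_ENCODE);
    }

    /* Encode one integer, shipping the current buffer and starting a new
     * one when the encode fails because the buffer is full. */
    static void encode_int(int value)
    {
        if (!xdr_int(&xdrs, &value)) {
            flush_buffer(buf, xdr_getpos(&xdrs));
            start_buffer();
            xdr_int(&xdrs, &value);
        }
    }

The decoder on the receiving side can consume the arriving buffers in the same fashion by creating an XDR_DECODE memory stream over each one.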
5.4.4 Locating Objects

Taken together, the previously described services provide a mechanism to initiate a remote invocation. The only remaining issue concerns actually locating objects at runtime. The key to locating an object is its GID. When an invocation is determined to be remote, the invocation software gets the GID that the proxy is associated with. Since Raven currently runs as a process in a variety of Unix environments, communication between Raven environments is accomplished using UDP. Raven's own higher-level protocol, implemented using UDP, is designed to support remote invocations. To operate the protocol correctly, the GID needs to provide information on how to contact the Raven environment where the object resides and, once contact is made, how to identify the object on the target machine. The result is that the GID contains four pieces of information:

1. The host ID, in IP form, of the machine where the target Raven environment resides.
2. The local ID, or port number, of the process that the Raven environment is running in.
3. The actual object ID, on the remote machine, of the target object.
4. The class name of the object. (This is required so that the receiving site can construct a proxy for the remote object.)

Combined with the class description maintained on the local machine, this information is all that is needed to construct and deliver a remote execution request. As long as an object does not move, which is the case since the current Raven implementation does not support object migration, this scheme will work. As soon as object migration is allowed, some mechanism will be required to permit an object's proxies to be updated. This might take the form of a forwarding mechanism, or perhaps a more general lookup procedure that allows a site to locate an object when the location specified by the GID does not work.
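The four fields translate naturally into a small structure. The layout below is only illustrative; the actual Raven definition is not given in the text.

    #include <netinet/in.h>

    #define MAX_CLASS_NAME 32               /* illustrative limit */

    typedef struct {
        struct in_addr host;                /* IP address of the target Raven environment      */
        unsigned short port;                /* UDP port of the process the environment runs in */
        unsigned long  object_id;           /* object ID within that environment               */
        char class_name[MAX_CLASS_NAME];    /* class name, so the receiver can build a proxy   */
    } raven_gid;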
5.4.5 Name Server

When Raven is being used in a distributed environment, a mechanism is required to allow the disjoint Raven environments to establish communication with each other. Communication at the application level is not established directly by the application, but implicitly by performing method invocations on remote objects. To start this interaction, Raven provides two methods that make it possible for disjoint Raven environments to locate objects in each other's space.

Within each Raven environment one well known GID is maintained. This GID is reserved and allows the system to provide a mapping from the well known GID to a designated local object. All objects respond to the method makeWorldVisible(). When this method is invoked on an object, it causes a well known GID to be constructed for that object and registered within the runtime system. When a remote request is received at a Raven environment, the incoming GIDs are first checked to determine whether they match the well known GID.

For a system to be able to access objects that have been made world visible, a way to create a proxy object with the GID of the remote object is needed. An underlying assumption of this interaction is that the application wanting to communicate with the remote object knows its class. This is not an unreasonable assumption, since the need to access the object is probably based on the desire to use some service, and to use the service, knowledge of the class will be required. To create the proxy object the application invokes the method configureAsRemote() on the class object corresponding to the remote type. The method will create a proxy for an object of this type, using its method arguments to construct the appropriate GID. To construct a GID the following information is needed: an ID that identifies the host the target Raven environment is running on, an ID that identifies the particular Raven environment on the target machine, and an ID that can be mapped to the local object. When configureAsRemote() has been invoked, it returns an object ID that identifies the proxy to the remote object.

With configureAsRemote() and makeWorldVisible(), Raven environments can make objects accessible to one another. Each Raven environment is restricted to supporting only a single mapping of a well known GID to a local ID, so this cannot be used as a general name server. By using configureAsRemote() and makeWorldVisible() as a foundation, however, a more general-purpose name server can be constructed.

The name server in Raven works by designating one Raven environment as the host of the name server. The designated environment is responsible for the creation of an instance of class NameServer. The class NameServer defines the method interface to the name server object. The provided NameServer class supports methods to register a name along with an associated object identifier, to look up an object based on its name, and to remove a name-to-object mapping. A Raven environment that wants to use the name server object creates a proxy for it by invoking configureAsRemote() on the class NameServer. The arguments to configureAsRemote() provide the host and environment identifiers needed to construct the proxy. Figure 5.13 shows the steps required on the client and server to create and access the NameServer.

// Code on server side
name_server = NameServer.new();
name_server.becomeWorldVisible(456);

// Code on a client side
name_server = NameServer.configureAsRemote(REMOTE_HOST, LOCAL_ID, 456);
// The name server is now ready to use.

FIGURE 5.13 Creating a name server.

Once the proxy has been created, the application can register objects it wants other environments to use. The application can also look up objects that it wants to use. Queries to the name server simply return object IDs in the form of GIDs. The runtime system will convert the GID to a local ID for use in the local environment.

Currently, the application is responsible for either creating the name server object or the proxy to it. This approach was taken since not all applications will make use of a name server. The determination of the name server location and the creation of a proxy to it are system configuration issues. As Raven matures as a system, the functions associated with the creation of the name server and references to it should be made part of the start-up processing for the system.

5.4.6 InvokeStreams

InvokeStreams are used to sequence requests targeted at the same object when the requests cannot be executed in parallel with each other, but can be executed in parallel with other requests (Section 4.3.2). An example of such a sequence of requests is shown in Figure 5.14.

stream.push(!{ workstation.positionCursor(cursor_position) }!);
stream.push(!{ workstation.displayData(data1, amount) }!);

FIGURE 5.14 Example of stream usage.

In this example, two requests to manipulate a display are executed. These requests can be executed in parallel with the initiating thread, but must be executed serially with respect to each other; therefore, they are pushed into an InvokeStream object. The stream object performs an implicit start on the CompanionThreads making up the companion and ensures that the requests are executed serially relative to each other.

There are two components central to the implementation of InvokeStreams. One is the class InvokeStream and the other is the class Sequencer. InvokeStream provides the application-level definition of the invoke stream service, and the Sequencer object implements the service at the Raven system level. Within an InvokeStream, each unique invocation target object is associated with its own substream ID. This ID identifies a stream of requests all targeted at the same object. Individual method requests are assigned a sequence number within the appropriate substream. The sequence number specifies when a method should execute relative to the other requests on this substream. When all the method requests on a substream have completed, that substream is discarded. A new substream ID is used if the target object of the discarded substream reappears.
Discarding substream IDs conserves memory and reduces execution time by maintaining only the information currently needed to manage the outstanding streams. When a companion is pushed into an InvokeStream, the following actions are performed by the push() method for each constituent CompanionThread:

• Determine the substream ID for the request. A new substream ID may need to be assigned if there are no outstanding requests for the target object.
• Assign a sequence number for the method request.
• Associate the sequence number, InvokeStream ID, and substream ID with the CompanionThread.
• Issue a method to start the CompanionThread executing.
• When the method completes, check to see if the substream ID can be abandoned.

All the actions taken by the InvokeStream object simply establish where in the InvokeStream sequence a method should execute. The Raven runtime system is responsible for ensuring the proper sequencing of the method invocations; it is the responsibility of the Sequencer object to actually sequence the requests.

There is one instance of the class Sequencer per Raven environment, and it is created at system start time. The Sequencer maintains a list of all InvokeStream objects that have requests active in this Raven environment. When a method invocation is to be sequenced, a special invoke routine is used. This routine performs special pre- and post-stream processing. Before the method is executed, the invoke routine performs an invoke on the Sequencer object to have the request sequenced. The parameters to the request are the InvokeStream object ID, the substream ID, and the sequence number. Control does not return to the invoke routine until it is this method's turn to execute. When control returns, the requested method is executed and then the post-stream processing is performed before returning the result. The post-stream processing informs the Sequencer object that the method has completed, and that the next method on this substream can execute.

The Sequencer object is implemented entirely in Raven, and the stream invoke routine uses the method sequenceRequest() to request that an invocation be sequenced. The method requestCompleted() is invoked by the stream invoke routine once its invocation has finished executing. When the sequenceRequest() method is invoked, it checks to see if the request can be executed, and if it can, it returns immediately. Otherwise, the Thread object ID of the current thread is stored and a leave is performed. When the requestCompleted() method is executed, it checks to see if there is a request that can now be executed. If there is, a delayed result is performed and the waiting request can now execute.

The sequencing of requests is always performed at the site where the target object exists. This means that if a remote invocation is sequenced, the information necessary for sequencing must be passed to the remote site. From Figure 5.12 it can be seen that the first three fields of a remote invocation request contain the substream ID, the substream sequence number, and the initiating InvokeStream global object ID. If the request does not involve a substream, then the substream ID will be 0. If it does involve a substream, then the other two fields of the request are meaningful and they are decoded appropriately. To have the remote request properly sequenced, the worker process responsible for performing the invocation at the target site uses the special stream-invoke routine instead of the generic invoke routine.
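The pre- and post-processing performed by the stream-invoke routine can be sketched as follows. The stub types and the raven_invoke() entry point are illustrative stand-ins, since the thesis does not show the runtime's internal interfaces.

    /* Illustrative stub types; the real runtime structures differ. */
    typedef void *object_t;
    typedef void *args_t;
    typedef void *result_t;

    extern object_t sequencer;   /* the single per-environment Sequencer object */

    /* Hypothetical generic invocation entry point. */
    extern result_t raven_invoke(object_t target, const char *method, ...);

    result_t stream_invoke(object_t target, const char *method, args_t args,
                           object_t stream, long substream, long seqno)
    {
        result_t r;

        /* Pre-processing: does not return until it is this request's turn
         * on its substream (the Sequencer uses leave/delayed result). */
        raven_invoke(sequencer, "sequenceRequest", stream, substream, seqno);

        /* Execute the requested method. */
        r = raven_invoke(target, method, args);

        /* Post-processing: allow the next request on this substream to run. */
        raven_invoke(sequencer, "requestCompleted", stream, substream, seqno);

        return r;
    }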
5.4.7 Lock Management

One of the problems of operating in a distributed environment when using locking is dealing with recursive locking invocations that cross machine boundaries (Section 4.5). The general problem of supporting locking is addressed by associating a lock control data structure (Figure 5.1) with each controlled object. The lock control structure maintains a list of all the threads currently holding the lock and all the threads waiting to be granted the lock. When a request to acquire a lock on an object is made, the current lock holders are checked to see if the lock can be granted. If the lock cannot be granted, the thread is blocked until it can get the lock.

One of the problems identified earlier was dealing with locking when recursive invocations are made. The key to supporting locking with recursive invocations is being able to identify the thread of control a request is coming from. To do this, each thread has a globally unique session ID associated with it. Although the thread's Thread global object ID could serve this purpose, it was decided to generate a different globally unique ID to keep session IDs and GIDs separate. This reduces dependencies between unrelated parts of the Raven code and simplifies future changes if either the GID or session ID formats need to be modified.

A session ID is not confined to a single machine. When a remote invocation is performed, the session ID is transmitted as part of the request and assigned to the worker process performing the remote invocation. This means that all threads of control, in all Raven environments, that have the same session ID are in the same execution chain. The ability to determine whether the locks are being held by the same execution chain, even when the execution chain crosses machine boundaries, is crucial to the management of locks associated with recursive method invocations.
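The test made when a lock is requested can be sketched as follows. The structure is a simplification of the lock control structure of Figure 5.1 (it ignores lock modes and the waiter queue, for example), and the names are illustrative.

    typedef unsigned long session_id_t;

    typedef struct {
        session_id_t holders[16];    /* session IDs currently holding the lock */
        int          nholders;
    } lock_control_t;

    /* Returns 1 if the lock can be granted to the requesting session. */
    static int can_grant(const lock_control_t *lc, session_id_t requester)
    {
        if (lc->nholders == 0)
            return 1;                               /* nobody holds the lock */

        /* A recursive invocation: the same execution chain already holds the
         * lock, possibly via a request that arrived from another machine. */
        for (int i = 0; i < lc->nholders; i++)
            if (lc->holders[i] == requester)
                return 1;

        return 0;    /* held by a different execution chain; the caller must block */
    }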
5.4.8 Remote Invocation Failures

Section 4.9 discussed the difficulties associated with handling failures and exceptions in Raven. When performing a remote invocation there are two broad classes of problems that can occur. One class can be described as communication problems; this type of problem results from events like network partitioning, machine crashes, or protocol errors. The second class consists of problems associated with actually executing the method, such as being unable to locate the target object, being unable to decode a parameter or complete a copy operation, or perhaps the invocation itself failing because of a bug in the method code.

If any of these errors occurs, it is detected by the runtime system software managing the remote execution and waiting for the result. The software will detect the error and proceed to execute failure handling code. Currently, the failure handling code does little other than log the error. This is because it is not clear where notification of the failure should be delivered. Additionally, the runtime system signals the application of an error by returning nil. In many cases applications will not be checking for this failure, so it is not an effective general-purpose way of handling remote invocation failures.

5.5 Summary

This chapter has discussed the issues, implementation details, and programming problems that had to be addressed to effectively implement support for parallel and distributed computation in Raven. To do this, the object and class organization in the Raven system was discussed, including details of how objects are organized in memory, how classes are organized, and how methods are dispatched. A method ID hashing scheme was developed to speed up dynamic method lookup and to eliminate conflicts during lookup. The scheme to handle remote method invocation was described, with attention paid to the implementation of proxies and to the copying of objects. The changes made to the threads package to provide support for the Raven runtime system were also described.

The distributed and parallel nature of Raven makes it difficult to determine when objects are no longer in use. A conservative garbage collector from Xerox is used to reclaim storage. This collector was modified to permit multiple memory requests to be granted simultaneously. For the purposes of testing, the allocation system performs adequately; however, under high demand its performance degrades rapidly. The degradation is attributable to the garbage collector, which has a large sequential component and can act as a barrier to other requests.

6 Performance

This chapter provides information about the performance of the Raven system. The chapter starts with a general discussion of the performance metrics used and how the measurements were performed on Raven. The performance results for several problems running on a shared-memory multiprocessor are then reported. This is followed by descriptions of a distributed mail application and other miscellaneous applications. The chapter closes with an overview of what the performance results have taught us about the Raven system and a summary of the chapter.

6.1 Performance Metrics

In many parallel programming environments the raw performance of the system, when compared to the best existing benchmark for the same problem in a similar hardware and system software environment, is all that matters. If the new system cannot offer performance comparable to existing systems, then it is a failure. Other environments, like Raven, however, emphasize the writing of system level programs, where increasing the raw performance of the system is not as important as it is in computationally intensive scientific applications. What is important is the ability to harness at least some of the extra processing power available in parallel and distributed systems. It is therefore crucial that the overhead of using the system's parallel constructs be small. The cost of using the parallel constructs can be determined by measuring the performance of an application using different numbers of processors relative to the sequential version the parallel application was derived from. By performing measurements in this way, the system overhead associated with using the parallel constructs can be determined, any problems with the scaling of the support software can be revealed, and insight into any system level bottlenecks can be obtained.
Although these measurements can provide some information on the costs of using parallelism, each application will have its own performance profile, and it is only by developing a large body of experience that the programmer can make an informed decision concerning how to employ Raven's parallel constructs.

To allow the performance of parallel programs to be assessed, some performance metrics need to be defined. This will allow comparisons to be made between various runs of the same problem using different numbers of processors or different algorithms [63]. The metric being used is the running time of the program, which is also referred to as wall clock or elapsed time. This particular metric was selected over others since it captures any inefficiencies or performance bottlenecks present in a program that are not captured in a metric like CPU time usage. The running time also captures any delays, like communications overhead, present in distributed applications but not in sequential single processor applications. The efficiency and speedup of a parallel application can be determined when the running time is known.

As an example of why the running time is a more interesting metric than CPU time usage, consider a sequential program that uses X seconds of CPU time. Assuming there is no overhead for parallelizing the application, the total amount of CPU time used when running in the parallel environment would remain at X seconds, regardless of the number of processors used on the problem. The elapsed time for completing the problem, however, will vary depending upon how successful the parallelization of the problem has been. In the most successful case the elapsed time will be reduced to X/P, where P is the number of processors being used on the problem. In practice, this is seldom the case, since some portion of the problem may be inherently sequential, and in general it is not possible to keep all the processors busy 100% of the time. A worst case scenario would have the elapsed time for the P processor solution be greater than the elapsed time for the same application running on a single processor. If the amount of CPU time used were the metric, then the discrepancies between the various scenarios could be absent. The running time of an application is also probably the most meaningful metric, since it tells the users exactly how long they will have to wait to get results. More formally, the running time for a program is given as rt(f, m, p), where f is the program being run, m is the machine, and p is the number of processors being used.

6.1.1 Speedup and Efficiency

Although the elapsed time is interesting, it does not directly provide any information about how well one implementation compares to another. A more useful metric for comparison is speedup [63]. Speedup captures how much faster one program runs relative to another. More formally, the speedup of program f1 running on p1 processors relative to program f2 running on p2 processors on machine m is:

    speedup(f1, p1, f2, p2, m) = rt(f2, m, p2) / rt(f1, m, p1)

To be of any use, the programs being compared should solve the same problem. The most common use of speedup is to compare a sequential implementation of a program with a parallel version (i.e. p2 = 1). Knowing the performance of a parallel program relative to a sequential one is important, because few people will write a parallel program when a sequential one performs as well or better than a parallel one.
In addition to providing information about speedup, the comparison between a single processor implementation and a parallel implementation on one processor provides insight into the overhead introduced by the parallel constructs.

The efficiency of a program is a relative measure of performance that compares a program to a specified standard. Efficiency is defined as speedup, relative to one processor, divided by the number of processors. The efficiency, eff, of a program f1 running on machine m using p processors, relative to a program f2 using one processor on machine m, is:

    eff(f1, m, f2, p) = speedup(f1, p, f2, 1, m) / p

For example, a program that takes 100 seconds on one processor and 20 seconds on eight processors has a speedup of 5 and an efficiency of 5/8, or about 0.63.

Program f2 represents the standard against which the first program is being compared. Depending upon the information desired, the standard may vary. In some situations the standard may be the best sequential implementation, while in others it might be the single processor run of a parallel implementation. In this last case, the metric captures the overhead associated with running the same program using different numbers of processors. An ideal parallel implementation will experience no additional program execution overhead, with the result that the program's efficiency will be 1. Synchronization, communication and shared resource contention typically prevent applications from achieving this, with the result that the efficiency for more than one processor is less than one.

The measurements that will be presented in this thesis are efficiency and speedup, calculated using the formulas just introduced. In both formulas, the ratios for speedup and efficiency will be between a sequential implementation and a corresponding parallel implementation using varying numbers of processors. Since one of the goals of this work is to make it easy to convert sequential programs to parallel programs, the parallel programs are derived directly from their sequential ones. For ease of presentation, the speedup and efficiency information is presented graphically.

All the measurements reflect the running time for the true working part of the problem. Being a large system, the Raven environment goes through a number of phases before the actual user part of the application is started. The main phases of interest are:

• The start-up phase.
• The application execution phase.
• Garbage collection phases.
• The shutdown phase.

To provide a realistic measure of the efficiency of the Raven parallel constructs, the running time reported corresponds to the application execution phase. This number is arrived at by measuring the total elapsed time and subtracting the times for the start-up, garbage collection and shutdown phases. The start-up phase involves setting up the memory allocation environment and loading and initializing the Raven class library. The start-up code is executed before any application code is run. Removing this time from the measurements is reasonable because the start-up time varies from program to program and the code is highly sequential and memory intensive. All of these factors contribute to an increase in running time that does not accurately reflect performance changes due to the use of the Raven parallel constructs. Since the start-up code is executed before the application begins, any changes in the application to support the various degrees of parallelism are properly captured.

Practical constraints also place limits on how long it is feasible to run an application program for testing.
Some of the timings for a single processor run take over two hours, while the equivalent multiple processor run takes only a few minutes. In these latter situations the running time is dominated by the start-up time and, as a result, does not provide an accurate measure of the costs of working in the parallel environment. Ideally, the runs using several processors should take hours, so that the start-up costs would become insignificant. If that were done, however, timing experiments using a single processor would take days to complete, and this was impractical.

The second running time component subtracted from the reported times is the time spent in the garbage collector. Garbage collection occurs when a request for storage cannot be granted without first reclaiming unused memory. When this happens, some processing is done while the other threads continue to execute. But at some point all the other processes have to be stopped so that the marking and sweeping of their data spaces can be done. This stop-the-world approach to garbage collection is not well suited for operation in a parallel environment. A garbage collector designed for use in a parallel environment would reduce or eliminate the need to stop other executing processes for garbage collection related activities, and this would increase the overall efficiency of the application. Subtracting the time that processes are stopped for garbage collection gives a more accurate measure of the performance of the parallel constructs. The performance should also be more representative of the efficiency and speedup expected with a parallel memory allocation and garbage collection system installed.

The performance numbers in Section 5.2 demonstrate that even with a predominately serial memory allocator it is possible to reduce contention within the allocator. Although the existing allocation code does not address the problem of performing storage reclamation in parallel, it does address the issue of acquiring storage in parallel. Improvements in performance similar to those obtained for allocation should be possible for the garbage collection phase, which remains a serial operation in this implementation. The garbage collection time, which includes a large sequential component, is not included in the elapsed time for an application's execution. By Amdahl's law [7], the speedup of a program is constrained by the sequential parts of the program.

An additional problem with garbage collection involves the starting and stopping of other threads during collection. During the stopping phase, parallelism is reduced as active threads are suspended. Likewise, during the restarting phase, parallelism gradually increases as the suspended threads are restarted. Another factor to consider is that suspended processes cannot free resources that other threads might need. This results in some processes waiting for resources and the parallelism being reduced. When processes are restarted, the parallelism gradually increases, but again it is possible that a restarting thread will be immediately blocked on a resource one of the suspended threads is holding. Additionally, while a collection is in progress, the target memory pool is locked. This locking prevents other threads from using the pool and causes threads to block waiting for access to the pool.

The third phase of an application's execution that was removed from the timing is the shutdown phase.
This shutdown code is executed once the user-written code has terminated. The amount of code executed depends upon the postprocessing the system is performing. In general this overhead is negligible. This is not the case in an instrumented system like Raven. The postprocessing in Raven consists of extracting system statistics and timings and displaying the information. Some of these operations are quite time consuming and are therefore excluded from the timings.

For the purposes of these measurements, an application performs two types of work. One is the useful work of the application; the other is the overhead, or administrative work, associated with servicing the application's requests. The ratio of administrative work to real work can be quite different between tests using different numbers of processors. This depends upon such factors as the application's sensitivity to the memory pool size. In one situation the application might run the garbage collector several times, while in another not at all. The addition of another processor can change memory allocations so that garbage collections are more frequent. Consequently, one application can pay a much heavier administrative cost for memory allocation than another. Again, changes to the memory allocation system can eliminate this problem.

6.2 Example Problems

To demonstrate the usefulness and practicality of the Raven support for parallel and distributed computing, several example applications were programmed. There are two major operating environments that Raven runs in. One is a collection of Sun SPARCstation class machines and MIPS machines connected via an Ethernet [74]; the other environment is a 20-processor Sequent Symmetry 81, courtesy of the Department of Computer Science and Engineering at the University of Washington in Seattle, Washington. The Sequent is a shared-memory machine that uses 80386 microprocessors running at 20MHz, and it runs Mach 3.0 with local modifications to support the shared-memory multiprocessor environment of the Sequent.

Each of these environments presents different properties to the application, and as a result each highlights something about the Raven environment. The Sun machines are distributed around the department and suffer the vagaries of being on the network. The machines are not isolated enough to be able to provide a consistent environment from one run to the next. People can log in to machines, unsolicited requests can be made of a machine, and the traffic load on the network can undergo dramatic changes depending upon what other machines on the network are doing. Compared to the immediacy of local invocations, the latency for remote object invocations can be substantial. The distributed-memory environment does have the advantage that contention for storage requests is greatly reduced when the applications running on the nodes are essentially single threaded. However, if the amount of concurrency on a single node is increased, similar problems with memory contention will be introduced.

The shared-memory environment of the Sequent is more controlled, and as a result provides a more consistent environment for conducting measurements of how well the parallel constructs within Raven are capable of scaling. It is for this reason that the only reported results are for a shared-memory environment. As pointed out earlier, this is not without its own problems, as the system resources often do not scale as processors are added.
For example, the system being used has 32 megabytes of main memory, and that remains fixed. With a single processor this amount of memory may be perfectly adequate for some applications. However, when 20 processors are used, the amount of main memory per processor is grossly reduced. The resulting amount of memory per processor may not be enough to execute the application process in a reasonable manner. Consider that if the application is memory-request intensive it would be using up memory twenty times faster than with a single processor, yet the system configuration has not changed.

6.2.1 Mandelbrot Set

The first example selected is a Mandelbrot set computation. It is an easy problem to parallelize and it supports a high degree of parallelism. By choosing a straightforward problem to parallelize, the goal was to immediately highlight any fundamental problems in the Raven subsystem and runtime system support for parallelism. Since the application is method and computation intensive, any problems with the method dispatch system or with the thread scheduling procedure would be quickly exposed. In essence, the Mandelbrot set computation provides a measurement of the basic level of overhead that the Raven system imposes when managing multiple threads of control on multiple processors. Since this problem has minor memory requirements, any potential contention areas associated with the memory management system are avoided. Although this problem may seem unduly simple, it has the same form and structure as such problems as ray tracing in graphics.

6.2.1.1 Problem Description

A simple description of the Mandelbrot computation is that it is a computation that applies a function to a number in the complex plane. A region of the plane is divided into a grid and the function is applied to each grid point. Each computation is independent of the other computations; consequently, it is possible for all the computations to be performed in parallel. This computation is an easy problem to parallelize and therefore useful for developing experience in working with Raven in a parallel environment.

The programming approach for this problem decomposes the computation into three major object components. One object is responsible for displaying or recording the data, one or more objects are responsible for the computation, and one object decides what points are to be computed. A subset of the Raven code used to perform the Mandelbrot computation on the Sequent multiprocessor is shown in Figure 6.1. Two versions of the class Main are provided. The first is a sequential version of the Mandelbrot computation. The second is the parallel version.
// Worker object that computes the
// Mandelbrot set one line at a time
class Worker {
}

behavior execute {
    var display : Display;
    var mand : Mandel;
    var line : Int;

    // Get information about the region
    // being computed (i.e. screen res.)
    display = disp_mgr.getDisplay();
    mand = Mandel.new(display);
    // Get first line to compute
    line = disp_mgr.getLine();
    while (1) {                        // compute lines
        if (line < 0) return;
        else {
            mand.computeRegion(line);
            disp_mgr.done(line);
            line = disp_mgr.getLine();
        }
    }
}

// Controller of computation - non-parallel version
class Main {
}

behavior start {
    var disp_mgr : DisplayManager;
    var worker : Worker;

    disp_mgr = DisplayManager.new();
    worker = Worker.new();
    worker.execute(disp_mgr);
}

// Controller of computation - parallel version
class Main {
}

behavior start {
    var disp_mgr : DisplayManager;
    var worker : Worker;

    disp_mgr = DisplayManager.pnew(CONTROLLED);
    while (!disp_mgr.computationDone()) {
        worker = Worker.new();
        !{ worker.execute(disp_mgr) }!.start();
    }
}

FIGURE 6.1 Subset of Raven code to perform the Mandelbrot computation.

When the Mandelbrot application starts, it first creates an instance of the DisplayManager object. The DisplayManager object encapsulates information about the region that the Mandelbrot set is being computed over, and it is responsible for displaying the computational results. The actual computation is performed one line at a time through a Worker object. When the Worker object has the execute() method invoked, it first contacts the DisplayManager to get display information. This information consists of data describing the spacing of the lines and grid points over the computation region and the number of colours to use. Next, the worker creates a Mandel object to perform the detailed Mandelbrot computation of a line. The worker object then continually queries the DisplayManager object for a line, computes the results for the line and returns the results to the DisplayManager. This continues until there is no more work to be done.

In a sequential computation, all the start() method does is instantiate a DisplayManager and a Worker object. Invoking the execute() method on worker is sufficient to compute the Mandelbrot set over the region defined in DisplayManager. A version of the code to perform the computation in parallel is shown in the second implementation of the class Main in Figure 6.1.

Parallelism in this application is achieved by creating multiple Worker objects and performing the invocation of the execute() method inside a companion. The creation of the companion and the check with the DisplayManager, along with a way to specify the number of companions to use, distinguishes the parallel computation from the sequential computation. In this example, the application relies upon Raven's automatic control of process creation (Section 4.6.2) to restrict parallelism to the number of available processors. This, however, requires the addition of a call to the DisplayManager to see if there is more work to perform. Another approach would have been to pass, as a command line argument, the number of worker processes the application was to use, and then have the application create only that number of workers.

6.2.1.2 Performance Results

Figure 6.2 provides a graphical representation of the performance results for running the Mandelbrot set computation on 1 to 20 processors. On the horizontal axis is the number of processors used by the computation, and the vertical axis has the speedup factor. The dotted line represents perfect (maximum) speedup and the solid line is the measured speedup. For this particular example the speedup is quite good, with only a slight decrease in efficiency as extra processors are added.
The gap between the measured speedup and perfect speedup is the overhead associated with running this problem. The efficiency of the solution ranges from a maximum of 0.98 to a minimum of 0.93, with, in general, the efficiency decreasing as the number of processors working on the problem increases. A drop in efficiency is expected, since the administrative overhead associated with adding and managing processors will increase as more threads of control compete for access to common system management resources, such as the scheduling queues and message passing services.

FIGURE 6.2 Mandelbrot set speedup. (Speedup vs. processors, 1 to 20 processors.)

6.2.2 Bayesian Search

Another problem that demonstrates the use of Raven's parallel constructs is an application based on the work of Poole [84]. The particular example consists of performing fault diagnosis on a multiple-bit adder constructed from a sequence of 1-bit adder units.

6.2.2.1 Problem Description

A subsection of a multiple-bit adder circuit (also known as a ripple-carry adder) is shown in Figure 6.3. This figure shows two connected 1-bit adders; each adder consists of four gates, each with two inputs and one output. Each gate in this circuit can fail, and the purpose of this application is, given a known input and an incorrect output, to determine the most likely failure scenario for this circuit that produced the observed results.

FIGURE 6.3 Subsection of multiple-bit adder. (The two adder stages take inputs Input1(N), Input2(N) and Input1(N+1), Input2(N+1).)

Each gate in the circuit has a probability of functioning in a particular way. The operational states that a gate can be in are:

1. The gate works correctly with probability P.
2. The gate is stuck and always returns a 1 with probability N.
3. The gate is stuck and always returns a 0 with probability N.
4. The gate fails in an unknown way and returns a 0 with probability U/2.
5. The gate fails in an unknown way and returns a 1 with probability U/2.

Given these possible operational modes, the task is to determine the most likely failure scenario given a particular set of inputs and an observed error output. The total number of possible failure scenarios is defined by:

    sum, for i = 1 to C, of (C choose i) * F^i

where C is the total number of components of interest and F is the number of ways a component can fail. In this example all the components have the same number and type of failure modes. Each element of the summation represents the number of failures for a particular grouping. For example, when i is 1 the number of one-component failures is determined, when i is 2 it is the number of two-component failures, and so on. As is evident from the formula, the number of failure scenarios grows quickly as a function of the number of components. For a 128-bit adder with 640 components and 4 failure scenarios per component, the total number of failures that need to be checked is greater than 2.9 x 10^89. Obviously, it is not feasible to check all these possibilities, and some strategy is required to test the most likely candidates.
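If, as the per-gate probabilities above suggest, gates are assumed to fail independently of one another, the prior probability of a complete failure scenario is simply the product of the per-gate state probabilities. A small illustration in C follows; the state encoding and the probability values are invented for the example, and in the Raven implementation this calculation is performed by the adder unit's computeProbability() method.

    enum gate_state { GATE_OK, STUCK_AT_1, STUCK_AT_0, UNKNOWN_0, UNKNOWN_1 };

    /* Example values only; they must satisfy P + 2N + U = 1. */
    static const double state_prob[] = {
        0.999,     /* works correctly:            P   */
        0.0004,    /* stuck at 1:                 N   */
        0.0004,    /* stuck at 0:                 N   */
        0.0001,    /* unknown failure, returns 0: U/2 */
        0.0001     /* unknown failure, returns 1: U/2 */
    };

    double scenario_probability(const enum gate_state *gates, int ngates)
    {
        double p = 1.0;
        for (int i = 0; i < ngates; i++)
            p *= state_prob[gates[i]];
        return p;
    }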
Since it is impossible to check all the failure scenarios, a central server approach is used to direct the search into the areas determined to be the most likely to produce useful results. Basically, the central server is responsible for determining the next failure scenario to test, based on criteria selected by the programmer. The result is that the implemented solution has two major components: one is the search space manager, and the other is the worker object that does the actual testing of a failure scenario. This approach has the advantage that it is easy to modify the search strategy by simply changing the search space generation object. It also lends itself well to parallelization, as multiple error testing objects can examine different error scenarios concurrently.

Additionally, this solution relies heavily upon the object-oriented approach of using building blocks to construct applications. In the implemented solution, a ripple-carry adder is composed of a collection of identical 1-bit adder circuits connected to form a complete adder unit. Figure 6.3 contains two 1-bit adder circuits. In the Raven implementation each gate within the unit is also an object. A circuit is then constructed by creating gate objects; the gate objects have methods that are called to indicate what objects the outputs need to be directed to. To simplify the construction of the adder circuit, a further grouping of gates into 1-bit adder objects is performed, and these adder units are finally connected together to form the completed adder circuit. Each Worker object creates an instance of the adder circuit that needs testing. The worker has complete control over the circuit and can specify the failure behavior of each gate.

Figure 6.4 shows the Raven code that forms the main body of a Worker object. The execute() method uses the instance data iwords1 and iwords2, which are the two input numbers to be added, with owords being the output number being looked for. Within the execute() method, the Worker object first creates an Array of AdderStates. The information in this array specifies the operational state of each gate in the adder circuit. The Worker then gets its first failure scenario, and while there are failure scenarios the following actions are performed:

1. Set the operational status of the gates to those required to check this failure scenario.
2. Add the two input numbers together, getting a result.
3. Compute the probability of this failure scenario.
4. Check the returned result to see if it indicates that this failure scenario could have produced the desired results.
5. Report the outcome of the scenario test, along with the probability of the outcome.
6. Get a new failure scenario.

behavior execute {
    var fail : Array[AdderState];   // State each 1-bit adder unit is in
    var res : Array[Int];           // The result
    var prob : Float;               // The probability of the result

    fail = Array[AdderState].new(0);
    fail = state.getFailureScenario(fail);
    while (fail.atGet(0).unit < size) {     // unit is a complete adder unit
        unit.setOperationalStatus(fail);
        res = unit.add(iwords1, iwords2);   // iwords{1,2}, instance data
        prob = unit.computeProbability();
        if (compare(res, owords) == EQUAL) {
            state.reportMatch(fail, prob);
        } else {
            state.reportNoMatch(fail, prob);
        }
        fail = state.getFailureScenario(fail);
    }
}

FIGURE 6.4 The main body of the execute method in the Worker class.

The actual implementation consists of one or more Worker objects querying a central object to get new adder unit states to test. Figure 6.5 shows the implementation code for the main body of the sequential and parallel implementations of this application. Both implementations require code to instantiate the StateGenerator object, which determines the failure scenarios to examine. The parallel implementation requires extra code to specify the companions and their number.
The StateGenerator object also needs to be created with the controlled property to ensure the consistency of the object in case multiple workers attempt to access it simultaneously. The parallel implementation also demonstrates how an application can explicitly control parallelism by explicitly starting the number of desired worker processes.

// Sequential implementation
behavior start {
    var stg : StateGenerator;
    var w : Worker;

    stg = StateGenerator.new();
    w = Worker.new(stg);
    w.execute();
}

// Parallel implementation
behavior start {
    var stg : StateGenerator;
    var w : Worker;
    var worker_count : Int;

    NextVarArg;
    worker_count = NextVarArg.strToInt();
    stg = StateGenerator.pnew(CONTROLLED);
    while (worker_count > 0) {
        !{ Worker.new(stg).execute() }!.start();
        worker_count--;
    }
}

FIGURE 6.5 Sequential and parallel control portions of Bayesian search.

6.2.2.2 Performance Results

Figure 6.6 is a graph of the speedup obtained for examining part of the error space for the 128-bit adder described earlier. The speedup remains close to ideal until about 11 processors and then begins to tail off. The efficiency of the processors remains above 0.85 for the first 15 processors and then begins to level out and even drop off. In particular, an unexpectedly large dip in the speedup occurs in the 16 to 19 processor range.

FIGURE 6.6 Speedup for Bayesian search space. (Speedup vs. processors.)

In many respects, this problem has a structure similar to the Mandelbrot set problem. A similar speedup graph would, therefore, be expected. A comparison of Figure 6.6 to Figure 6.2 indicates that the graphs are not similar and start to diverge at around 9 processors. A further examination of the implementation and the Raven system was undertaken to try to explain this discrepancy.

Since the checking of a failure scenario is a faster operation than the corresponding computation in the Mandelbrot set, it was decided that perhaps the object providing the failure scenarios was being overwhelmed and could not respond quickly enough. If that were the case, then increasing the length of time required to check a failure scenario would drive the speedup curve considerably closer to the perfect speedup line. This possibility was checked by artificially doubling the failure scenario checking time. The resulting measurements showed no significant change in the reported speedups. This implied that the failure scenario server was not being overwhelmed. This conclusion was not entirely unexpected, since there was a dramatic drop-off in speedup as opposed to a gradual leveling out of the speedup.

Another possibility was that some of the worker objects were checking significantly more failure scenarios than other objects. In particular, maybe all the checking was being done by the first workers started and none of the work was being done by the last workers started. Unlike the Mandelbrot set computation, each worker object has a lengthy initialization phase to go through before the failure checking phase can start. This is because each worker object creates a model of the adder circuit to work with. The construction of this circuit object is memory intensive, so any unfairness in the memory allocation scheme could favour some objects. This would result in some objects starting work on failure scenarios sooner.
To check this, the worker objects were instrumented to count how many failure scenarios they checked. Although it was true that the work load was not perfectly balanced, the discrepancy was insufficient to account for the dramatic dropoff in speedup.

The results of the two previous experiments indicated that the drop in speedup was not related to the checking of the failure scenarios, but to the initial construction of the adder circuit. The building of the adder circuit involves the creation of many objects, thereby putting a heavy strain on the memory management system. The elapsed time for the purposes of measuring speedup spans the time from when all worker processes are created until they have completed executing. This time includes the time to construct the circuit object. If there was contention in the memory allocation during the initialization phase, then as more processors were used the contention would become more pronounced.

To test this hypothesis, the worker objects were modified to first initialize themselves and then to synchronize with one another before proceeding. Figure 6.7 is the code for the Barrier class that was used to perform the synchronization, and it demonstrates the usefulness of delayed result. The Barrier class is an application-level controlled class with a constructor and two methods, and it is designed to perform a multi-way synchronization between threads of control. The constructor takes as its argument the number of threads to be synchronized. To use this class, the main body of the application creates the Barrier object and passes the object ID of the Barrier object to each worker. Each worker, which is a thread, invokes the synchronize() method once it has created its circuit object and is ready to start checking failure scenarios. The synchronize method determines whether or not this invocation is from the last worker. If it is not, then the me value for the thread is stored in an array, and a delayed result is initiated by executing the leave statement. This results in the thread's execution being suspended. When the last worker invokes synchronize, all the suspended threads synchronizing on this Barrier are released through calls to result. When a worker has finished all its tasks, it invokes finished() to indicate it has completed. When all the workers have finished, the elapsed time since the workers were synchronized is computed and displayed.

class Barrier controlled {
    waitfors, index, done_thr : Int;
    blckd_thr : Array[Thread];
    behavior synchronize();
    behavior finished();
}

constructor {
    blckd_thr = Array[Thread].new(0);
    index = 0; waitfors = 0; done_thr = 0;
}

behavior finished {
    if (++finished_procs == waitfors) {
        // code to print elapsed time
    }
}

behavior synchronize {
    var thread : Thread;

    if (index == waitfors - 1) {
        // unblock the threads
        // code to get start time
        for (index--; index >= 0; index--) {
            thread = blckd_thr.atGet(index);
            result [thread];
        }
    } else {                        // need to wait
        blckd_thr.atPut(index++, me);
        leave;
    }
}

FIGURE 6.7 Barrier class to synchronize and time a collection of Threads.

The measured speedup that results when all the worker processes synchronize is shown in Figure 6.8. In this figure the new speedup values are superimposed on the old values. The speedup has improved significantly and is consistent with the expected speedups. With the modified measurement technique, the efficiency of the computation ranges from 0.92 to 0.99, with the exception of the case when 20 processors are used. With 20 processors, efficiency drops to 0.84.
The drop in efficiency is not unexpected, since at 20 processors the host system is starting to become over-committed. This overcommitment occurs both at the application level and at the total system level. At the application level, 20 worker objects plus any Raven system management processes are running. The result is that more application threads could be ready to run than there are processors. When this happens, the speedup will drop. Conceptually, a worker object must share a processor with the administrative threads instead of having the whole processor to itself. Given the structure of the program, more administrative overhead is present in this problem than in the Mandelbrot set. In addition, when the application is using all 20 processors, there are no other processors available to do processing for the underlying host operating system. This extraneous processing includes other users logging in, servicing network traffic, or managing memory and disk requests. Servicing these other requests decreases the efficiency of this application.

FIGURE 6.8 Speedup vs. processors using a modified measuring technique.

The new measurements support the hypothesis that the dropoff in speedup is related to the initialization phase of the objects. A further examination of the figures reveals that in the 10 to 20 processor range the initialization time of the worker object represents between 15% and 30% of the elapsed time. Almost all the memory allocation requests were made during the construction of the adder object. This suggests that some aspect of the memory management system is causing a bottleneck. This point will be elaborated on further in Section 6.3.

6.2.3 Prime Number Generation

Another application programmed in Raven is a search for prime numbers. Since restricting the search for prime numbers to the word size of the machine is not particularly computationally intensive, a new class, LongInteger, was programmed. This class manipulates integers too large to be represented in the natural word size of the machine. The class provides the basic arithmetic integer functions add(), subtract(), multiply(), divide(), less_than(), greater_than(), equal(), and negate(). These methods are supplemented with the isPrime() method to test the primality of a number. (The algorithms for these functions are taken from Knuth [62], with Newton's method, which is used in isPrime(), from [24].)

6.2.3.1 Problem Description

The basic problem is to start from a specific number and then search the next X numbers looking for primes. A simple algorithm is used to test a number, N, for primality. Essentially, all the numbers from 2 to the square root of N are tested to see if they will factor into N. Only minor optimizations to the checking routine, such as checking only the odd numbers as factors, are made. With a method to check for primeness, a simple way to look for prime numbers is to loop generating new numbers. The new numbers can then be checked for primality.
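The test just described is ordinary trial division. A sketch in C, using native integers for clarity, is given below; the Raven implementation performs the same steps on LongInteger objects and uses Newton's method to obtain the square root, whereas the sketch simply compares f * f with n.

    int is_prime(unsigned long n)
    {
        if (n < 2)  return 0;
        if (n == 2) return 1;
        if (n % 2 == 0) return 0;               /* only odd factors need to be tried */

        for (unsigned long f = 3; f * f <= n; f += 2)
            if (n % f == 0)
                return 0;                       /* f divides n, so n is not prime */
        return 1;
    }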
Example code for the surrounding search loop is shown in Figure 6.9. The left column provides a sequential implementation and the right column a parallel implementation. Extending the sequential version to operate in a parallel environment is a simple matter of turning the checkPrime() invocation into a companion and invoking the start method on the companion.

    // Sequential implementation
    behavior start {
        var n : LongInteger;
        var i : Int;
        var range : LongInteger;
        var max : Int;

        NextVarArg;
        max = NextVarArg.strToInt() / 2;
        n = LongInteger.new(100001);
        range = LongInteger.new(2);
        for (i = 1; i <= max; n = n.add(range))
            checkPrime(n, range);
    }

    // Parallel implementation
    behavior start {
        var n : LongInteger;
        var i : Int;
        var range : LongInteger;
        var max : Int;

        NextVarArg;
        max = NextVarArg.strToInt() / 2;
        n = LongInteger.new(100001);
        range = LongInteger.new(20);
        for (i = 1; i <= max; n = n.add(range))
            !{ checkPrime(n, range) }!.start();
    }

FIGURE 6.9 Sequential and parallel control portions of the prime number search.

Both solutions perform the following steps:

1. Use the command line argument to determine how many numbers to check (NextVarArg).
2. Start the search at 100,001.
3. The variable range is set to indicate the increment to use when computing the next candidate number to pass to checkPrime(). The method checkPrime() checks the numbers from n to (n + range - 1) for primality and reports the prime numbers.
4. All the numbers from 100,001 to (100,001 + n) are checked for primality.

This solution is not general purpose in that it always starts looking for prime numbers from the same integer. The application was coded in this fashion since the goal was to concentrate on demonstrating Raven's parallel constructs and not on providing a general purpose prime number tester and generator. The checkPrime() method is somewhat unusual in that it takes a range parameter. In the initial implementation checkPrime() tested only one number. The modification to checkPrime() was undertaken to improve the performance of the parallel implementation, the goal being to provide each companion with more work so that the cost of creating a companion would be amortized over more tests for primeness. As the numbers being tested get larger, the need to increase the work load of a companion decreases; just performing the basic tests will be an adequate work load. This last approach would have been preferred; however, verifying the results of the application is difficult and reasonable running times for measurement purposes are difficult to achieve.

6.2.3.2 Performance Results

Figure 6.10 shows the speedup achieved when searching for prime numbers.

FIGURE 6.10 Speedup vs. processors for prime number searching.

It can be seen that this implementation of the prime number searcher does not achieve the sorts of speedups obtained by the Mandelbrot computation and the Bayesian search routine. At 20 processors the speedup is 7.25 and the marginal increase in performance is small. In going from 18 to 20 processors the speedup only increases by 0.25. In both the Mandelbrot computation and the Bayesian computation the efficiency remains high and generally above 0.90. In this prime number searching example, that is not the case. The efficiency for two processors has already dropped to 0.83, well below that of the other two examples. Figure 6.11 is a graph of efficiency vs. processors.

FIGURE 6.11 Efficiency vs. processors for the prime number searching routine.

The efficiency in this application drops rapidly and is on a constant downward trend. When 20 processors are in use the efficiency has dropped to 0.36. Both Figures 6.10 and 6.11 indicate that the incremental performance increase, especially after 6 processors, is very small.
This result is somewhat unexpected, as the problem has all the properties that indicate it should parallelize well.

Further analysis of the performance results, and additional experiments and instrumenting of the Raven system, indicated that the problem was with the memory management system. In performing the arithmetic operations, new objects representing the current and intermediate results of a calculation are constantly being created. These objects are discarded almost immediately, once the next step in the computation is performed. With this rapid consumption of memory, the free memory in the allocation pools is quickly used up, thereby forcing garbage collections. Each time a processor is added, the memory allocation rate of the system goes up. When one processor is in use there are approximately 1,500 memory allocations/second, and when 20 processors are active this has risen to 10,900 allocations/second. The allocation rate, when normalized by the number of active threads, has dropped from 1,500 allocations/second to slightly less than 550 allocations/second. Since each one of the threads is independent, the expectation is that each thread would perform about 1,500 allocations/second.

This allocation rate consumes memory extremely quickly, and causes the system to garbage collect frequently. Although the garbage collection times have been subtracted from the elapsed times for the purpose of computing the speedup, it is still instructive to analyze what is happening. The amount of time spent garbage collecting increases from about 10% to nearly 50% of the elapsed time by the time 20 processors are being used. With this much time spent garbage collecting, it is clear that garbage collection is a bottleneck restricting parallelism. Adding more heaps to reduce the frequency of collections is not a practical solution because of the limited amount of memory available to the system. The proper solution is to change the memory management system to operate efficiently in a high allocation and reclamation environment.

6.2.4 A Distributed Version of the Mandelbrot Computation

A parallel version of the Mandelbrot computation was initially developed in a distributed processing environment. This allowed the DisplayManager and Worker sides of the program to be tested and debugged separately. The main body of the code for the distributed- and shared-memory versions of the implementation is identical, except for less than 10 lines of well-defined code needed by the distributed implementation to get the system configuration. As Raven matures and configuration information becomes more fully integrated into Raven, these extra lines of code will disappear.

To provide some indication of the types of speedup possible in a distributed environment, the Mandelbrot computation was run on a Sun ELC to establish a baseline. Several other runs were then performed using a variety of different machines.
Table 6.1 enumerates the types of machines used in the distributed computation and attempts to provide some indication of the relative performance of the machines by reporting the processor speed and various SPEC [106] performance benchmarks.

TABLE 6.1 SPEC results for selected processors

    Machine                 MHz   SPEC92 Integer   SPEC92 Floating pt   SPECmark89
    Sun SPARCstation LX      50        26.4               21                  -
    Sun SPARCstation ELC     33        18.2               17.9              20.3
    Sun SPARCstation IPC     25        13.8               11.1              13.5
    Sun SPARCstation 2       40        21.8               22.8              25.0
    Sun SPARCstation SLC     20          -                  -                8.6
    Sun SPARCstation 1       20          -                  -               10.0
    Sun SS10/41              40        53.2               67.8              71.2
    Sun SS10/51              50        65.2               83.0                -
    MIPS 3260                25          -                  -               17.3

The measured elapsed times for the Mandelbrot computation are shown in Table 6.2.

TABLE 6.2 Elapsed time and speedup for distributed Mandelbrot computation.

    Processors   Elapsed time
    used         (seconds)     Speedup   Processor types used
     1              2590          -      1 ELC
     5               525        4.93     5 ELCs
    10               280        9.25     7 ELCs, 2 IPCs, 1 SS2
    20               160       16.19     7 ELCs, 4 IPCs, 5 SS2, 2 SLCs
    29               115       22.59     7 ELCs, 4 IPCs, 5 SS2, 4 SLCs, 2 LXs, 3 SS1,
                                         1 SS10/51, 1 SS10/41, 2 MIPS 3260s

Making direct comparisons is difficult given the wide variety of machines used for these tests and the inability to control individual machine and network loading. However, the results demonstrate that significant speedup is possible for a distributed computation. For the 1, 5, 10, and 20 processor runs, the typical processor utilization of a client was 95% to 99%. For the run with 29 processors, several of the machines were in general use, with the result that some of the client processes received only 30% of a processor. It should be noted that when 29 processors were in use, the application was operating in a heterogeneous machine environment utilizing Sun and MIPS processors. An indication of the discrepancy between the processor speeds can be obtained by examining Figure 6.12.

FIGURE 6.12 Partial results of the Mandelbrot computation performed in a distributed environment.

This figure shows a partially completed Mandelbrot set calculation when 29 processors were in use. The clear areas are regions where no results have been returned. How long a particular calculation will take depends upon the processor speed, processor loading, and the difficulty of the calculation. Generally, the more black in a line the more difficult the computation, and the longer it takes. If the processor speeds and loading were identical, there would not be uncompleted gaps between large completed areas.

6.2.5 A Distributed Mail Application

As a further demonstration of Raven's ability to support programming in a distributed environment, a distributed mail application was programmed by a colleague. This application consists of two major classes:

• A server object that accepts messages and allows client objects to retrieve messages.
• A client object which acts as a user agent and allows the user to perform administration functions and to compose, send, accept and display messages.

Figure 6.13 depicts the message system and some of the interactions between the various components. In this figure, the round-cornered rectangles are client objects; the rectangles represent mail messages; the oval represents the mail server; and arrows indicate a message transfer and its direction. The client objects and server object each exist in their own Raven environment, thereby making this a distributed application. The mail server object plays a central role in this application by acting as a repository for message exchanges between clients.
To illustrate the mail server application, consider the following user-initiated actions and the resulting interactions within the mail system:

• A user composes a message. This is done within the client object and does not involve the mail server.
• The user requests the composed message be sent. This results in the client object invoking the postMsg() method. The actual message is an argument to the method.
• The mail server object accepts the message and stores it. The method returns to the client and the client object can process its next request.
• A user issues the accept messages command to the client object. The client object invokes the retrieveMsgs() method on the server object.
• The server object checks to determine whether or not there are any messages waiting for this user. If there are, the messages are returned as the result. The server also deletes its references to the returned mail objects.
• The client object receives the messages as the result of its invocation and the user can then issue further commands to display or delete the messages.

FIGURE 6.13 A distributed mail application.

The programming of this application used several features of the Raven programming environment. The mail server made use of a name server. When the server object starts, it registers itself with the name server. When clients start, they use the name server to obtain the object ID of the mail server. The mail server is a central server and multiple clients can invoke on it simultaneously. Extra programming to deal with concurrent accesses to the server is avoided by creating the mail server as a controlled object. The store and forward nature of a mail system is captured by designating message parameters and returned results to be copied. For example, when a message is sent to the mail server, the complete message is copied and stored by the server. When messages are retrieved from the server, the returned messages are all copied to the receiving object.

The mail server is a good example of code that makes use of early result. When both the postMsg() and retrieveMsgs() methods are invoked on the mail server, there is the opportunity to perform an early result. An abbreviated version of the message posting and retrieving code is shown in Figure 6.14.

    // Method to post a message
    behavior postMsg {
        // local variable declarations and
        // initial message processing code ...

        // check if sender allowed to post messages
        if (!regdb.isRegistered(user))
            return Result.new(RC_NOT_REG);

        // Early result
        result Result.new(RC_NOERROR);

        // Code to go through each intended recipient
        // and perform the actions needed to deliver the
        // message, and to notify the sender of any errors
    }

    // Method to retrieve messages
    behavior retrieveMsgs {
        // local variable declarations and code
        // to check for messages and construct
        // a response ...

        // Early result of responses
        result rc;

        // code to perform cleanup processing,
        // message delivery notification, etc.
    }

FIGURE 6.14 Example early result usage in distributed mail application.
If the postMsg () method detects anyerrors, such as an invalid recipient address, it composes a message detailing the error and sendsit to the user. When messages are retrieved, the mail server object similarly uses an early resultonce the appropriate response has been determined. The mail server is then free to performCHAPTER 6: Performance 208post-message-retrieval processing, which includes actions such as message confirmation processing. In both of these methods, the early result minimizes the amount of time the client iswaiting for the result from the mail server. Additionally, there is substantial opportunity forcomputational overlap between the client and server once the early result is performed.6.26 Miscellaneous ApplicationsThe applications that have been presented to this point have all concentrated on using the parallel constructs in Raven to make an application run faster. However, the constructs for supporting parallel and distributed computation are not always used in situations where an increasein performance is the sole goal. A problem may be coded in a parallel way because that is itsmost natural representation, or an application may be made distributed because the resourcesthe application uses are physically distributed.A colleague used Raven to build a simulation of an ATM (asynchronous transfer mode)network. The ATM simulation had several components running in parallel. Raven’s companions greatly simplified the task of specifying this parallelism. Again, the controlled propertyplayed a significant role in simplifying the programming of the simulation. With Raven’s built-in support for object access control, the application programmer did not have to develop anduse a set of access control primitives to manage access to shared objects.Raven’s delayed result has proven to be very useful in the construction of synchronizationobjects. The InvokeStream class, used to provide Raven streams, and the synchronization technique used in determining what the problem was in the Bayesian search space program ofSection 6.2.2, both use delayed result. In both cases threads perform an invocation to an objectthat will decide whether or not execution should continue. If it should, the method returns, andnothing else is done. If the thread should not continue, then a delayed result is done. When conditions change to allow the execution to continue, the thread detecting this issues the appropriate result statement, thereby starting the suspended thread. The controlled property is also usedon these synchronization objects.CHAPTER 6: Performance 209Due to the difficulty of getting a consistent work environment with respect to the numberand type of processors, only one set of performance results have been reported for applicationsrunning in a distributed environment. However, both the Mandeibrot set and the prime numbersearching application were initially programmed and tested in a distributed environment. Theseproblems were later converted to run in a shared-memory environment by simply merging theclient and server so that they were compiled as one program instead of two.6.3 Performance Analysis OverviewThe dominant conclusion from this section is that memory management is an important issue.Note that there is only so much memory within a computer system, and, as more processors areadded to work on the problem, the machine becomes effectively faster and it uses more memory and other resources, like backplane bandwidth. 
One must be aware of the issues associated with the scaling of system resource availability relative to the number of processors in use.

During the execution of the user-specified part of an application, the amount of parallelism varies. At some points the application will be sequential, while at others it will be fully parallel. To get the maximum speedup, the fully parallel portions of the application need to be maximized, and their computational component should dominate the elapsed time. In keeping with the approach that changes to convert a sequential program to a parallel one should be minimized, the majority of the effort will go to those parts of the application that benefit the most. Typically this means concentrating the effort on the main part of the application and not on the initialization phases. By doing this, the start-up part of the application will remain sequential even though there is some potential for parallelism. During the transition between the start-up code and the main body of the application responsible for the work, the constructs for parallelism are used. The result is that the speedup for an application ramps up to its maximum amount of parallelism, and on completion there is a similar ramping down as the parallel components terminate. Depending upon the nature of the computation, some execution streams will terminate while others continue, and this lowers the speedup. For a long-running application, small regions that lack maximum parallelism are not significant to the elapsed time of the overall computation, but in short computations they become noticeable.

Although the garbage collection stop times have been factored out, the garbage collection pauses still have a detrimental effect on performance. Each memory pool has a data lock associated with it. When a garbage collection takes place, the lock is held on that pool much longer than if a straight memory allocation were done. The memory pools are used in a round-robin fashion on a first-come first-served basis. The garbage collection routine essentially acts as a barrier that results in memory requests stacking up at that pool. As soon as the garbage collection starts acting as a barrier, the parallelism in the application decreases because of the stacking effect. This was observed in the prime number example. With this garbage collector, the problem can be partly addressed by having more pools, but this carries with it the increased administrative costs of managing more memory pools. Again, a parallel memory allocation environment should alleviate these sorts of problems.
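To make the stacking effect concrete, the following is a minimal sketch, in C with POSIX threads, of the allocator structure described above: requests are spread over the pools in round-robin order, each pool is protected by its own lock, and a collection runs while that lock is held. None of this is Raven's actual allocator code; the pool_init()/raven_alloc() names and the free-slot counter are purely illustrative stand-ins.

    #include <pthread.h>
    #include <stdlib.h>

    #define NPOOLS    4
    #define POOL_SIZE 1024                 /* allocations a pool satisfies between collections */

    struct pool {
        pthread_mutex_t lock;              /* per-pool data lock                  */
        int             free_slots;        /* stand-in for the pool's free memory */
    };

    static struct pool     pools[NPOOLS];
    static unsigned        next_pool;      /* advanced round-robin on each request */
    static pthread_mutex_t rr_lock = PTHREAD_MUTEX_INITIALIZER;

    void pool_init(void)
    {
        for (int i = 0; i < NPOOLS; i++) {
            pthread_mutex_init(&pools[i].lock, NULL);
            pools[i].free_slots = POOL_SIZE;
        }
    }

    /* Stand-in for a garbage collection of one pool.  It runs with the pool's
     * lock held, so every later request routed to this pool waits behind it.  */
    static void collect(struct pool *p)
    {
        p->free_slots = POOL_SIZE;         /* a real collector does far more work here */
    }

    void *raven_alloc(size_t size)
    {
        pthread_mutex_lock(&rr_lock);      /* pick the next pool, first-come first-served */
        struct pool *p = &pools[next_pool++ % NPOOLS];
        pthread_mutex_unlock(&rr_lock);

        pthread_mutex_lock(&p->lock);      /* brief hold for an ordinary allocation...      */
        if (p->free_slots == 0)
            collect(p);                    /* ...but a long one when a collection is needed */
        p->free_slots--;
        pthread_mutex_unlock(&p->lock);

        return malloc(size);               /* placeholder for handing out memory from the pool */
    }

The point of the sketch is only that once collect() begins on a pool, the round-robin policy keeps directing a share of all new requests at that locked pool, so they queue behind the collection; this is the barrier-like behaviour described above.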
6.4 Summary

To demonstrate Raven's parallel constructs, several parallel applications were programmed. For the Mandelbrot set and Bayesian search space problems the efficiency was consistently over 0.90. The application that searched for prime numbers was less efficient, but this is attributable to the large number of garbage collections caused by the application's high rate of memory consumption. Several other problems of a more distributed nature were also described.

CHAPTER 7 Conclusions and Future Research

The work presented in this thesis revolves around the Raven system and language, and the modifications made to this environment to support parallel and distributed processing. This chapter will conclude the thesis by presenting a summary of the major results and suggesting some future research directions.

7.1 Summary of Results

This thesis has two major components. One component is concerned with the support for parallel and distributed programming and how it manifests itself at the language level. The second component is concerned with the underlying system services required to effectively support applications in a parallel and distributed environment.

7.1.1 Parallelism at the Language Level

This thesis describes the parallel and distributed aspects of the Raven language and system. Work from a variety of areas has been brought together, extended, and augmented with new features to provide a programming system that supports the development of parallel and distributed applications.

It was recognized that to support parallel and distributed programming there needed to be a convenient way to express parallelism, issues associated with providing concurrency control needed to be addressed, and some level of object transparency was required. Raven addressed the problem of expressing parallelism by introducing support for class-based and user-based parallelism.

The resulting model of parallelism is based on the observation that objects, with their encapsulated data and well-defined method interfaces, are ideal candidates for parallelism. In conventional systems, when writing a parallel program, the programmer must decompose the application into processes as well as functions and procedures, thereby complicating the task of writing the program. Raven reduces this problem by identifying the method invocation point as the location where parallelism occurs. Parallelism at a method invocation can result in two ways. In one way, the method being invoked can return a result while continuing to execute: this is class-based parallelism. The second way to achieve parallelism is to have the invoking object indicate that the invocation is to proceed in parallel with the initiating thread of execution: this is called user-based parallelism.

Class-based and user-based parallelism form the foundation of Raven's support for parallelism. Certificates and companions, combined with their associated control methods, are the language-level objects that specify and manage user-based parallelism. Class-based parallelism is supported through the constructs of early and delayed result. These constructs were developed by observing that the return statement performs two functions: it returns a result to the calling function, and it terminates execution flow in the current function.
Properties offer additional advantages in that only the objects using a property incur the extra execution overhead of managingthe property; also, the user does not have to explicitly make function calls to access the desiredservice. Finally, properties provide a uniform way of supplying operating system services independent of how those services are realized in the host environment.Raven’s locking support, which is supplied through the controlled property, supportsmultiple readers or a single writer in an effort to maximize concurrency. To be effective, thecontrolled property must not restrict the way an object can be used. In particular, objects muststill be able to invoke on themselves, and proper locking must be maintained across machineboundaries. These problems are addressed through the introduction of session IDs, which canbe used to track invocation sequences both locally and remotely.By making parallelism easy to use, it is easily possible to create more parallelism than theexecution environment can handle. Other systems often handle this problem by making thecontrol of parallelism the responsibility of the programmer, or by providing information atcompile time that specifies the amount of parallelism supported by the target environment.Raven considers the controlling of parallelism to be a system configuration issue. At runtimethe system is configured to specify the level of parallelism supported. This approach allows thesame application to be run on machines that support different numbers of processors.CHAPTER 7: Conclusions and Future Research 214To support distributed programming, a certain level of object transparency is required. Itshould be possible for an application to treat all objects the same, regardless of whether theyare local or remote. Raven’s global object ID, combined with proxy objects to