UBC Theses and Dissertations


Service migration in a gigabit network Petrus, Margaret A.S. 1998


Service Migration in a Gigabit Network

by Margaret A.S. Petrus

B.S., Bishop's University, Canada, 1996

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in The Faculty of Graduate Studies (Department of Computer Science)

We accept this thesis as conforming to the required standard

The University of British Columbia
August 1998
© Margaret A.S. Petrus, 1998

In presenting this thesis in partial fulfillment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Computer Science
The University of British Columbia
2366 Main Mall
Vancouver, BC, Canada V6T 1Z4

Date: August 1998

Abstract

Migrating services to the network adaptor makes use of the idle processing power in the adaptor that would otherwise be wasted if it were used only for sending and receiving messages. In addition, by moving some functionality to the network adaptor, decisions can be made without involving the host. This would prevent unnecessary memory copies, I/O traversals, interruptions of the host and the associated context switches. Thus the host can use its processing power, without interruption, for more useful work.

The Emu system has made possible the complete implementation of the Emerald interpreter on the network processor. The choice of an object oriented distributed interpreter makes the applications that run on the system more modular and compact because the language takes care of all the gory details.
Emu has introduced runtime extension of the functionality on the network processor and an approach for utilizing idle network processing power. However, intelligent delegation of processing to the network processor is required, because the network processor is slower than the central processor. Communication-intensive applications which require low latency and high throughput would benefit from the interpreter on the network adaptor in the Emu system.

Emu has paved the way for other similar systems to be built. It has demonstrated that it is possible to take full advantage of the programmable NIC to increase system throughput. It has shown that with an intelligent choice of applications, idle network processing power can be utilized without sacrificing performance.

Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgements
Dedication

1 Introduction
  1.1 Motivation
  1.2 What is Service Migration?
  1.3 Problem Definition
  1.4 The Network Environment
  1.5 The Emerald Approach
  1.6 Thesis Structure

2 Related Work
  2.1 Existing Migration Systems
    2.1.1 Traditional Migration Systems
    2.1.2 Object Migration Systems/Languages
    2.1.3 Heterogeneous Migration Systems
  2.2 Moving Functionality to the Network Adaptor
    2.2.1 U-Net
    2.2.2 Hamlyn
  2.3 Other Languages Considered for the Interpreter
    2.3.1 Java - Language of the Internet
    2.3.2 TCL
    2.3.3 Existing Active Networking Systems/Languages
  2.4 Contributions of this Thesis

3 Background
  3.1 Myricom Technology and Details
    3.1.1 LANai 4 Processor
    3.1.2 Myrinet API
  3.2 Benchmarks Used for the Choice of the Host Processors
    3.2.1 Results of bw_mem-cp and hswap Benchmarks
    3.2.2 Results of lmbench Test Suite
    3.2.3 Results of netperf Benchmarks
  3.3 Layout of the Myrinet Network Testbed
    3.3.1 M2M-DUAL-SW8 Myrinet Switch
    3.3.2 M2M-PCI32C Myrinet-SAN/PCI Interface
    3.3.3 M2M-CB-05 Cables
  3.4 Language Support
    3.4.1 Emerald Primitives to Support Mobility
    3.4.2 Structure of the Emerald Interpreter
  3.5 Summary

4 Design and Implementation of Emu
  4.1 Key Issues for Moving the Emerald Interpreter to the LANai
    4.1.1 Limited Memory
    4.1.2 Library Support
    4.1.3 Distribution and Mobility
  4.2 Interface to the Host
    4.2.1 OS-Support Protocol
    4.2.2 Print/Debug Protocol
    4.2.3 Shared Memory Structure Between LANai and Host
  4.3 Communication Protocols
    4.3.1 Design of Communication Protocol in lanai-emx
    4.3.2 From LANai to the Host
    4.3.3 From LANai to the Network
    4.3.4 From Host/Network to the LANai
    4.3.5 Reading/Writing from/to the Myrinet Network
  4.4 Message Routing
    4.4.1 Data Structures Used for Routing
    4.4.2 Routing Information Protocol
  4.5 Other Issues
    4.5.1 Emerald over Myrinet vs. Emerald over IP over Myrinet
    4.5.2 Issues Concerning Hardware Resets
    4.5.3 Modifications to the host-emx
  4.6 Summary

5 Performance
  5.1 Micro Benchmarks
    5.1.1 Computation Intensive Micro Benchmarks
    5.1.2 Communication Intensive Micro Benchmarks
    5.1.3 Discussion
  5.2 Application 1: Kilroy
  5.3 Application 2: Location Service
    5.3.1 Design of the Location Service
    5.3.2 Experimental Setup
    5.3.3 Efficiency of Emu
    5.3.4 Applications that could Use the Location Service
  5.4 Discussion

6 Conclusions and Future Work
  6.1 Conclusions
  6.2 Effort Required in Creating Emu
  6.3 Future Work
    6.3.1 Memory Fragmentation Problems
    6.3.2 A More Precise Debugger for the LANai
    6.3.3 Performing GMS lookups on the LANai
    6.3.4 Support for Active Networking on the NIC
    6.3.5 Dynamic Routing Application
    6.3.6 HPJava Engine as Framework for Emu
  6.4 Final Remarks

Bibliography

List of Tables

3.1 Results of bw_mem-cp and hswap Benchmarks
3.2 Results of lmbench Test Suite
3.3 UDP-STREAM Testing Using netperf
3.4 UDP-STREAM Testing Using netperf (turning off the checksums)
3.5 TCP-STREAM Testing Using netperf
5.1 Computation Intensive Micro Benchmarks
5.2 Communication Intensive Micro Benchmarks

List of Figures

1.1 Emerald Interpreter on the Network Interface Card (NIC)
3.1 Block Diagram of the Myrinet/Host Interface
3.2 Layout of Myrinet Network
3.3 Top View of Dual 8-Port Myrinet-SAN Switch
3.4 Myrinet Cluster Connections
3.5 Emerald Virtual Machine
4.1 Memory Layout on the LANai
4.2 The Host Interface Protocol Memory Structure
4.3 Pseudocode for Handling Requests from the LANai by the Host
4.4 The MCP Shared Memory Structure
4.5 Communication Protocol of the VM on the LANai
4.6 Pseudocode for Sending Packets from LANai to Host
4.7 Pseudocode for Sending Packets from LANai to Network
4.8 Pseudocode for Processing Incoming Packets to the LANai
4.9 Reading Packets from the Myrinet Network
4.10 Writing Packets to the Myrinet Network
4.11 Data Structure to Store Information for Each Switch
4.12 switch-conn Data Structure
4.13 Myrinet Routing Protocol
5.1 Kilroy on Emu
5.2 The Locator Service Class
5.3 The Service Finder Class
5.4 The Location Service on host-emx of Emu
5.5 The Location Service on lanai-emx of Emu

Acknowledgements

A lot of people have influenced my work in various ways during my Masters Program at UBC. It would be impossible to remember all of them, but I will do my best.

First, sincere thanks to my supervisor, Norm Hutchinson, for his invaluable support throughout my work. His constant belief in me and the freedom to explore my own ideas enabled me to gain confidence. Thank you, Dr. Hutchinson, for always listening with such patience when things were not going great.

Thanks to Alistair Veitch for answering my numerous questions while installing the Myrinet network and for reviewing the first draft of my thesis. Thanks are also due to Mike Feeley who was my second reader, and to Peter Smith for his reviews and for being a wonderful friend.

I would like to thank my family for supporting me through school, particularly two very special people, Fatima and Susai Peter. Without their help, it would have been impossible for me to venture off to Canada. I also extend my thanks to Stephen Sinnappan, Celine Stephen, Regina Simon and Pascal Simon.
Other people who have helped either technically or by just being there are Marie-France Beauchesne, Ian Cavers, Christoph Kern, Dwight Makaroff, Holly Mitchell, Gerald Neufeld, Kalpana Ramanujam, Saravanan Sanniyasi, Pradosh Sivadoss, Robert van Hulst, Dimitri Voulioris, Alan Wagner and Marcelo Walter.

Finally I would like to thank my parents and my husband Paul for being there for me through this.

Margaret A.S. Petrus

The University of British Columbia
August 1998

To my Mom and Dad.

Chapter 1

Introduction

1.1 Motivation

The changing technology trends for both local area networks with faster processors, and high speed networks with low latency, make the lagging I/O bus the bottleneck in achieving better throughput in network systems. With the advent of high speed networks, this I/O bottleneck is a serious concern which limits the full potential of the network. Protocol processing on the network processor, rather than on the host's CPU, can avoid the unnecessary use of the I/O bus which limits the speed of processing in the network. The network interface cards (NICs) for high speed networks should have processing capabilities to support such functionality.

The new NICs have processors that remain idle except when executing code to control the flow of data between the host and the network. For example, the Myrinet [BCF+95] NIC contains a general purpose processor which can execute user programs. This is a unique feature because it allows programming the NIC of a high speed gigabit network. This feature could be exploited by moving services to the NIC, which would avoid host overhead and I/O transfers.

Migrating services to the network adaptor makes use of the idle processing power in the adaptor that would otherwise be wasted if it were used only for sending and receiving messages. In addition, by moving some functionality to the network adaptor, decisions can be made without involving the host.
This would prevent unnecessary memory copies, I/O traversals, interruptions of the host and the associated context switches. Thus the host can use its processing power, without interruption, for more useful work.

There are several network intensive services that could benefit from this exploitation of the network processor. Two such examples are dynamic routing in an active networking scenario and global memory management in a clustered environment. In the latter case, low latency plays a major role in performance throughput.

Service migration could be extended to building active networking [TSS+97, WT96] applications. Active networking is moving executable code in the network rather than just moving passive data. The active code gets executed at routers, which support such services, as it moves in the network. For example, one of the applications that would benefit from this service on the network interface card is dynamic routing, i.e., routing decisions made by code that migrates among the routers. A virtual machine supporting active networking and running within the network adaptor would help the Internet evolve dynamically, without involving the host or constant manual intervention, during failures such as when a router goes down. More important is the introduction of new routing functionality, where code supporting new routing protocols can be injected into the network dynamically.

Another area that will benefit from this approach is the global memory system (GMS) [Fee96] for a network of workstations. In GMS, global network-wide page replacement decisions are made. When a page fault occurs, the replaced page is sent to global memory rather than to disk, and the globally least valuable page is discarded from memory, sending it back to disk if necessary. Page faults can thus sometimes be satisfied from the memory of a remote processor. By migrating some code to the network adaptor, the necessary pages can be DMAed from the remote storing node's memory without OS intervention. This would result both in faster memory access and lower overhead, just by migrating a small part of the service to the network interface card.

1.2 What is Service Migration?

Service migration involves moving some process, which may be currently active, to some other processor in the network. This technique is adopted for various reasons, the most common of which are to achieve speedup by migrating to faster processors, and for load balancing. Utilizing idle or lightly loaded machines leads to faster completion of the task, while at the same time taking advantage of available idle computational power. Another advantage is reducing network bandwidth consumption by moving two processes that are communicating heavily, but are on different machines, onto the same machine [Smi97].

Research on service migration has led to many research-oriented or commercial systems [Che88, Dou91, BHJ+87]. Emu, the system implemented as part of this thesis, falls under object migration systems [BHJ+87, LJP93]. However, unlike earlier object migration systems, objects in Emu may also execute in the network adaptor rather than just on the central processors. This is achieved by running a fully functional interpreter in the network adaptor. It is believed to be the first such system running within a network interface. A preview of the different migration systems is given in the next chapter, with examples from homogeneous, heterogeneous and object migration systems.

1.3 Problem Definition

Emu is a system composed of two distributed object oriented virtual machines on a single host, one executing on the host processor and the other on the Myrinet NIC, which supports a general purpose programmable processor.
Moving an existing interpreter to the Myrinet NIC, to be executable by the bare network processor, is quite unlike porting software for a central processor of a different architecture. There are many challenges involved, such as the RAM limit of the network adaptor, the lagging speed of the network processor when compared to the central processor, the programming environment available, the handling of system calls due to the lack of an OS on the network adaptor, etc. The interpreter should also handle traffic from both host and network, each of which functions in a different manner, in addition to servicing its own processes. The processing of the incoming objects may lead to the creation of new objects to be injected back into the network or to the host. During implementation, each of the above mentioned limitations had to be handled to arrive at a fully functional interpreter. Based on the results, the following thesis statement can be derived:

A programmable high speed NIC can service incoming network packets independently without host assistance and can improve aggregate throughput in a network of workstations if non-computation intensive jobs are offloaded to the NIC. Running a distributed object-oriented interpreter on the NIC for the Emu system extends this functionality further by providing dynamic migration of objects back and forth at runtime.

This virtual machine on the network interface enables services to be migrated to the network interface, as well as to other processors as in the traditional migration method. This work has introduced a new meaning to service migration.

1.4 The Network Environment

Emu is implemented for the high-speed Myrinet network. Myrinet is a full duplex 1.28+1.28 gigabit-per-second link. For distributed computing or clustered computing, a Myrinet System Area Network (SAN) or Myrinet Local Area Network (LAN) would be the ideal choice for the following reasons [Myr97c]:

1. Speed: connects workstations or PCs through full duplex 1.28+1.28 gigabit per second links and low latency cut-through switches, compared to 10 Mbits/sec or 100 Mbits/sec Ethernet or 155 Mbits/sec ATM.

2. Convenience and robustness: allows any network topology, unlike unswitched Ethernet. Being a switched network, it supports source routing and allows dynamic reconfiguration of the network, which (a) is a convenience for installation and (b) provides fault-tolerance. With the routing tables being stored in the SRAM of the network chip, they can be modified on the fly.

3. Network throughput: being a switched network, Myrinet can carry many packets concurrently, each traversing the network at 1.28 gigabit/second. Moreover, unlike Ethernet and FDDI, which share a common communication medium, the total traffic of Myrinet increases with the number of hosts since the ratio of hosts/switch is constant.

4. Versatility: like Ethernet, Myrinet can carry packets of many types or protocols simultaneously, as its packets can be of varied length, and can encapsulate other types of packets such as IP without an adaptation layer, in contrast to, for example, the AAL5 adaptation layer of ATM.

The Myrinet interfaces include a LANai processor, a general purpose custom-VLSI chip which has a 32-bit RISC processor. The programs it executes are called Myrinet Control Programs (MCPs) [Myr97d]. The MCPs supplied by Myricom, Inc., handle interactions with the network and host processors. Any MCP must run within the 1 MB of RAM which the LANai adaptor supports. Emu's MCP running within the LANai is a complete distributed object oriented interpreter, in addition to controlling the flow of data between the host and the network with the help of the DMA engine on the Myrinet NIC.
The experimental framework consists of the Myrinet SAN network composed of 16 Pentium-II 266MHz, 128MB RAM, 512KB cache, 440FX chipset processors and 16 M2M-PCI32C Myricom boards. These boards are Myrinet-SAN/PCI interfaces, PCI short cards which support 1MB of memory. The benchmarks involved in the selection of the processor are discussed in Chapter 3.

1.5 The Emerald Approach

Emerald [BHJ+87, JLHB88, RTL+91, BHJL86], developed at the University of Washington, is an object oriented programming language that was designed to be used in a distributed environment. Hence, it inherently provides for abstraction, moving objects from one machine to another, locating those objects easily because of their unique global identifiers, and other basic distributed services. It is a complete object-migration system and so provides all the services necessary for migration and can be easily extended to provide services for active networking. The language also provides replication and transactions, thereby supporting fault-tolerant and highly available applications.

Emerald provides fine-grained mobility. It supports both small passive data objects and active process objects. In addition to the traditional advantages of migration such as load sharing, fault-tolerance, availability, etc., fine-grained mobility in Emerald also supports the following [JLHB88]:

• It provides a protocol for moving data (objects) from one machine to another without requiring the user to provide marshalling/unmarshalling of the data.

• It improves remote invocation performance by moving parameter objects to the remote node to reduce dependence on the source for the duration of the remote execution.

• It moves objects to sites where references exist in order to simplify distributed garbage collection.
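The first of these points, system-provided marshalling, can be illustrated in C. Emerald generates the equivalent of the following pack/unpack routines for every object type, so the programmer never writes them by hand; the `toy_obj` layout and function names here are purely illustrative, not Emerald's actual wire format.

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Toy object: a global object identifier plus one integer field. */
typedef struct { uint64_t oid; int32_t value; } toy_obj;

/* Pack the object into a wire buffer; returns the number of bytes
 * written.  Field-by-field copies keep the encoding independent of
 * struct padding on the sending machine. */
size_t toy_marshal(const toy_obj *o, uint8_t *buf)
{
    memcpy(buf, &o->oid, sizeof o->oid);
    memcpy(buf + sizeof o->oid, &o->value, sizeof o->value);
    return sizeof o->oid + sizeof o->value;
}

/* Rebuild the object on the receiving machine. */
size_t toy_unmarshal(toy_obj *o, const uint8_t *buf)
{
    memcpy(&o->oid, buf, sizeof o->oid);
    memcpy(&o->value, buf + sizeof o->oid, sizeof o->value);
    return sizeof o->oid + sizeof o->value;
}
```

Because the compiler knows every object's layout, Emerald can emit such routines mechanically, which is what makes moving an arbitrary object as cheap as its size warrants.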
Figure 1.1: Emerald Interpreter on the Network Interface Card (NIC)

Emu focussed on building an interpreter for the Emerald byte codes that could execute on the LANai (lanai-emx) and communicate with the one on the host. This interpreter provides features to support service migration and active networking via objects. It is written in the form of an MCP and is loaded by the host into the memory of the Myrinet interfaces. In Emu, there are two separate Emerald interpreters per host: (a) one on the host (host-emx), and (b) one on the Myrinet interface (lanai-emx), as shown in Figure 1.1.

The Emerald emx running within the network interface handles traffic from the network without assistance from the host. lanai-emx is not a simple MCP just processing messages from/to the network and the host. It can actually interpret executable code without involving the host. In other words, a complete virtual machine is running within the LANai network interface. The host-emx and lanai-emx communicate via a shared memory structure on the LANai. An application to illustrate its functionality was designed and implemented as discussed in Chapter 5.

Previous research on migrating functionality, other than basic mux/demux of messages, to the network adapter is discussed in the next chapter.

1.6 Thesis Structure

The layout of the following chapters is outlined below:

• Chapter 2 reviews prior research on systems that deploy service migration and migrate functionality to network interface cards. This is followed by a discussion of the languages considered for the framework of Emu. The drawbacks of these languages, which initially seemed a better choice than Emerald, are also explained. It concludes with the contributions of Emu that differ from prior work.

• Chapter 3 provides the background for understanding the design of Emu. It discusses the capabilities of the LANai 4 processor that runs on the Myrinet NIC, and the Myrinet API available. It explains the choice of the Pentium-II for the Myrinet testbed at UBC and the benchmarks that influenced its usage. It also provides some background about the Emerald programming language that is used as the framework for Emu. The original model of the Emerald interpreter is reviewed to understand the complexity of the changes made in Chapter 4.

• Chapter 4 discusses the design and the implementation of Emu. There are two virtual engines, one on the host (host-emx) and the other on the LANai (lanai-emx), with the latter consuming most of the time spent in the construction of Emu. It explains the protocols designed to circumvent the limitations posed by the LANai environment, the communication protocols for the new Myrinet network, the complex mux/demux needed for Emu, the routing protocols, and the modifications made to host-emx to assist lanai-emx with OS support. It also discusses some miscellaneous issues that arose during the design of Emu.

• Chapter 5 discusses the performance of Emu. Micro benchmarks that were run on Emu help analyze the slowdown on the LANai for computation and communication intensive operations. The slowdown occurs because the network processor is slower than the host processor by a factor of 8. This is followed by a review of the application designed to show the robustness and usefulness of the Emu model, and the efficiency of such a system.

• Chapter 6 summarizes the thesis and provides suggestions for future work.

Chapter 2

Related Work

This chapter discusses prior work related to the thesis. The first section reviews some existing migration systems, focusing on significant features relevant to this thesis. A couple of recent systems that migrate some fixed set of functionality to the network interface card are reviewed in Section 2.2. Executing code on the NIC is a comparatively new approach and not much research exists in the literature. Section 2.3 discusses the languages considered for building the virtual machine, including some active networking systems/languages, as these inherently support mobility. It also draws out those features which make them an impossible choice to be used on the LANai adaptor. The chapter concludes with a summary of how this thesis differs from prior approaches.

2.1 Existing Migration Systems

Process migration research was first undertaken in the late 1970s and a considerable number of experimental systems have since been implemented and tested. To migrate a process from one machine to another, the following steps are taken:

1. Freeze the process at the source.
The V migration facility considers these three issues as vital to its design: (1) a network-transparent execution environment, (2) minimal interference due to migration, and (3) no residual dependencies on its previous host. In order to not increase execution time due to migration, V does pre-copying of the process' state while continuing execution. This leads to two or three copies of dirty pages before the host is finally migrated to the destination to resume execution.  11  The M O S I X multicomputer operating system [BSW89], developed at the Institute of Computer Science, the Hebrew University of Jerusalem, is an extension of Unix. It is designed for a network of workstations (NOW) [PCA95], and supports a comprehensive resource sharing policy. This policy depends on various factors such as load sharing, memory sharing and I P C optimization. Algorithms to calculate these factors are executed independently on each workstation. Later, they exchange results with algorithms from other workstations for information gathering and dissemination. Migration decisions are made by this policy with partial knowledge about the state of the network, and do not depend on complete information of the state of the N O W , or any particular workstation. Such a migration system where decisions are not completely centralized, especially on a low latency network such as the Myrinet [BCF+95], scales well to large configuration. Charlotte [AF89], work done at the University of Wisconsin-Madison, makes distinct separations of policy (when to migrate which process to which destination) from mechanism (how to detach, transfer, and re-attach the migrant process). Their design of keeping the policy (in the user space) separate from mechanism (in the kernel space) allows various efficient and flexible policies like load sharing, load balancing, etc., to be used. This also obviates the need to recompile the kernel each time the policy is changed. 
While M O S [BL85] and Locus [WPE+83] freeze processes when selected for migration, and V freezes near the end of transfer, Charlotte freezes somewhere in between (only when the context is marshalled and transferred). Process Migration in D e m o s / M P [PM83] was implemented efficiently with no system changes due to such characteristics as a uniform and location independent communication interface, and the participation of the kernel in sending and receiving operations in the same manner as a normal process. It is a message-based operating system (communication is the most basic mechanism) with messages being sent using links to specify the receiver of a message. So when processes migrate, some level of transparency is achieved using message forwarding and updating of links. Unlike V  12  and Charlotte, execution of processes stop when they are chosen for migration. The bottleneck in process migration is the transfer of the possibly large virtual address space of the process being migrated.  To alleviate this problem,  Accent [Zay87] uses a lazy transfer copy-on-reference approach. It uses different variations of virtual memory copy (a) Pure-copy (complete physical copy of the address spaces), (b) Pure-IOU (logical transfer of address space, which requires only portions of the address space to be physically transferred), and (c) RS (moving resident set during migration). Although Pure-IOU and RS hold a clear advantage in the address space transfer of migration, they suffer longer remote execution times when compared to Pure-copy. Hence only processes which access a small portion of their address spaces at the new site are best suited to the use of the copy-on-reference technique. Sprite [Dou91], developed at the University of California, Berkeley, migrates processes at two particular times: (a) exec system call when a resource-intensive program is about to be initiated, and (b) eviction of foreign processes when user returns to his workstation. 
This is one of the few migration systems to receive ;  extensive practical use besides Locus [WPE+83] and M O S I X [BSW89].  Sprite's  designers considered trade-offs among transparency, residual dependencies, performance and complexity in the design of their system. Sprite uses a different form of lazy copying from Accent which also makes use of its existing network services. Backing storage for virtual memory is implemented using ordinary files which are stored in the network file system and are accessible throughout the network. When a process is migrated, the source freezes the process, flushes the dirty pages to the backing files and discards its address space. The process then resumes execution at its destination host with no resident pages, and pages are loaded from backing storage as needed using the standard paging mechanism. Condor [LTBL97] is a distributed batch processing system for Unix developed at the University of Wisconsin-Madison. Condor makes use of idle workstations by  13  migrating processes via checkpointing of their state [LA89, LS91]. The mechanism used is to write all of the process' state information (data, stack, information about open files, pending signals and C P U state) into a file, and to use this information to restore state on the destination machine at restart time. Sprite and V have been carefully designed and implemented to accommodate migration, which is not the case here. Hence not all processes can be migrated in Condor. For example, as there is no way of saving all the state necessary for every, kind of process, there is no way of migrating a set of communicating processes. A survey of systems providing process or object migration facilities is discussed by Mark Nuttall in [Nut94] and a more detailed survey is presented by Milojicic et al. in [ M D P 9 6 ] . 
Some of the above systems would have been ideal choices if not for two constraints: (a) the processes are too big to fit within 1MB of memory, and (b) they do not support heterogeneity. The 1 Mbyte memory limit of the LANai network interface narrows down the choice of the system to use.

2.1.2  Object Migration Systems/Languages

This section discusses systems that migrate objects instead of processes. An object is usually part of a graph of references. Hence when an object is to be migrated, one can either move just that object, some set of objects in the graph, or the entire graph. Moving only the object in question may not be appropriate. By moving an application-specific set of objects, unnecessary remote invocations can be avoided. A common problem of these systems is determining what comprises the persistent state of the object so it can be encapsulated before migration. The most typical solution is to provide a description of each object's interface using an Interface Definition Language (IDL), as used by DCE++ [SM], or to use routines that marshall an object's state into a buffer. However, in spite of the various research systems, no object migration system has achieved the commercial acceptance enjoyed by a few process migration systems.

Emerald [BHJ+87, JLHB88, RTL+91, BHJL86], discussed in chapter 1, has the advantage of being designed to support mobility, unlike other systems which have had migration added to them after the fact. It also has additional benefits over process mobility because the overhead of an object is commensurate with its complexity. Hence mobility in Emerald provides a relatively efficient way to transfer fine-grained data from source to destination. Emerald has been designed for a LAN of not more than 100 nodes. Emerald's designers used templates instead of tag bits (used in Distributed Smalltalk [SB88]) to distinguish data types, such as integers, from object references, a choice which makes mobility easier.
Performance of mobility was traded off in favor of fundamental operations like local invocations. Nevertheless, it is a very efficient design, which explains why its techniques have been adapted in many of the systems discussed below.

The DOWL [Ach93] distributed object-oriented language, developed at the University of Karlsruhe, Germany, adopts most of Emerald's mechanisms:

• Mobility of objects is restricted by fixing (move an object to another machine and fix it there) and unfixing them (the object is free to move again).

• Objects can be attached: by using the reserved word attached in front of data closely related to the moving object, all the attached objects also move to the destination site of the moving object when it migrates. Attachment is transitive but not symmetric.

• Besides passing-by-object-reference, support for pass-by-move (move the argument object permanently to the invoked object's location) and pass-by-visit (move the argument object temporarily to the invoked object's location) reduces the number of remote invocations.

If one of the object's methods is activated, it can still migrate by moving the invocation context along with the object. Execution resumes at the target node and operation results are transparently delivered to the caller.

DCE++ [SM], developed at the University of Karlsruhe, Germany, is an extended distributed object-oriented environment on top of DCE (the OSF Distributed Computing Environment, which is becoming an industry standard for open distributed computing). It supports mobility based on concepts introduced by Emerald [JLHB88], Amber [CAL+89], Arjuna [SDP91] and Amadeus [HC91]. Migrated objects can be accessed in a uniform way by clients, and concurrent migration and invocation requests are synchronized. A migrating object leaves a proxy at its former location, which results in forwarding chains of proxies when it gets invoked. The location of the object gets updated on the proxies upon stepwise return of the call.
However, the birth-node of the object is always updated upon every migration and is used in case the forwarding chain breaks.

Comet [MV93], developed at Katholieke Universiteit Leuven, Belgium, uses a multi-dimensional migration approach. The designers of Comet consider that location is not the only attribute whose change in value is worth considering during migration. Other attributes such as state representation, access limitations or failure handling semantics also change during the lifetime of an object. Migration of objects in Comet is considered to take place in an N-dimensional space, where N is the number of different attributes or flavors, and hence object migration is expressed in terms of coordinate changes along each of the axes.

Cool [LJP93] is an object-oriented kernel implemented above the Chorus [RAA+88] microkernel which provides mechanisms to instantiate, migrate and store objects. The designers of Cool have adopted a layered approach: it is implemented as three layers (COOL-base, generic runtime (GRT) and language-specific runtime) above the Chorus microkernel. It was designed with the aim of providing a generic set of abstractions for better support between existing and future object-oriented systems, languages and applications. It also supports passive objects (persistent memory) and activities (threads of execution). Activities can migrate from one object to another, and objects can contain several concurrently executing activities.

A survey of object mobility is given in [Nut94]. Many of these systems merge two or more high-level language calls into a single feature (for example, the visit primitive in Emerald is a combination of locate and move), thereby reducing the probability of errors (as the underlying system takes care of the details) and the number of remote invocations.
2.1.3  Heterogeneous Migration Systems

Some systems support the migration of processes among processors of varied architectures, or among processors that run different operating systems. In addition to load or resource sharing, migration among different architectures has the advantage that some computations are considerably more efficient on certain architectures.

The process-originated migration system [Shu90, DRS89], developed at the University of Colorado at Colorado Springs, is a prototype built on the existing homogeneous migration features of the V system. Its approach eliminates many portability problems by having a compiler resolve the intricate migration issues. Since a process requests its own migration, the request occurs at a reasonably clear point in its execution. In addition to modifications to the linker and the compiler (which take care of the translation of text, heap, data, stack, process control block, etc.), the operating system was modified to include a migration server to satisfy the process-initiated migration requests. The data message is simplified because at the time of the migration request, all the needed state information is in memory, and none is in the machine registers of the source machine. The designers of the system have traded complexity of representation, and overhead at migration or page-fault time, for pure runtime speed.

Heterogeneous process migration by recompilation [TH91] introduces the concept of a source-level program being automatically created, transferred to the destination machine, and then recompiled. When the program is executed, it restores the state of the process at the destination and then continues the migrated process' execution from the correct location. This approach is similar to that of the previous systems in that it deals with compiled programs rather than with interpreted source programs.
However, unlike the previous case, the details of machine- and compiler-dependent data translation procedures are hidden in the compilers and in the debuggers by employing recompilation techniques. The main disadvantage is that the migration cost is greatly increased because of source code compilation.

The Emerald system [BHJ+87, JLHB88, SJ95] supports fine-grained object and thread mobility among heterogeneous computers, with threads executing native code. In Emerald, threads follow objects as they move around. The translation of machine-dependent code and data is dealt with efficiently by making changes to the compiler. These changes enable the runtime system to transform the machine-dependent code or data into a machine-independent format. However, translation of differently optimized versions of the code running on different architectures has not been dealt with, because it would require the runtime system to be able to invoke parts of the compiler at runtime. Migration of active threads was implemented with the use of bus stops [SJ95]. A bus stop represents a program counter value that is the same for all processor types, and hence can be considered a safe migration point. This leaves the code generator free to optimize all code between these bus stops.

The Heterogeneous Migration Facility (HMF) [BVW95], developed at the Department of Computer Science, Dartmouth College, requires explicit registration of the data to be migrated. The migration library must be linked with the migrating program in order to provide procedures for automatically migrating processes and checkpointing data. The system is based on a graph model for single-process migration. Migration costs are based on the vertex and edge weights of the graph, which require complex analysis of many factors; this is a serious disadvantage of the model.
The Tui system [Smi97] differs from the above systems in that in all the previous cases, the program must be written in a type-safe language or in a type-safe subset of a language. Tui supports heterogeneous migration of common languages that are not type-safe. It also allows an external agent to make a migration request, in addition to process-originated migration, which is the case in most prior systems. While most of the above systems incorporate the marshalling code into the migrating process itself, Tui is a completely separate program.

2.2  Moving Functionality to the Network Adaptor

Research on migrating functionality to the network adaptor in order to increase processing speed, reduce round-trip latencies, etc. began in the 1980s. Network processors supported complete implementations of protocols, such as TCP/IP, that were not provided by the operating system. But as operating systems adopted these protocols, these functions were dropped from the NIC. This was because the marginal improvements in reducing host overhead did not justify the high cost of running these protocols on the network processor with 10Mbit/sec Ethernet. Most of the work done was commercial or proprietary, and hence not available in the literature. But with the birth of research into high-speed networks, these ideas were revisited. The following two sections discuss work recently done in this area.

2.2.1  U-Net

With the advent of ATM [dKdP88], von Eicken et al., from the Department of Computer Science at Cornell University, developed U-Net [vEBBV95], a user-level network interface for parallel and distributed computing. The goal of U-Net, implemented using off-the-shelf ATM hardware, is to remove the kernel from the communication path while providing full protection, so that user processes are limited only by network capabilities.
It provides a virtual view of the network interface to user processes, so that each process has the illusion that it owns the network adaptor. By removing the kernel from the communication path, an extra memory copy is avoided when sending or receiving messages from the network. This is because user process memory can be mapped to the network adaptor's memory, and direct DMA can access user memory locations. U-Net supports not only traditional internetworking protocols like TCP and UDP but also novel ones like Active Messages. The TCP and UDP protocols implemented using U-Net have latencies close to the raw minimum, and throughputs close to the raw maximum. U-Net Active Messages, a prototype of the Generic Active Messages (GAM) [ea95], have round-trip times only a little above the absolute minimum.

The U-Net model explained above was extended further, by von Eicken et al., into the U-Net/MM architecture [BWvE97]. This allows memory to be copied from and to any part of the application's address space. In other words, memory is not physically pinned down as in the previous case. By integrating a translation look-aside buffer into the network adaptor, which coordinates with the operating system, network buffer pages are pinned and unpinned dynamically. This architecture has been demonstrated using both ATM (PCA-200/Linux implementation) and Ethernet (DC21140/Windows NT implementation). While the former performs both address translation and mux/demux of the messages in the network adaptor, the latter relies on the host for both functions. Their evaluation leads to the conclusion that while performing mux/demux of messages on the network interface avoids buffer copies on receive, it is better to perform address translation on the host to reduce communication latency and complexity.
2.2.2  Hamlyn

Hamlyn [BJM+96], developed at Hewlett-Packard Laboratories, is a sender-managed interface architecture optimized for clustered computing using Myrinet. Its latency (both hardware and software) is just a few microseconds, and it provides full protection among the different applications with little or no OS intervention. Being a sender-managed design, messages are only sent when the sender knows that there are sufficient buffers at the receiver sites. So it needs to maintain some information about memory space availability at the destination, taking into account buffer overruns and retransmission under heavy loads.

To achieve low latency, direct access to the Myrinet interface is given to the different applications with no OS intervention. So applications can DMA memory between the EBUS (External-access Bus) and LBUS (LANai Bus) without any interference from the kernel. To achieve high bandwidth, Hamlyn supports a scatter-gather direct memory access (DMA) capability that frees the host for other operations during long transfers. Since Myrinet [BCF+95] has very low (almost negligible) error rates, flow control, and a physically secure network, messages need not be held for retransmission and packet acknowledgements are unnecessary.

However, in these systems only a fixed set of functionality has been ported to the network interface, with no techniques to extend it at runtime. Unlike these systems, Emu supports a full-fledged virtual machine running within the network adaptor, thereby allowing extensions at runtime. This is believed to be the first such virtual machine on a network interface that can mux/demux messages, DMA memory, build objects, execute them, etc. within 1MB of network interface memory with no host intervention.

2.3  Other Languages Considered for the Interpreter

Prior to choosing Emerald as the basis of Emu, other languages were studied.
Initially, Java seemed to be the ideal choice since it is popular and commercially accepted. Another avenue of thought was choosing an active networking language like PLAN. Such a language would support mobility, and thus could be easily modified for service migration. The following sections briefly discuss the advantages and the disadvantages of each of the languages considered.

2.3.1  Java - Language of the Internet

The features of Java [Boo96, BMKK97] that are advantageous for mobility are (a) its excellent exception handling mechanism, (b) it is object oriented, (c) it provides support for encapsulation of objects, (d) it has no pointer references, and (e) it provides marshalling and un-marshalling routines. Unfortunately, Java's interpreter is 50 times slower than C. It does not provide migration primitives or compiler runtime support on different architectures. For active networks, Java is probably much too bulky, as the virtual machine comes with a much larger overhead than C. Hence, the advantage of the gigabit network might be lost. But the main reason for not choosing Java is the massive size of its code, and the high rate of increase of its code size. It would have been almost impossible to fit the code, including heap and stack, in 1MB of LANai memory. Even removing unnecessary classes might not have solved the problem.

2.3.2  TCL

TCL [Ous93], a prototyping language, supports many of the features of conventional procedural languages, including variable assignment, procedure calls and control structures, and in addition it has very easy access to graphical widgets. TCL is less strict in its handling of data than many conventional languages, and does not provide the data hiding or modularity features of languages with strong support for abstract data types, such as ML, Modula-2, Modula-3, or Ada.
Thus it may not be an appropriate language in which to write large and complex programs that are required to be highly reliable and robust.

2.3.3  Existing Active Networking Systems/Languages

The concept of active networking involves moving executable code around the network, unlike the traditional method of moving just passive data. This code can be executed at various routers by running an interpreter on them. Recent work on active networking includes [TSS+97, TW96, BCZ96a, BCZ96b, TGSK96, WT96]. Active networks allow users to inject customized programs into the network. These programs can help in the dynamic modification of the network. For example, active programs can assist routers and gateways in making intelligent routing decisions based on the network traffic at a particular time, as well as in dealing with failures due to network partitioning without manual intervention. Languages for active networking include Telescript [tel96b, tel96a, tel96c, tel96d], PLAN [GHK+97, Kak97, MH97, Moo97], Wave [Sap95, Sap88], Messengers [Bic95, FBDM96], and others.

Telescript

Everything in Telescript [tel96b, tel96a, tel96c, tel96d] is an object. It is an interpreted language and has two levels: high Telescript has an object-oriented syntax and compiles to low Telescript, which has a postfix syntax for stack-based interpretation. Telescript supports preemptive prioritized multi-tasking of process objects (objects with a life of their own), migration, "meeting locations" for agents, persistent objects and authentication. Telescript's source code could not be obtained easily for experimentation owing to the fact that it has become a commercial product. Odyssey, its successor, is composed of Java class libraries to support distributed applications, and hence poses the same problems as mentioned in section 2.3.1.

PLAN

PLAN [GHK+97, Kak97, MH97, Moo97] is a script-like functional language similar to ML.
It is designed to satisfy the following two functions in an active network: (i) resource discovery: the location, invocation, or installation of services, and (ii) diagnostics: analysis of network status or behavior. PLAN consists of two levels. The PLAN level, which is the lower level, provides services for the above two functions, and consists of PLAN programs which are distributed to all PLAN routers. The SERVICE level, which is the higher level, includes services that are available only to certain routers; unlike services from the lower level, their use may therefore require authentication. It usually includes a rich collection of protocols which provide functionality for an active network. Programs are carried in the packets and executed at the different routers. However, PLAN packets do not communicate with each other directly as in Telescript, but only through application programs at endpoints or service calls on the routers.

Although PLAN was successfully installed, it could not be ported to the Myrinet. The PLAN interpreter grows to over 6MB (6776KB) of memory to handle simple programs such as ping and traceroute, and would clearly grow with more complicated services. This is because current PLAN implementations are built on top of Java or Caml, and the memory limitations of the LANai adaptor (1MB) make such sizes unacceptable.

2.4  Contributions of this Thesis

Emu has made possible the running of a heterogeneous object-oriented distributed virtual machine on a NIC. The interpreter running within the LANai network adaptor supports all the features of a regular virtual machine running on the host. In Hamlyn and U-Net, the functionality migrated to the network interface is statically fixed, unlike Emu, which can be extended at runtime. Emerald, an object-oriented migration system, was selected as the framework due to the limitations posed by the other migration and active networking systems discussed.
The LANai, being a different environment, makes the homogeneous and heterogeneous systems discussed unsuitable for running on the network processor, which is slower than the host processor by a factor of 8. Homogeneous process migration systems have introduced the fundamental concepts of migration, but these systems are coarse-grained because the processes are big and their state cannot be as easily marshalled as that of objects. It would be impossible to run one of these systems within 1MB of RAM. Also, these systems do not support heterogeneity, and the LANai is quite unlike regular processors, with a very unfriendly programming environment. In the case of the discussed heterogeneous migration systems, the limitation is the very slow recompilation times, which would reduce throughput in a gigabit network; the usefulness of such a network would be lost.

The active networking systems, which provide the support for migrating functionality to the NIC, have the wrong languages for the LANai adaptor. The memory limitations of the LANai adaptor force the virtual machine to be small and optimized, such that its code, stack and heap can run within 1MB. For example, PLAN did not satisfy this constraint. Moreover, a case study of PLAN implied that a more efficient active networking scenario can be deployed with a much simpler distributed language. Java's main disadvantage was its ever-increasing code size, which would otherwise have made it the first choice due to its wide usage. C and C++ provide no support for mobility or marshalling and unmarshalling routines, and have to be programmed at the socket level. These fundamental functions are inherently provided by the Emerald language, which made it a better choice. The Emerald virtual machine has been successfully ported and it runs within 1MB of memory. It is object-oriented, which makes migration easier, and it supports heterogeneity.
Although Emu has been developed using techniques discussed in the research above, it is believed to be the first system on a network interface card supporting functionality which can expand at runtime.

Chapter 3

Background

Emu was implemented for the Myrinet network interface cards. A review of the network hardware, and of the software interface between the LANai and the host, is required to understand the design issues of Emu. The Emerald language, which is used as the framework for Emu, is also reviewed.

The layout of this chapter is as follows. Section 3.1 reviews the LANai4 processor and the Myrinet API. Section 3.2 discusses the various host processors considered for the Myrinet network; this is followed by the results of the experimental benchmarks to support the choice of the Pentium-II for the Emu system. Section 3.3 explains the Myrinet layout, the Myrinet/PCI interface card, the dual Myrinet switches and the cable support. Section 3.4 discusses the language support for mobility, followed by a background on the Emerald interpreter design and implementation, and Section 3.5 summarizes.

3.1  Myricom Technology and Details

Myrinet [BCF+95] is a local-area network (LAN) designed for packet communication and switching within massively parallel processors (MPPs). The characteristics that distinguish Myrinet from other LANs are (a) high data rates owing to its full-duplex pair of 640Mb/sec channels, unlike the uni-directional Ethernet channels, (b) very low (almost negligible) error rates, (c) unlike unswitched Ethernet and FDDI LANs, the Myrinet topology is scalable and the aggregate capacity grows with the number of nodes, (d) the regularity of its network topology avoids deadlock because it allows for simple algorithmic routing, (e) it uses flow control and error control on every link, and (f) it uses low-latency cut-through switches.
Various parts of the following subsections contain details about LANai processors and host/LANai interface support which were taken from the Myrinet guide [Myr97a] and the Myricom website ("http://www.myri.com/").

3.1.1  LANai4 Processor

Each network interface card contains a LANai4 processor, a programmable communication device that provides the interface to the Myrinet LAN or SAN network. As shown in Figure 3.1, it consists of an instruction-interpreting processor and a packet interface, which together make up the LANai core, as well as a DMA/Checksum engine, the Myrinet-link interface and the host/IO interface. The arrows in the figure indicate the flow of data between components.

Figure 3.1: Block Diagram of the Myrinet/Host Interface

The LANai processor [Myr97a, BCF+95] is a 32-bit dual-context machine with 24 general-purpose registers. There are two contexts. The user context can be interrupted by events that cause the processor to switch to the system context. The system context is uninterruptable, but it can be explicitly switched to the user context by executing a punt (assembly) instruction. Three special registers control interrupt generation: (a) the Interrupt Status Register (ISR), (b) the Interrupt Mask Register (IMR), and (c) the External Interrupt Mask Register (EIMR). Packet-interface special registers and interrupt-control special registers are all memory-mapped except for the IMR. All the memory-mapped registers can be accessed by both the LANai processor and the external-access bus (EBUS). The IMR is an internal on-chip LANai register and hence not accessible from the EBUS.
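The two-context interrupt scheme just described can be sketched in C. This is an illustrative model only, not actual LANai code: the registers and the punt instruction are represented as ordinary variables and a function, and the event bit values are assumptions made for the example.

```c
#include <stdint.h>
#include <assert.h>

/* Illustrative model of the LANai's two execution contexts.
 * ISR holds pending events, IMR masks them; an unmasked pending
 * event forces a switch from the user context to the system context. */
enum context { USER, SYSTEM };

struct lanai_model {
    uint32_t isr;      /* Interrupt Status Register (pending events) */
    uint32_t imr;      /* Interrupt Mask Register (enabled events)   */
    enum context ctx;  /* current execution context                  */
};

/* Deliver an event: set its ISR bit; if the event is enabled in the
 * IMR and we are in the interruptible user context, switch to the
 * system context. The system context itself is uninterruptable. */
static void raise_event(struct lanai_model *m, uint32_t bit)
{
    m->isr |= bit;
    if ((m->isr & m->imr) && m->ctx == USER)
        m->ctx = SYSTEM;
}

/* Model of the punt instruction: explicit return to the user context. */
static void punt(struct lanai_model *m)
{
    m->ctx = USER;
}
```

The key point the sketch captures is that masked events accumulate in the ISR without disturbing the user context, while unmasked events cause an immediate context switch that only a punt undoes.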
The LANai bus (LBUS) operates at twice the chip-clock speed, i.e., two memory cycles for every clock cycle. The addresses of both the LBUS and EBUS are byte addresses, and the byte order is big-endian. Words (32 bits) and half-words (16 bits) must be aligned on the LBUS; the LANai ignores any least-significant bits of an address that would make it non-aligned.

The LANai chip provides bi-directional access to the Myrinet network. The packet interface handles the injection of packets into the network and their consumption from the network. Special memory-mapped registers are used to access the packet interface. On the host side, the LANai core cannot access the EBUS directly, and the processor has to initiate data transfers between the LBUS and EBUS. In the typical mode, the LANai processor with the memory on its LBUS appears as a synchronous memory block from the EBUS. But when the DMA engine starts a data transfer between the LBUS and EBUS, the chip becomes the master and the EBUS the slave. The EBUS is simple and needs extra hardware to connect to any standard bus such as PCI.

The processor can compute partial Internet checksums and has a special ChecKSum (CKS) register which gets modified during EBUS-LBUS DMA transfers. It also has two real-time counters: (a) the real-time clock, which gets incremented every time-reference period (1 microsecond), and (b) the interrupt timer, which gets decremented on every time-reference period. It also has miscellaneous special registers such as the TIMEOUT register (which specifies the timeout period of the LANai watchdog timer), the MYRINET register (its bit 0 NRES-ENABLE indicates whether the board is to be reset, and bit 1 CRC-ENABLE indicates whether CRC verification is to be performed on incoming packets), etc.

3.1.2  Myrinet API

The Myrinet API provides functions required for programs that run on the LANai.
The LANai libraries must be initialized by certain function calls prior to their use. The topics discussed in the following sections are:

• Myrinet Control Program

• Channel

• Addresses

• Resets to the LANai boards

• DMA

• Sending/Receiving Messages

These are discussed with the Myricom standard MCP as an example. Two topics, channels and sending/receiving messages, are very specific to the Myricom MCP. The remaining issues are relevant to other MCPs, although the details might not exactly conform to them.

MCP

A program that runs on the LANai chip is called a Myrinet Control Program (MCP). An MCP is not burned into the LANai chip; it is loaded from the host into the LANai memory using the lload Myrinet tool. The lload program uses the lanai_load_and_reset() function from the Myrinet API library. The Myrinet API must be compiled with the LANai code to be loaded into the LANai SRAM. The device driver performs the following to load an MCP into the LANai using the Myrinet API:

1. Reset the LANai with lanai_reset_unit().

2. Set the clock value with lanai_set_clock_value().

3. Load zeros into the entire LANai memory.

4. Load the MCP with myriApiLoadLanai().

5. Set the MCP address with myriApiSetAddress().

6. Set the DMA burst size with myriApiSetBurst().

7. Release the LANai from reset mode by calling lanai_reset_unit() again.

An MCP can do general-purpose computing, but standard MCPs typically:

• Receive messages from the network and send them to the local host.

• Send messages from the local host to the network.

There are some standard MCPs that provide only these services. Examples of such MCPs are the standard Myricom MCP and the "Simple Message Protocol" (SMP) [SHDM96] from Mississippi State University. There are other MCPs which provide more functionality than just the basic service of flow control. One such example is the BullDog MCP [HDMS97, HDS97],
One such example is the BullDog M C P [HDMS97, HDS97], 30  also from Mississippi State University, which provides different protocols for ordered/unordered reliable message delivery. But its functionality is limited to reliable ordering and delivering of messages.  Channel A channel is the software interface between the host and the M C P . Each channel consists of three queues: (a) Send queue, (b) Receive queue, and (c) Receive Ack queue. The number of channels in the M C P , which is configurable depending on the kind of L A N a i board, can be obtained using the myriApiGetNumChannels()  API  function. Channels are numbered starting from zero which is always used by the T C P / I P driver. Any channel should be used only by a single host process.  Addresses of LANai Boards/MCPs Every L A N a i board has a unique 48-bit address written into its E E P R O M which the host relays to the M C P using the LANaiDevice function calls. The M C P messages use 64-bit addresses which is of the format 0:0:a:b:c:d:e:f where a:b:c:d:e:f is the 48-bit address burned into the L A N a i board. The static routing table stored in the L A N a i converts the 64-bit address into a route of the path to be traversed. If the highest bit of the M C P address is set, then it is a multicast address. Otherwise, it is a regular address. When the highest bit is set, the message is delivered to all L A N a i boards connected to the Myrinet network. If it is a regular address, the Myrinet address and the channel number in the header are to be considered. On reaching the L A N a i board with the specific Myrinet address, the message is delivered to the process on the host that holds the specified destination channel.  Resets to LANai Boards To avoid deadlock, the L A N a i boards can be reset at anytime.  
There are two kinds of reset: (a) Hardware reset: this can either be generated by the host via the LANaiDevice library functions, or it can be a forward reset generated by the Myrinet network itself. (b) Software reset: these are generated by the host by calling the myriApiReset() function in the API. While hardware resets cause the MCP to be restarted, a software reset only resets a single channel. After either kind of reset, the MCP needs to re-establish communication with the host via hand-shaking. The host also needs to restore the MCP channel's list of multicast addresses, reclaim any buffers held by the MCP at the time of the reset, and set up the channel queues.

DMA

There are two DMAs to consider when one host communicates with another host via the Myrinet network. The DMA between host memory and LANai memory is achieved with the help of the DMA engine in Figure 3.1. The host DMA pointers must be aligned on a 4-byte boundary, and must be in the memory space specific to the particular operating system they are running on. To be considered valid, the host DMA pointers have different requirements on different platforms and operating systems. For example, a 32-bit-aligned kernel virtual address is considered valid for a Pentium running Linux 1.2.x. Using the same address on a different architecture may lead to problems. During the DMA, the memory must be locked so it is not otherwise disturbed. The second DMA, between the LANai memory and the network, is carried out with the help of the Packet Interface in Figure 3.1. The DMAs are started by the MCP executing on the network processor.

Message Sending/Receiving

The Send queue is a fixed size circular buffer of length NUMSENDS. When the host wants to send a message, it places it on the Send queue. The MCP removes the message and sends it over the Myrinet network. At any time, there can be only NUMSENDS messages in the Send queue.
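The Send queue's behaviour as a bounded circular buffer can be sketched in C; this is a simplified model for illustration only (NUMSENDS here is an arbitrary depth, and the slot layout and index fields are invented, not the actual MCP data structures):

```c
#include <string.h>

#define NUMSENDS 8            /* illustrative queue depth */
#define MAX_MSG  256          /* illustrative message size */

/* One slot in the (hypothetical) Send queue. */
typedef struct {
    unsigned len;
    char data[MAX_MSG];
} send_slot;

/* Circular Send queue: head advances as the MCP drains
 * messages, tail advances as the host enqueues them. */
typedef struct {
    volatile unsigned head;   /* next slot the MCP will send */
    volatile unsigned tail;   /* next free slot for the host */
    send_slot slots[NUMSENDS];
} send_queue;

/* Host side: returns 0 when the queue is full, in which case
 * the host must wait for the MCP to drain a slot. */
int sq_enqueue(send_queue *q, const char *msg, unsigned len)
{
    unsigned next = (q->tail + 1) % NUMSENDS;
    if (next == q->head || len > MAX_MSG)
        return 0;                      /* full: host has to wait */
    memcpy(q->slots[q->tail].data, msg, len);
    q->slots[q->tail].len = len;
    q->tail = next;
    return 1;
}

/* MCP side: remove the oldest message (here it is just copied
 * out; the real MCP would push it onto the Myrinet network). */
int sq_dequeue(send_queue *q, char *out, unsigned *len)
{
    if (q->head == q->tail)
        return 0;                      /* empty */
    *len = q->slots[q->head].len;
    memcpy(out, q->slots[q->head].data, *len);
    q->head = (q->head + 1) % NUMSENDS;
    return 1;
}
```

In this sketch one slot is kept empty to distinguish a full queue from an empty one, so it holds at most NUMSENDS - 1 messages at once; the real MCP's bookkeeping may differ.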
If the Send queue is full, the host has to wait until the MCP has removed a message from the Send queue and has sent it over the network.

Receiving messages is more complicated than sending them, since two queues are involved: the Receive queue and the Receive Ack queue. When the Myrinet adaptor receives a message from the network, it DMAs the message into a single buffer (if the entire message fits in the fixed-size buffer) or into multiple buffers in the Receive queue. Then it increments the Receive Ack queue. The host, on noticing the Receive Ack update, figures out which group of Receive queue buffers contains the new message. It DMAs the message out of the buffers, and puts the empty buffers back in the Receive queue. If either the Receive queue is empty or the Receive Ack queue is full, the MCP drops messages from the network.

3.2 Benchmarks Used for the Choice of the Host Processors

Several CPUs were evaluated for the Myrinet testbed. The ideal processor should not starve either the CPU or the Myrinet DMA engine of the PCI bus, yet should still produce optimal throughput. The following four processors were considered for the choice of host machines:

1. 233MHz Pentium-MMX, 128MB RAM, 256KB cache, with Intel 430TX chipset and ASUS motherboard.
2. 200MHz Pentium-Pro, 128MB RAM, 256KB cache, with 440LX chipset and ASUS motherboard.
3. 266MHz Pentium-II, 128MB RAM, 512KB cache, with 440FX chipset and Intel motherboard.
4. 233MHz K6-AMD, 128MB RAM, 256KB cache, with 430LX chipset and ASUS motherboard.

Four benchmarks were used to choose the most suitable processor for building the high-speed gigabit network. The following sections discuss the relevance of each benchmark, with the experimental results obtained on the four processors.

3.2.1 Results of bw_mem_cp and hswap Benchmarks

The name bw_mem_cp stands for bandwidth memory copy. It measures how fast the CPU can copy memory.
It was implemented using bcopy from the lmbench test suite. The lmbench test suite is a system performance tool which can be obtained from the FreeBSD benchmarks ftp site. hswap measures how fast the Myrinet card can read or write memory with its DMA engine. This benchmark comes with the Myrinet software.

Table 3.1 shows the memory bandwidth measured on each of the processors by bw_mem_cp, both when they are otherwise idle and when the network device (the Myrinet adaptor) is hammering on the PCI bus. The test runs are explained below:

1. bw_mem_cp, no competition: bw_mem_cp is run with the hswap benchmark idle. This determines the speed of the system's memory.

2. bw_mem_cp during E2L phase of hswap: bw_mem_cp is started first and hswap is started later. When hswap goes through the E2L (host to LANai) phase, the value of bw_mem_cp is reported.

3. bw_mem_cp during L2E phase of hswap: this is similar to the second test, except that the value of bw_mem_cp is reported during the L2E (LANai to host) phase of hswap testing.

4. hswap b/w for 8k transfers, no competition: hswap bandwidth results are reported during the E2L and L2E phases without running bw_mem_cp.

Test (all values in MB/sec)                        Pentium-MMX  Pentium-Pro  Pentium-II  K6-AMD
bw_mem_cp, no competition (test 1)                        49.5         53.0        51.0    47.0
bw_mem_cp during E2L phase of hswap (test 2)              5.53        24.34       23.78    8.75
bw_mem_cp during L2E phase of hswap (test 3)             11.65        37.35       36.94    6.54
hswap b/w for 8k transfers, no competition (test 4)
    E2L phase                                            112.2        111.2       121.4   111.5
    L2E phase                                            114.0        121.1       121.0   122.1
hswap b/w for 8k transfers while competing
with the CPU for memory (test 5)
    E2L phase                                            112.2         50.7        49.9   111.4
    L2E phase                                            111.9         40.7        38.7   113.1

Table 3.1: Results of bw_mem_cp and hswap Benchmarks

5.
hswap b/w for 8k transfers while competing with the CPU for memory: hswap bandwidth results are reported during the E2L and L2E phases while also running bw_mem_cp.

All four processors have approximately the same processor speeds (test 1), but the Pentium-II has the highest DMA bandwidth when considering both the E2L and L2E phases (test 4). On combining bw_mem_cp and hswap, both the Pentium-MMX and the K6-AMD maintain the same DMA transfer bandwidth, around 111MB/sec (test 5). But on both of these processors the Myrinet card starves the CPU, leading to very low (in the range of 5-11MB/sec) processor bandwidth (tests 2 and 3). On the other hand, both the Pentium-Pro and the Pentium-II show a significant decrease in their DMA transfer bandwidths, to around 50MB/sec (test 5), compared to about 120MB/sec in test 4. This is because the Myrinet card does not starve the CPU of its cycles, unlike the tests with the Pentium-MMX and K6-AMD processors. The processor speeds decrease only to about 30MB/sec (test 2), compared to 52MB/sec in test 1. The Pentium-Pro and Pentium-II have both good processor and DMA bandwidths when running independent of each other, and also reasonable bandwidths when competing with each other. The Pentium-II has a slightly better DMA transfer bandwidth than the Pentium-Pro. However, both the Pentium-MMX and the K6-AMD fail to deliver balanced performance when the CPU and the Myrinet DMA compete with each other.

Test (all values in MB/sec)     Pentium-MMX  Pentium-Pro  Pentium-II  K6-AMD
bw_mem_rd 8MB                        148.98       228.15      219.86  188.77
bw_mem_wr 8MB                         84.15        82.29       74.05   77.84
bw_mem_cp 8MB
    unrolled aligned                  49.93        47.20       44.30   46.71
    libc aligned                      46.37        51.39       53.25   47.14
    unrolled unaligned                49.91        46.78       44.90   45.24
    libc unaligned                    33.39        50.70       52.84   45.66

Table 3.2: Results of lmbench Test Suite
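The kind of number bw_mem_cp reports can be approximated with a simple timed copy loop; this is only a sketch of the idea, not lmbench's actual bcopy benchmark (which is far more careful about timer resolution and cache effects):

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Copy `bytes` of memory `iters` times and return the apparent
 * copy bandwidth in MB/sec (0.0 if allocation fails or the timer
 * resolution is too coarse to measure the run). */
double copy_bandwidth(size_t bytes, int iters)
{
    char *src = malloc(bytes), *dst = malloc(bytes);
    if (!src || !dst) { free(src); free(dst); return 0.0; }
    memset(src, 0xA5, bytes);         /* touch the source pages */

    clock_t t0 = clock();
    for (int i = 0; i < iters; i++)
        memcpy(dst, src, bytes);
    clock_t t1 = clock();

    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    free(src);
    free(dst);
    if (secs <= 0.0) return 0.0;
    return ((double)bytes * iters) / (1024.0 * 1024.0) / secs;
}
```

Calling copy_bandwidth(8u << 20, 10) roughly corresponds to the 8MB copy runs reported in Table 3.2.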
3.2.2 Results of the lmbench Test Suite

This system measurement tool was used to measure how fast the CPU can read, write, and copy 8MB of memory. Table 3.2 gives the bandwidth in MB/sec for the different tests on the processors. bw_mem_rd and bw_mem_wr measure the bandwidth for reading and writing 8MB of memory. bw_mem_cp is used to measure (a) unrolled (simplistic) and aligned data copy, (b) library (general) and aligned data copy, (c) unrolled (simplistic) and unaligned data copy, and (d) library (general) and unaligned data copy. Aligned means that the source and the destination are aligned to page boundaries, not that the pointers are word aligned.

Both the Pentium-Pro and the Pentium-II have the highest read bandwidths. Although the Pentium-II has a slightly lower write bandwidth, it has almost the same copy bandwidths in all four cases as the other processors. In fact, in the libc case it has the highest copy bandwidth (around 53MB/sec) in both the aligned and unaligned cases.

3.2.3 Results of netperf Benchmarks

The netperf tool measures how fast a network application can run when stressing both the CPU and the Myrinet driver. It makes use of both memory copies (for example, uiomove() in the socket layer) and the Myrinet card (via the myristart() or check_receive() functions in the Myrinet driver). In short, it combines the testing of bw_mem_cp and hswap while sending data over the network. The sender sends data from its memory, which is DMAed to the LANai memory, from where the MCP pushes it onto the network. On the receiver side, the data is DMAed into LANai memory, followed by another DMA into host memory, to the particular process.

In Tables 3.3, 3.4, and 3.5, N/A means that the particular experimental result is not available because a second machine of the required processor type was unavailable for testing.
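The MS_BW and MR_BW columns in the tables that follow are derived quantities; assuming the usual convention of 1 Mbit = 10^6 bits, the conversion from a message count to Mbits/sec is:

```c
/* Convert a count of fixed-size messages moved in `secs` seconds
 * into a throughput in Mbits/sec (1 Mbit = 1e6 bits assumed). */
double throughput_mbits(unsigned long messages, unsigned long msg_bytes,
                        double secs)
{
    if (secs <= 0.0) return 0.0;
    return (double)messages * msg_bytes * 8.0 / secs / 1e6;
}
```

For example, 31116 messages of 8192 bytes moved in ten seconds (netperf's default test length) gives about 203.9 Mbits/sec, close to the 203.89 reported for the K6-AMD sender in Table 3.3.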
UDP_STREAM Testing Using netperf

Tables 3.3 and 3.4 both give results for a UDP_STREAM test (data are sent in blocks over the network without waiting for any acknowledgement), with the latter repeating the same experiment after turning off the checksums. The experimental setup and terminology for Tables 3.3 and 3.4 follow:

• Socket Buffer Size = 41600 bytes.
• Message Size = 8192 bytes.
• M_Sent: Number of messages that the sender sent to the receiver.
• M_Recv: Number of messages that made it all the way up to the user process listening on a socket on the receiver. It is likely that more messages were received into the receiver's kernel than were delivered all the way up to the user process.
• MS_BW: Throughput for the M_Sent case, in Mbits/sec.
• MR_BW: Throughput for the M_Recv case, in Mbits/sec.

Sender: K6-AMD
    Receiver       M_Sent   M_Recv   MS_BW    MR_BW
    K6-AMD         N/A      N/A      N/A      N/A
    Pentium-II     31116    31109    203.89   203.85
    Pentium-Pro    31043    31043    203.42   203.42

Sender: Pentium-II
    Receiver       M_Sent   M_Recv   MS_BW    MR_BW
    K6-AMD         53196    24789    348.58   162.44
    Pentium-II     52728    52706    345.53   345.38
    Pentium-Pro    53176    53124    348.17   347.83

Sender: Pentium-Pro
    Receiver       M_Sent   M_Recv   MS_BW    MR_BW
    K6-AMD         40335    40279    264.11   263.74
    Pentium-II     40260    40219    263.68   263.41
    Pentium-Pro    N/A      N/A      N/A      N/A

Table 3.3: UDP_STREAM Testing Using netperf

From Table 3.3, the Pentium-II is the fastest sender, at about 347Mbits/sec, compared to the sending speed of the K6-AMD, which is around 203Mbits/sec, or of the Pentium-Pro, which is around 264Mbits/sec. Considering the receiving bandwidth, the Pentium-II and the Pentium-Pro receive at the same speed (around 203Mbits/sec) when the K6-AMD is the sender. The Pentium-Pro and Pentium-II both receive at about 346Mbits/sec, twice the rate of the K6-AMD receive bandwidth, when the Pentium-II is the sender. The Pentium-II and the K6-AMD receive at the same speed as the sending rate of the Pentium-Pro.

On turning off the checksums, in Table 3.4, both the send and receive bandwidths increase on some of the processors.
However, the relative order of the processors in Table 3.4 is similar to that in Table 3.3, except that the K6-AMD receives better from the Pentium-II in the latter case.

Sender: K6-AMD
    Receiver       M_Sent   M_Recv   MS_BW    MR_BW
    K6-AMD         N/A      N/A      N/A      N/A
    Pentium-II     33622    33615    220.34   220.29
    Pentium-Pro    33904    33904    222.02   222.02

Sender: Pentium-II
    Receiver       M_Sent   M_Recv   MS_BW    MR_BW
    K6-AMD         52125    52125    341.58   341.58
    Pentium-II     52766    52631    345.79   344.90
    Pentium-Pro    52274    52256    342.54   342.42

Sender: Pentium-Pro
    Receiver       M_Sent   M_Recv   MS_BW    MR_BW
    K6-AMD         42889    42877    280.82   280.74
    Pentium-II     42794    42781    280.36   280.27
    Pentium-Pro    N/A      N/A      N/A      N/A

Table 3.4: UDP_STREAM Testing Using netperf (with checksums turned off)

It is clear that the Pentium-II can send data faster than both the Pentium-Pro and the K6-AMD, and can receive slightly faster than, or at the same rate as, the latter two processors.

TCP_STREAM Testing Using netperf

Table 3.5 gives the results for a TCP_STREAM test, where the data is sent over an established stream between the two ends. The experimental setup for the TCP_STREAM testing is:

• Recv Buffer Size = 65536 bytes
• Send Buffer Size = 65536 bytes
• Send Message Size = 65536 bytes

Sender \ Receiver (Mbits/sec)   K6-AMD   Pentium-II   Pentium-Pro
K6-AMD                          N/A      193.83       193.13
Pentium-II                      175.71   312.41       317.45
Pentium-Pro                     178.62   290.57       N/A

Table 3.5: TCP_STREAM Testing Using netperf

Both the Pentium-II and the Pentium-Pro perform similarly when the K6-AMD is the sender. Both the Pentium-Pro and the Pentium-II perform significantly better than the K6-AMD (about 80% better) when the Pentium-II is the sender, as with the UDP_STREAM testing when checksums are not turned off. The Pentium-II also performs about 63% better than the K6-AMD with the Pentium-Pro as the sender, which is different from the results with the UDP_STREAM testing.

Evaluation

netperf was run focusing on the UDP_STREAM. Many applications which actually process data are going to have throughput numbers similar to what one gets with a UDP_STREAM.
The better the UDP_STREAM number, the better a system is at processing data while running a Myrinet card at the same time.

Both PCs and Alphas (using results from Duke University) have problems with optimally executing the UDP_STREAM while running the Myrinet card. Alphas have limited hswap numbers; PCs have bad bw_mem_cp numbers. Because of this, neither is an ideal choice of platform on which to run Myrinet. An ideal platform will have excellent hswap, bw_mem_cp, and combined hswap/bw_mem_cp bandwidths. A platform like this is capable of running the Myrinet card and processing data at the same time, and will indicate this capability with good UDP_STREAM send and receive numbers.

Cost constraints of this project limited the choice to PCs. Both the Pentium-II and the Pentium-Pro have equivalent bw_mem_cp/hswap results. The Pentium-II has slightly better copy bandwidths than the other processors. Although the Pentium-Pro receives at the same rate as the Pentium-II, the latter sends at a much faster rate than the former, as per the results obtained with the network performance evaluation tool. These results lead to the Pentium-II being the best choice.

3.3 Layout of the Myrinet Network Testbed

The complete Myrinet layout is shown in Figure 3.2. Each of the 16 Pentium-IIs, with 128MB RAM, 512KB cache and a 440FX chipset, numbered 1 through 16 in the figure, has an M2M-PCI32C LANai interface. The LANai boards on the computers access the Myrinet network via M2M-CB-05 SAN cables. These are 5-foot cables which connect the LANai interfaces to M2M-DUAL-SW8 dual 8-port SAN switches via M2M-Y splitters.

Figure 3.2: Layout of Myrinet Network (two M2M-Dual-SW8 dual 8-port SAN switches, switch 0 and switch 1, joined by M2M-X A-B links, with M2M-Y splitters fanning the remaining ports out to the sixteen hosts)

The specifications of the Myrinet switches, interfaces, cables and splitters are briefly reviewed here, but their details can be found in [Myr97b, Myr96, Myr97e].

3.3.1 M2M-DUAL-SW8 Myrinet Switch

This switch is designed for an in-cabinet cluster to be deployed with M2M-PCI32C interfaces and Myrinet SAN cables. A single dual 8-port Myrinet SAN switch houses two 8-input/output, full-crossbar Myrinet-SAN switches, as shown in Figure 3.3. Each channel operates at 1.28 Gb/s, providing a total bisection data rate of 10.24 Gb/s for each switch.

Figure 3.3: Top View of Dual 8-Port Myrinet-SAN Switch (eight ports, 0 through 7, each carrying an A and a B link)

The 8 ports support a 1.28+1.28 Gb/s data rate on each of the A and B links. The A and B links may be split onto separate cables using the M2M-Y splitters, as on ports 4, 5, 6, and 7 of Figure 3.2. The A and B switches can also be connected using an M2M-X. This is called an A-B link because it is used to create a link between the A and B switches at the SAN connectors of the M2M-Dual-SW8 switch, as on ports 0 and 1 of switch 0 and ports 2 and 3 of switch 1, shown in Figure 3.2.

In the network shown in Figure 3.2, computers 1, 3, 5, 7 are connected to the internal switch A of switch 0, computers 2, 4, 6, 8 to the internal switch B of switch 0, computers 10, 12, 14, 16 to the internal switch A of switch 1, and computers 9, 11, 13, 15 to the internal switch B of switch 1. The cluster connection can be viewed as shown in Figure 3.4.
Figure 3.4: Myrinet Cluster Connections

3.3.2 M2M-PCI32C Myrinet-SAN/PCI Interface

This host interface contains the LANai 4 processor which runs the MCP. Its main characteristics are:

• A 32-bit PCI-bus interface operating at 33MHz. Its DMA performance at 33MHz with large blocks exceeds 120MB/s in peer mode and, with good PCI implementations, to/from system memory.
• Interface memory of 1MB, operating at double the PCI-clock rate.
• An interface processor which is a LANai-4 RISC, operating at the PCI-clock rate (33MHz), explained in detail in Section 3.1.1.

3.3.3 M2M-CB-05 Cables

These 5-foot cables are used to connect the interfaces to the switches. Each SAN cable carries two channels, which are full-duplex and have a data rate of 1.28Gb/sec each.

3.4 Language Support

This section discusses Emerald's features that support mobility, and the original structure of its interpreter before changes were made to its design for running within the LANai.

3.4.1 Emerald Primitives to Support Mobility

Mobility in Emerald [JLHB88] is provided by a small set of language features, discussed below:

1. Move: move X to Y moves the object X to Y if Y is a node, or moves X to the node where Y is located if Y is an object.
2. Locate: locate X returns the node where object X is located.
3. Fix: fix X at Y moves X to Y and fixes it there, i.e., makes it non-mobile.
4. Unfix: unfix X makes a fixed object mobile again.
5. Refix: refix X at Y unfixes the object from where it is currently located, moves it to Y, and fixes it there.

These routines are a complete list of the functions one would need to move objects around. A main concern in the move routine is deciding how much of the object's state, in addition to its activation records, to move to the target when the object migrates.
If the object's state contains an array of other objects, (a) should the entire array be moved, or (b) should the individual objects of the array be moved only when invoked on the remote site? The designers of Emerald introduced the attached qualifier in the language, to be used by the programmer in order to make explicit requests to move certain data objects. For example, if the object:

    object myobject
        attached var myDirectory: Directory
        var myOtherDirectory: Directory
    end object

is moved, myDirectory has to be moved as well, because it is declared to be attached, as opposed to myOtherDirectory, which may or may not be moved depending upon the compiler's decisions.

Parameter passing is implemented by the call-by-object-reference technique. Moving an argument object when an object moves depends on (i) the size of the argument object, (ii) whether it will be referenced frequently enough at the remote site, and (iii) how costly it is to move it to the remote site. The Emerald compiler tries to make these decisions. For example, small immutable objects such as strings or integers should be moved rather than being referenced remotely. The application program can assist the compiler in arriving at the appropriate decisions by specifying whether it would like a particular object to be moved to the remote site.

The above review, though it does not do complete justice to all the features of Emerald, supports the idea that mobility, or service migratability, can be easily achieved using Emerald as a base framework.

Figure 3.5: Emerald Virtual Machine (a listener thread and per-connection reader threads place incoming packets in a process queue, which is drained by a pool of worker threads)

3.4.2 Structure of the Emerald Interpreter

The host-based Emerald interpreter (named host-emx) is a multi-threaded distributed object-oriented virtual machine.
It consists of one big C program which uses the standard libc, as well as other libraries, to handle system calls and communication over the network via sockets. When it starts, the virtual machine does basic initialization, loads all the builtin types from files, and builds objects of the loaded-in types. It then creates a worker pool of threads and a single listener thread, as shown in Figure 3.5. The listener thread listens for new connections. On receiving a request from a new node, it creates a new reader thread which has its own TCP socket stream connection to the new node. The listener creates a new reader for every connection. For example, if there are four nodes in the Emerald world, each node will have three reader threads.

The reader threads listen for any incoming packets on their streams. They put the incoming messages in the process queue, which uses the FIFO technique to send them to the worker pool. At any time, there should be at least one worker thread, so as to avoid deadlock. When a worker thread receives a new message, it checks whether the object to which the message is to be delivered is resident. If the object has moved, the worker redirects the message to wherever the local node thinks it has moved. Otherwise it processes the incoming message and, if needed, also handles sending out new messages.

The virtual machine also has a garbage collector to clean up any stale objects. The virtual machine divides its memory area into three main blocks to handle malloc and garbage collection:

• MALLOC memory: This area is used to allocate space for data that is not garbage collected, i.e., for the thread control blocks, stacks, tables, sockets, file descriptors, etc. of the different threads running in the virtual machine.

• GC memory: Emerald uses the generational garbage collection mechanism. There are two main blocks: (a) NEWGC, and (b) OLDGC. The NEWGC memory area is further divided into two blocks, NEWGC1 and NEWGC2.
When objects are first created, they live in one of these blocks. When the garbage collector is called, it handles either NEWGC1 or NEWGC2. It cleans up unreferenced objects and moves the recently created referenced objects to the other block. After a few garbage collections, the old objects which have moved back and forth between NEWGC1 and NEWGC2, and are still being referenced, are promoted to the OLDGC block. Objects moved to OLDGC may be garbage collected if they remain unreferenced, but at a less frequent rate than NEWGC1 or NEWGC2, because they are less likely to die.

3.5 Summary

This chapter reviewed the Myricom details necessary for understanding the design and implementation of lanai-emx (running on the LANai), with particular focus on the LANai 4 processor and the Myrinet API. This was followed by the results of the experimental benchmarks supporting the choice of the Pentium-II for the Myrinet cluster. The Myrinet layout illustrated the network that was constructed and the different connectors and switches involved. The overall structure and the high-level implementation of host-emx were discussed to explain the changes necessary for porting it to the LANai. The discussion of the interpreter will help in understanding the changes necessary for execution on the LANai. The next chapter discusses the design and implementation changes needed for running within the 1MB of memory on the LANai. The main changes include the multi-threaded distribution package of the interpreter, the communication protocols, and the host/LANai interface necessitated by the lack of an OS on the LANai, among other intricate implementation details.

Chapter 4

Design and Implementation of Emu

Emu consists of two implementations of the Emerald virtual machine: lanai-emx, running on the LANai; and host-emx, running on the host.
The lanai-emx on the network adaptor can communicate with other interpreters, either on the local host via EBUS <-> LBUS transfers, or on some other host via the Myrinet network. In other words, the lanai-emx running on the network adaptor can do anything that the host-emx can. For example, it can process messages, build objects, execute the active objects, and garbage collect.

The layout of the chapter follows. Section 4.1 discusses the key issues to be considered for running the Emerald interpreter within the LANai. This is followed by Section 4.2, which discusses the data structures and protocols designed and implemented in lanai-emx to obtain the support of the host, given the lack of an OS. Sections 4.3 and 4.4 discuss the communication protocols and the new routing information protocol in lanai-emx, respectively. Section 4.5 discusses various miscellaneous issues in the design of Emu, and Section 4.6 concludes.

4.1 Key Issues for Moving the Emerald Interpreter to the LANai

This section discusses the main issues involved in running the Emerald virtual machine within 1MB of SRAM on the LANai. The first two issues could be handled by implementation changes. The distribution package had to be completely redesigned, because the environment of the LANai is different from what one would expect of a regular processor. Each of the issues is concerned with resolving incompatibilities with the new environment.

4.1.1 Limited Memory

The Emerald interpreter on the LANai must execute within 1MB of SRAM. This is possible only if the Emerald code, the shared memory between the host and the LANai, the stack, the heap, and the garbage collection memory space all fit in this 1MB. The memory layout of lanai-emx in the 1MB of RAM on the LANai is shown in Figure 4.1. The problems faced in mapping the various parts of the interpreter to the LANai SRAM, due to the limited memory, will now be described.
The LANai environment does not provide any debugger, such as gdb, or even printing to standard output, such as simple printf stubs. Hence, the initial mapping of the different parts of the interpreter to LANai SRAM was done by trial and error. When the host loads the MCP into the LANai, handshaking between the host and the LANai must complete successfully. This is necessary for the LANai and the host to communicate via shared memory, and for any sort of debugging protocol to be built. A considerable amount of time was spent in successfully completing the startup calibration between the LANai and the host, which could have been saved if some debugging support had been available.

The lanai-emx running within the LANai is a highly optimized version, with a code size of about 209 KB without debugging options (emx), and about 285 KB with debugging help (emx_debug), as shown in Figure 4.1. Shared memory starts at 308 KB to provide for proper execution of either version of the lanai-emx MCP. With the help of a short program, it was found that the stack grows to about 3 KB to 4 KB, and the shared memory between the host and the LANai is about 56 KB. An additional 20 KB was added to provide for nested activation records; hence 76 KB was allocated in total for the shared memory and the stack. The remaining space was partitioned into about 256 KB for the heap, 64 KB for the new generation garbage collection, and 320 KB for the old generation garbage collection, as shown in Figure 4.1.
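This partitioning can be sanity-checked arithmetically against the region boundaries of Figure 4.1 (the addresses are taken from the figure; the constant names are illustrative):

```c
/* Region boundaries on the 1MB LANai SRAM, from Figure 4.1. */
enum {
    LANAI_MEM_TOP = 0x100000,  /* 1 MB of SRAM */
    OLDGC_BASE    = 0xb0000,   /* old generation GC */
    NEWGC_BASE    = 0xa0000,   /* new generation GC */
    HEAP_BASE     = 0x60000,   /* heap/malloc */
    SHARED_BASE   = 0x4d000,   /* shared memory + stack (stack grows down) */
    EMX_DEBUG_END = 0x47280,   /* end of emx_debug code (~285 KB) */
    EMX_END       = 0x34300    /* end of optimized emx code (~209 KB) */
};

/* Derived region sizes in KB. */
unsigned old_gc_kb(void)  { return (LANAI_MEM_TOP - OLDGC_BASE) / 1024; }
unsigned new_gc_kb(void)  { return (OLDGC_BASE - NEWGC_BASE) / 1024; }
unsigned heap_kb(void)    { return (NEWGC_BASE - HEAP_BASE) / 1024; }
unsigned shared_kb(void)  { return (HEAP_BASE - SHARED_BASE) / 1024; }
```

old_gc_kb(), new_gc_kb(), heap_kb(), and shared_kb() come out to 320, 64, 256, and 76 KB respectively, matching the text, and 0x4d000 is exactly the 308 KB mark at which shared memory starts.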
0x00000 - 0x34300    lanai-emx MCP code (emx; the emx_debug version extends to 0x47280)
0x4d000 - 0x60000    shared memory between the host and the LANai, with the stack growing down from the top of the region
0x60000 - 0xa0000    memory for heap/malloc on lanai-emx
0xa0000 - 0xb0000    memory for new generation garbage collection on lanai-emx
0xb0000 - 0x100000   memory for old generation garbage collection on lanai-emx

Figure 4.1: Memory Layout on the LANai

4.1.2 Library Support

Emerald makes extensive use of the standard C library, libc. The LANai, being a different environment with no OS or debugging support, is an unfriendly programming environment. It does not support the standard libc routines. The original
• Basic structures, defines and typedefs had to be provided in addition to routines from libraries.  4.1.3  D i s t r i b u t i o n and M o b i l i t y  In the original Emerald interpreter, distribution implementation depends on userlevel threads.  But there is no builtin thread utility for the L A N a i .  It seemed  better to reimplement the multi-threaded package as a single thread in the L A N a i considering the memory limit of the network adaptor. Redesigning distribution from a multi-threaded model to a single processor model was the most difficult feature of porting the interpreter to the L A N a i . 52  The original multi-threaded Emerald interpreter, explained in Section 3.4.2, provides a conceptual understanding which facilitated coding. However, restructuring this multi-threaded system as a single process required a completely different understanding of the interpreter in order to make the different calls at the proper times. Achieving mobility and distribution for the V M on the L A N a i completed the implementation and porting of the lanai-emx.  4.2  Interface to the Host  In order to execute on the network processor, the lanai-emx needs assistance from the host. The protocols implemented provide OS services, due to the lack of an OS, and debugging support on the L A N a i . This is followed by a discussion of the shared memory structure for communication between the host and L A N a i and the different services that rely on it. Each of the issues discusses the support provided by the L A N a i with the help of the host.  4.2.1  OS-Support Protocol  The Emerald interpreter was designed to run in a Unix environment. It uses files and system calls. On initialization, Emerald interpreter loads a huge (about 120 K B ) Builtins fi\e to build the standard objects (note that everything in Emerald is an object). 
It also uses system calls to get the IP address of the local host, which forms part of the object identifier of all objects resident on the local machine.

The LANai does not support inodes or vnodes as in a Unix environment. There is no file system on the LANai. So the OS-Support protocol was designed and implemented to handle opening/reading/closing of files with the help of the host. But the Builtins file is too big to be loaded all at once into the LANai memory. Therefore, parts of the file are read and objects are built at every stage before reading the next part of the file.

The OS-Support protocol was extended to handle system calls such as gethostname(), gethostbyname(), etc., due to the lack of an OS on the LANai. The MCP running on the LANai has the illusion that the calls are satisfied by the LANai processor. It was also modified to handle basic information exchange between the LANai and the host. In order to make it possible for the host to assist the LANai in satisfying the above mentioned calls, the OS-Support protocol on the lanai-emx and the host-emx of the local host communicate via a shared structure shown in Figure 4.2. In short, this protocol can be extended as necessary to enlist the host's help wherever the LANai is unable to act by itself.

typedef volatile struct _MCP_E2L_info {
    int          req_flag;          /* query to be satisfied */
    int          fd;                /* file descriptor for file calls */
    unsigned int nbytes;            /* number of bytes */
    int          result;            /* result of the call */
    char         name[50];          /* name of file or hostname, etc. */
    char         buffer[MAX_SIZE];  /* buffer to return the reply */
} MCP_E2L_info;

Figure 4.2: The Host Interface Protocol Memory Structure

The routine in Figure 4.3, which is added to host-emx, assists the lanai-emx in handling various calls due to the lack of OS support on the LANai.

1.
File requests: Cases 1, 2 and 3 handle opening, reading and closing of files respectively for the lanai-emx.

2. gethostname()/gethostbyname() requests: When an Emerald machine comes up, it tries to get its hostname and other details to set or define its node address. Because the LANai interpreter cannot satisfy system calls such as gethostname()/gethostbyname(), it enlists the help of the host interpreter in order to successfully complete those requests. The host VM periodically calls the function whose pseudocode is shown in Figure 4.3 to check whether these system requests (cases 4, 5 and 6) have been issued from the LANai.

void handleReqFromLANai()
{
    access the MCP_E2L_info shared structure of the LANai;
    switch (req_flag of the shared structure) {
        case 0: no request to satisfy;
        case 1: open file specified and return its file descriptor;
        case 2: read from file as specified;
        case 3: close file;
        case 4: satisfy gethostname() request;
        case 5: satisfy gethostbyname() request;
        case 6: satisfy cases 4 and 5 in one go;
        case 7: get rootnodeid;
    }
    update req_flag;
}

Figure 4.3: Pseudocode for Handling Requests from the LANai by the Host

3. Information exchange: Basic information such as the rootnodeid for the lanai-emx is handled by case 7.

4.2.2  Print/Debug Protocol

The LANai does not provide any support for printing to standard output. Debugging support was essential to complete porting of the stand-alone interpreter to the LANai. A new data structure, MCP_L2E_info, was introduced in the shared memory between the LANai and the host to assist with writing to the standard output. It is in the form of a circular array. No locking is needed to access this shared block of memory.
This is because the LANai only writes to the buffer and updates the tail of the circular queue, and the host only reads from the buffer and updates the head of the queue.

4.2.3  Shared Memory Structure Between LANai and Host

The structure shown in Figure 4.4 is the memory on the LANai which can be directly accessed by the host. It plays the major role in handling any type of communication, such as sending/receiving messages, handshaking, etc., between the LANai and the host. It consists of data structures for different services such as debugging, host interface support, etc.

typedef volatile struct _MCP_Shared_mem_ {
    int            handshake;
    int            address;
    MCP_Route      route;
    MCP_Send_info  send_info;
    MCP_Recv_info  recv_info;
    MCP_L2E_info   debug_info;
    MCP_E2L_info   host_interface_info;
    int            dma_burst;
    int            staging_buffer;
} MCP_Shared_mem;

Figure 4.4: The MCP Shared Memory Structure

The different services that this shared memory structure helps support are discussed below:

• Handshaking between the host and the LANai is achieved by setting the handshake attribute.

• Routing is done using MCP_Route, which consists of the hop length and switching information in an array.

• MCP_Send_info is used to send messages to the LANai from the host. It consists of a pointer to the buffer, the buffer length and a ready_bit.

• MCP_Recv_info is used by the host to receive messages from the LANai. Its structure is similar to that of MCP_Send_info.

• Debugging is done with the help of MCP_L2E_info. This structure handles messages from the LANai to the host, hence the name. This shared structure played an important role in porting the interpreter, as there was no other means of debugging on the LANai.

• MCP_E2L_info is the structure that handles file requests, gethostname() and gethostbyname() requests, etc. The host does the appropriate work based on the req_flag value, as explained in Section 4.2.1.
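The lock-free discipline just described — the LANai alone advances the tail, the host alone advances the head — can be sketched as a single-producer/single-consumer ring. The struct below is a simplified stand-in for MCP_L2E_info; its field names, slot count and message size are illustrative assumptions, not the actual layout:

```c
#include <string.h>

#define QSIZE 8              /* illustrative; one slot always stays empty */

/* Simplified stand-in for the MCP_L2E_info circular array. */
typedef struct {
    volatile unsigned head;  /* advanced only by the host (consumer)   */
    volatile unsigned tail;  /* advanced only by the LANai (producer)  */
    char slot[QSIZE][64];    /* one debug message per slot             */
} debug_queue;

/* Producer side (LANai): returns 0 if the queue is full. */
int dq_put(debug_queue *q, const char *msg)
{
    unsigned next = (q->tail + 1) % QSIZE;
    if (next == q->head)
        return 0;                       /* full */
    strncpy((char *)q->slot[q->tail], msg, 63);
    q->slot[q->tail][63] = '\0';
    q->tail = next;                     /* publish only after the copy */
    return 1;
}

/* Consumer side (host): returns 0 if the queue is empty. */
int dq_get(debug_queue *q, char *out)
{
    if (q->head == q->tail)
        return 0;                       /* empty */
    strcpy(out, (const char *)q->slot[q->head]);
    q->head = (q->head + 1) % QSIZE;
    return 1;
}
```

Because each index is written by exactly one side, and an index is advanced only after the data it publishes is in place, no lock is needed on this shared block of memory.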
4.3  Communication Protocols

Communication protocols of the original Emerald interpreter were redesigned in order to properly execute within the LANai. lanai-emx listens for messages not only from the network but also from the host-emx. Also, the original Emerald interpreter was designed for TCP/IP using sockets. This was redesigned for the Myrinet network due to the lack of socket support in the LANai environment. The following topics are covered to give an understanding of the different factors involved in building the communication protocol for lanai-emx.

• Design of communication protocol in lanai-emx
• Communication from LANai to Host
• Communication from LANai to Network
• Communication from Host/Network to LANai
• Reading/Writing data to the Myrinet network

4.3.1  Design of Communication Protocol in lanai-emx

The design of the communication protocol for the lanai-emx is portrayed in Figure 4.5. The lanai-emx supports message exchanges between (a) the LANai and the network, (b) the host and the network, and (c) the host and the LANai. This adds to the complexity of the lanai-emx when compared to the original Emerald interpreter.

[Figure 4.5 shows the host, the Myrinet network adaptor running lanai-emx, and the switch, with message paths between them, including messages to objects in each interpreter's own address space.]

Figure 4.5: Communication Protocol of the VM on the LANai

The communication control support provided by lanai-emx has a more complex mux/demux routine for sending and receiving messages, either from the network or the host, when compared to a standard MCP. lanai-emx cannot just assume that messages from the network are for the host, because the messages can also be for some object in the LANai address space. A similar problem exists for messages from the host.
This difficulty was handled by modifying the packet header of Emerald packets, and all the marshalling and unmarshalling routines, in order for the mux/demux function to route packets properly.

4.3.2  From LANai to the Host

The function sendToLocalhost() handles writing out messages from the LANai to the host. It sends messages using MCP_Recv_info of the shared memory structure discussed in Section 4.2.3. Its pseudocode is shown in Figure 4.6.

void sendToLocalhost(void *buffer, int message_len)
{
    wait till send_bit is ready;
    setup DMA of buffer to the host;
    wait for DMA to complete;
    inform host that there is a message waiting for it;
}

Figure 4.6: Pseudocode for Sending Packets from LANai to Host

4.3.3  From LANai to the Network

The function sendToNetwork() is called by processIncomingMessages(), discussed in Section 4.3.4, to route packets from the host to the network. It is also used by vmThreadSend(), which handles outgoing messages from the LANai address space. vmThreadSend() of lanai-emx has been modified to write out messages to the Myrinet interface (via sendToNetwork()) instead of its original sending of objects over the Ethernet interface. The pseudocode of sendToNetwork() is given in Figure 4.7.
void sendToNetwork(void *buffer, int message_len, in_addr dest_ip)
{
    (Note that the ready_bit of LANai->network need not be checked,
     as this function returns only when the DMA from LANai->network
     completes)

    find the route from me to dest_ip;
    if (no route exists)
        return error;
    else {
        write the route to dest_ip to the network;
        write the packet header type to the network;
        write message_len to the network;
        setup the DMA of the buffer to the network;
        wait for the DMA to finish;
        return;
    }
}

Figure 4.7: Pseudocode for Sending Packets from LANai to Network

4.3.4  From Host/Network to the LANai

The function processIncomingMessages() handles all incoming messages from both the network and the host. This function gets called periodically by the main loop of the lanai-emx. Its pseudocode is outlined in Figure 4.8.

void processIncomingMessages()
{
    /* From host to the LANai */
    if (the ready_bit of host->LANai is set) {
        receive message from the host;
        extract its header;
        if (the message is for me)
            store it in the receive buffer for processing;
        else /* it is for the network or some other VM */
            DMA message to the network (Figure 4.7);
        clear the ready_bit of the host->LANai transfer;
    }

    /* From network to the LANai */
    if (the ready_bit of network->LANai is set) {
        DMA message from the network;
        extract its header;
        if (the message is for me)
            store it in the receive buffer for processing;
        else /* it is for the host */
            DMA the message to the host (Figure 4.6);
        clear the ready_bit of the network->LANai transfer;
    }
}

Figure 4.8: Pseudocode for Processing Incoming Packets to the LANai

4.3.5  Reading/Writing from/to the Myrinet Network

The pseudocode for reading a packet from the Myrinet network is shown in Figure 4.9 and writing a packet to the Myrinet is shown in Figure 4.10.
Understanding of the LANai registers and flags was necessary in writing code for accessing the Myrinet network. However, low level details are beyond the scope of this section.

void readPacketFromMyri()
{
    /* the following is executed only after packet checking */
    if a packet has arrived {
        read RH register for packet header;
        if (it is not my header type) {
            consume packet and throw it away;
            return;
        }
        read RW register for the packet length;
        /* a packet should have at least 1 word */
        if (length <= 0) {
            consume it and throw it away;
            return;
        }
        use RMP and RML registers to setup DMA
            from network to LANai memory;
        wait for DMA to complete;
        read the CRC; /* the tail of the packet */
        report CRC errors if any;
        if (tail bit was not received)
            consume extra bytes and continue;
    }
}

Figure 4.9: Reading Packets from the Myrinet Network

The function readPacketFromMyri() is used by processIncomingMessages(), discussed in Section 4.3.4. The function sendToNetwork(), discussed in Section 4.3.3, uses writePacketToMyri() to write packets out to the network.

void writePacketToMyri()
{
    wait until the interface says we can send;
    /* the packet header consists of the route determined plus a tag */
    use the SB registers to write out the route->path,
        one hop at a time;
    write the calculated tag in the SH register;
    write out the packet length in the SW register;
    setup LANai->myrinet DMA using the SMP and SMLT registers;
    wait for DMA to complete;
}

Figure 4.10: Writing Packets to the Myrinet Network

4.4  Message Routing

A standard MCP, or a program executing on a network adaptor, sends messages from the host to the network, or from the network to the host. The lanai-emx performs processing of the messages, in addition to just routing them between the host and the network. Hence, the routing function of the lanai-emx is required to be more intelligent than that of the original Emerald interpreter, because messages from the network or from the host could be for the objects in its own address space.

As discussed in Section 4.3.4, the router part of the lanai-emx extracts headers from incoming packets, either from the host or from the network, in order to properly route each message to the indicated destination object. The destination object might be in its own address space, in which case the message is placed in its incoming buffer. Otherwise, the message is routed to the network or the host as appropriate. Similarly, outgoing packets from the LANai are handled appropriately, depending on the route, as discussed in Sections 4.3.2 and 4.3.3.

The routing protocol implemented is a modified version of the routing code obtained from the BDM package [HDMS97] at Mississippi State University. The original version would read the connections of the computers to the switch ports from a static file. As the OS-Support protocol implemented requires the assistance of the host, using it would mean host assistance for routing every single packet, and the purpose of the entire porting of the interpreter to the LANai — to offload the host and save CPU cycles — would be lost. The OS-Support protocol had to be designed to support file handling in order to load the Builtins file initially. However, other protocols which use files frequently, and hence need constant host assistance, were redesigned.
The following sections discuss the data structures and routing function implemented to route packets by the shortest path through the network to the appropriate destination.

4.4.1  Data Structures Used for Routing

The static file information describing the network connections was stored in a static array. The modified switch_info data structure used to store the information for each switch is shown in Figure 4.11, and the switch_conn structure it uses follows in Figure 4.12.

typedef struct {
    int          num_ports;
    unsigned int hosts[MAX_PORTS];
    switch_conn  switches[MAX_PORTS];
    int          visited;
} switch_info;

Figure 4.11: Data Structure to Store Information for Each Switch

typedef struct {
    int switch_num;
    int port_num;
} switch_conn;

Figure 4.12: switch_conn Data Structure

4.4.2  Routing Information Protocol

The lanai-emx calls the function in Figure 4.13 to find the routing path to the destination node. On return, route->path stores the shortest routing path from host_ip to dest_ip, and route->length stores the number of hops to traverse to reach the destination. The routing function uses some basic routing code from the BDM package, but many functions had to be modified for the lanai-emx.

void Map(MCP_Route *route, unsigned int host_ip, unsigned int dest_ip)
{
    int my_switch, my_entry_port, hops, i;
    int tmp_route[MAX_ROUTE];

    use the static information array to find the switch to which
        the host_ip or local host is connected, and the port;
    if ((my_switch == -1) or (my_entry_port == -1)) {
        report my local host is not in the routing table;
        error;
    }
    else {
        hops = Find_route(tmp_route, my_switch, my_entry_port, dest_ip);
        if (hops == (MAX_ROUTE + 1)) {
            could not find connection from host_ip to dest_ip;
            report error;
        }
        /* tmp_route is backwards, so access it from end to front */
        for (i = 0; i < hops; i++)
            route->path[i] = ((tmp_route[hops - (i + 1)] & 0x3f) | 0x80);
        route->length = hops;
    }
}

Figure 4.13: Myrinet Routing Protocol

4.5  Other Issues

The following are miscellaneous issues encountered while designing Emu, and the changes made to the host-emx, both to assist the lanai-emx by providing OS support, and to communicate with other machines via the Myrinet network.

4.5.1  Emerald over Myrinet vs. Emerald over IP over Myrinet

Normally, an Emerald machine communicates with another machine via a TCP stream, using sockets for sending, receiving or listening for messages. But in a Myrinet cluster, the messages need to be packaged as Myrinet packets. The original Emerald packet headers did not need to specify the destinations, since the TCP/IP protocol took care of copying the destination address into its packet headers.

Emerald over IP over Myrinet would result in three levels of traversal when compared to the two levels in the case of Emerald over Myrinet. To satisfy the latter case, the Emerald packet headers had to accommodate an extra field to store the destination node. Changing the Emerald packet headers resulted in changing routines such as startMsg(), sendMsg() and forwardMsg(), which handle outgoing packets. Marshalling routines for assembling an outgoing packet and unmarshalling routines for an incoming packet had to be modified to support the added destination attribute.

4.5.2  Issues Concerning Hardware Resets

If the LANai is being bombarded with a lot of messages from the network, the board is sometimes hard reset by the network.
The two least significant bits of the TIMEOUT register (refer to Section 3.1.1) determine the timeout period within which the LANai chip must consume packets from the network. If the LANai chip fails to do so within this period, the NRES signal is asserted. This causes the NRES_ENABLE of the MYRINET register to reset the LANai chip. So far, in our application, this has not occurred as high priority is given to messages from the network.

4.5.3  Modifications to the host-emx

The Emerald interpreter on the host had to be modified to communicate with the lanai-emx. The host-emx is a modified version of the model described in Section 3.4.2. The following items discuss the changes to the original interpreter running on the host:

• Communication medium support: The host-emx of the Emu system was modified to listen for messages on the Myrinet interface rather than the Ethernet interface. It checks whether the ready_bit of the MCP_Recv_info shared structure is set for it to receive messages. If so, it copies the message to its incoming buffer and clears the ready_bit. For sending messages, vmThreadSend() of the host-emx is modified to set the ready_bit of the MCP_Send_info shared structure, in order to inform the interpreter on the LANai that a new message has to be DMAed from the host.

• Functionality to support lanai-emx: Support for lanai-emx with files, debugging, system calls and other basic information exchanges was discussed in Section 4.2.

• Modification to packet headers: As in the case of the lanai-emx, the headers had to be changed to accommodate the destination field for communicating with the Myrinet interface. The routines that use the packet headers had to be modified to support the new format.

• Byte-order: The LANai uses the big-endian format whereas the PCs use the little-endian format for the representation of integers.
This initially caused a lot of handshaking problems between the LANai and the host when trying to read/write the handshaking variable. It was, however, difficult to pinpoint this problem as there was no debugging tool at the commencement of the project.

• Heterogeneous processor environment: Initially, the Emerald interpreter on the host only had to process packets from other nodes running in a similar environment (Unix), while it now has to assist the LANai as well. Coordinating all these interactions required proper interleaving of function calls. For example, the host-emx has to assist with file support, debugging support, etc. for the lanai-emx in addition to coordinating with the LANai in sending and receiving messages via the Myrinet interface.

4.6  Summary

This chapter discussed the various issues involved in the design and implementation of Emu. It discussed the problems with running the Emerald interpreter in the LANai environment and the protocols implemented to resolve the different processing environment conflicts. Second, the communication issues for the new network, and the complexity of the mux/demux function, were explained. Third, the new routing functionality introduced by lanai-emx was discussed; the routing information protocol implemented was described at a high level, explaining the technique of routing involved in the Myrinet interfaces and switches. Finally, miscellaneous issues that arose during the design of Emu were discussed, including the changes needed to host-emx to communicate with the lanai-emx.

The Myrinet interfaces can now run an Emerald interpreter, thereby executing Emerald packets. Emerald objects circulate within the Myrinet network and can be processed either by the lanai-emx or the host-emx. Thus, an interpreter on a network adaptor can do much more than the simple routing program that is the traditional case.
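The byte-order mismatch behind the handshaking problems in Section 4.5.3 can be illustrated with a small sketch; swap32() below is a generic 32-bit byte swap, not code from Emu, and would be needed for any word-sized flag shared between the big-endian LANai and a little-endian host:

```c
#include <stdint.h>

/* Reverse the byte order of a 32-bit word, e.g. a shared handshake flag. */
uint32_t swap32(uint32_t x)
{
    return ((x & 0x000000FFu) << 24) |
           ((x & 0x0000FF00u) << 8)  |
           ((x & 0x00FF0000u) >> 8)  |
           ((x & 0xFF000000u) >> 24);
}

/* Detect the byte order of the processor this runs on. */
int host_is_little_endian(void)
{
    uint32_t probe = 1;
    return *(unsigned char *)&probe == 1;   /* low byte first? */
}
```

On a little-endian Pentium host, a flag value written as 1 by the LANai would otherwise be read back as 0x01000000, which is exactly the kind of silent disagreement that is hard to find without a debugging tool.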
Chapter 5

Performance

This chapter is concerned with the performance of the Emu system and the applications that benefit from the interpreter running on the LANai. Section 5.1 discusses the micro benchmarks run on Emu to measure the slow down of certain operations on the network processor when compared to the central processor of the computer. The results of the micro benchmarks support the expected claim that the LANai processor is better suited to communication intensive processing while the host processor is better suited to computation intensive processing.

Two applications were used to test the Emu system, or, to be more explicit, the usefulness of migrating functionality to the LANai. The first application, kilroy, is a standard Emerald program, and the second application, locator, is designed to locate services among the workstations in the cluster. The final section concludes with a discussion of the usefulness of Emu.

5.1  Micro Benchmarks

Two categories of micro benchmarks were used to compare the performance between the lanai-emx and the host-emx. The first category dealt with computation-intensive operations and the second concentrated on communication-intensive functions. The experimental framework of the system is described in detail in Section 3.3.

Operation            host-emx runtime   lanai-emx runtime   LANai slow
                     in microsec        in microsec         down factor
add (2)              2.7                15.8                5.9
multiply (7)         2.7                24.5                9.1
divide (436 / 17)    2.9                25.2                8.7
invoke               3.6                20.6                5.7
create               3.4                21.5                6.3

Table 5.1: Computation Intensive Micro Benchmarks

5.1.1  Computation Intensive Micro Benchmarks

The micro benchmarks under the first category are listed in Table 5.1. Each of the operations was run one million times and the averages are reported for accuracy. The LANai timings are measured using the host's clock because the LANai's clock does not report time accurately.
As each of the operations is repeated one million times, the slight overhead of contacting the host for the runtimes on the LANai can be ignored. The second and third columns of Table 5.1 report the runtimes of each operation on the host processor and the network processor respectively. The fourth column is the ratio of the runtime on the LANai to that on the host. The operations in the first column are discussed in detail below:

1. add (2): adding 2 to an integer object.
2. multiply (7): multiplying an integer object by 7.
3. divide (436 / 17): dividing object 436 by object 17.
4. invoke: invoking another object.
5. create: creating an object.

Note that the reason for the arguments 7 and 436 / 17, in the multiply and divide operations respectively, is to avoid any hardware support provided for arguments such as 1 or 2.

The slow down in the last column of Table 5.1 is almost as expected. The clock rate of the host (Pentium-II) processor is 266 MHz, and that of the LANai processor is 33 MHz [BCF+95], so the network processor is slower than the host processor by a little more than a factor of 8. Even though the clock rate ratio of the host to the LANai is 8, the measured slow down is in the range of 5.7 to 9.1 for the different operations. The slow down on the network processor cannot be avoided due to the lagging speed of the LANai processor compared to the host processor.

5.1.2  Communication Intensive Micro Benchmarks

Table 5.2 contains the results of communication intensive operations. Each of the operations was run one thousand times and the averages are reported. In the table, the acronym RTT stands for Round Trip Time. The communication intensive operations, and the results reported, are discussed below.
Communication times along the four paths: (i) from one host to another host via the Myrinet network, (ii) from the host to the local LANai, (iii) from the host to a remote LANai, and (iv) from one LANai to another LANai, are reported for each of the different operations in the first four rows. The fifth row reports the slow down on the LANai, which is the ratio of the communication time on the LANai-to-LANai path to that on the host-to-host path. The operations executed along all four paths are described below.

1. Move Obj RTT: move an object back and forth.
2. Move Self RTT: move the executing process object back and forth.
3. Remote Invoke: invoke a method within a remote object.

Path                    Move Obj RTT   Move Self RTT   Remote Invoke
                        in microsec    in microsec     in microsec
Host to Host (HH)       1854.7          2020.5         1041.2
Host to locLANai         936.8          5076.9          965.9
Host to remLANai        2418.2          6855.8         1721.3
LANai to LANai (LL)     3064.2         11827.4         2336.9
Slow down (LL/HH)          1.7             5.9            2.2

Table 5.2: Communication Intensive Micro Benchmarks

It is clear from the results that, as expected, communication intensive operations fare better than computation intensive operations. The slow downs reported in the last row are all less than a factor of 6. In fact, in the cases of moving another object or remote invocations, the slow downs are hardly noticeable considering the measurements are in microseconds. In such cases, avoiding host interrupts can be beneficial if the host is involved in a heavy computation intensive workload. It is difficult to give the exact reasons for this almost negligible slow down in the two cases move obj RTT and remote invoke. For a network processor that is slower than the host processor by a factor of 8, a slow down in the range of 1.7 to 5.9 was quite unexpected.
A reasonable explanation for the move self RTT slow down being greater than in the other two cases is that it is more computation intensive: in the case of moving the currently executing process object, all its activation records also need to be moved to the remote node. This extra computation, of figuring out the state of the executing object to be migrated, is not required in the other two cases.

5.1.3  Discussion

The micro benchmarks support the thesis statement in Chapter 1 that communication intensive components of an application are better off on the network processor when extensive computation intensive processing, which can occur on the host processor, is also needed. The lagging clock rate of the LANai processor makes it unsuitable for computation intensive processing.

5.2  Application 1: Kilroy

kilroy is a simple Emerald program which tests whether an object that is started on one node can migrate to all other nodes in the Emerald world and return to the originator successfully. This application is useful for testing whether each node has state about every other node in the Emerald world. The scenario shown in Figure 5.1 is just one particular path taken by kilroy in traversing the Emu world. For clarity, it is limited to three nodes in the figure. There were two reasons why kilroy was run in the Emu system.

Figure 5.1: Kilroy on Emu

First, kilroy tests the robustness of Emu. In the Emu world, there are two interpreters on each host: one on the host and the other on the NIC. So, kilroy on Emu will run on 32 Emerald virtual machines in the 16-node Pentium-II cluster. This process object successfully visited all nodes in the Emu system and validated the communication protocols used.

Second, a process object which migrates from one node to another provides the framework needed for active networking applications. The application can be easily modified to visit only the lanai-emxs of Emu.
As the Myrinet network is dynamically reconfigurable, kilroy can be modified to do dynamic routing in the LANai without host intervention. As the network increases in size and complexity, kilroy can modify the routing tables in the SRAM. In this way, it extends the static routing function in the LANai for dynamic routing. Emu makes this possible because it is designed to be extended at runtime, unlike systems such as Hamlyn [BJM+96] or U-Net [vEBBV95].

5.3  Application 2: Location Service

This application was designed and implemented to provide a generic location service to different types of applications. The following sections discuss the design of the service, the experimental setup, the efficiency of migrating the locator to the LANai and examples of applications that might benefit from it.

5.3.1  Design of the Location Service

The location service provides an API for applications that need to locate services in some part of the distributed environment. The location service is provided by two classes, locator and dataFinder, listed in Figures 5.2 and 5.3 respectively. locator is responsible for maintaining information about the services available at the local node. dataFinder is the class that is invoked by applications to find the location of particular services, either on the local node or on remote nodes. This requires dataFinder to maintain a list of all of the locators existing in the cluster. The locator is a monitored class which provides mutual exclusion to protect its shared state when different dataFinders try to access or modify the state of a particular locator at the same time.
It provides the basic operations to:

• add a service
• remove a service
• check whether a particular service is available at the local node
• be remotely invoked by other objects when requesting a remote service

const locator <- monitor class locator
   attached const local <- Set.of[Service].create[320]
   export operation add[ind: Service]
      local.add[ind]
   end add
   export operation remove[ind: Service]
      local.remove[ind]
   end remove
   export operation objReq[ind: Service] -> [result: Boolean]
      result <- local.contains[ind]
      if result then
         local.remove[ind]
      end if
   end objReq
   export operation isitHere[ind: Service] -> [result: Boolean]
      result <- local.contains[ind]
   end isitHere
end locator

Figure 5.2: The Locator Service Class

The dataFinder is the class that is called by applications wishing to locate services. When requested to find a particular service, it checks whether the service is available locally. If it is locally available, no remote operations are required. If the requested service is not locally resident, dataFinder serially multicasts requests to the other locators in the cluster until it finds the service. It reports failure to the application if none of the nodes provide the requested service.

The dataFinder class, which exports one operation, could easily have been merged with the locator class, as in the initial design of the location service. The reason for separating the location service into two classes is to avoid deadlock if one monitored locator tries to access another locator which is in turn trying to
The reason for separating the location service into two classes is to avoid deadlock if one monitored locator tries to access another locator which is in turn trying to access the former at the same time. In the final version, no monitored operation invokes another monitored operation, which ensures that this service is free of deadlock.

const dataFinder <-
   class dataFinder
      attached const field locators : Vector.of[locator] <- Vector.of[locator].create[16]
      attached field count : Integer
      export operation get[ind: Service] -> [result: Boolean]
         result <- locators[0].isitHere[ind]
         if !result then
            for i : Integer <- 1 while i < count by i <- i + 1
               result <- locators[i].objReq[ind]
               if result then
                  locators[0].add[ind]
                  exit
               end if
            end for
         end if
      end get
   end dataFinder

Figure 5.3: The Service Finder Class

The ease of designing locator clearly exemplifies the advantage of using Emerald for a distributed system such as Emu. The consistency, remote invocation and migration services required by the location service are all inherently provided by Emerald. This saved considerable time in implementing locator, whose naive version is less than 100 lines of Emerald code.

The real difficulty in successfully completing the application was in choosing the data structures to use, given the memory limitations of the LANai. For example, using an array to keep track of locators in the Emu world failed when the location service was migrated to the LANai: lanai-emx was unable to malloc memory for the array data structure. Hence, a vector was used instead, as shown in Figure 5.3. Vector is the most primitive mechanism in Emerald for indexable storage; this builtin object is used in the implementation of the array in Emerald. Vector is a mutable object and supports the basic subscripting operations.
Using arrays would have been more appropriate because, unlike vectors, they can grow dynamically and hence can be extended at runtime as the number of nodes in the cluster increases.

5.3.2 Experimental Setup

The experimental framework is described in Section 3.3. A dataFinder and a locator are created for each application on the host. The application invokes the dataFinder to locate a service. The dataFinder maintains state about the other locators in the Emu world by invoking the getlocationServer operation on all the remote host nodes. The dataFinder checks its local locator and, if it receives a negative reply, invokes the remote locators. Experiments were performed for two configurations.

• Configuration 1: locator, dataFinder and the application execute on the host-emx of Emu, as shown in Figure 5.4. In this situation, the lanai-emx of Emu is involved only in the routing of messages.

• Configuration 2: the locator and dataFinder are migrated to run on the lanai-emx, with the application running on host-emx, as shown in Figure 5.5.

The two configurations can be compared with the help of Figures 5.4 and 5.5. In these figures, only three hosts in the cluster are depicted, for clarity. The number of I/O crossings, interrupts and context switches avoided in Figure 5.5, compared to Figure 5.4, suggests that there are significant advantages in migrating the location service to the LANai. The next section discusses the benefits achieved, and the overhead avoided, by running the location service on the LANai.

5.3.3 Efficiency of Emu

Migrating the location service to the LANai alleviates the I/O bottleneck discussed in Chapter 1.

Figure 5.4: The Location Service on host-emx of Emu

The advantages of the Emu system are clearly demonstrated in the scenario where only 3 hosts are considered.

• Traversal of the I/O bus is reduced, as all the remote invokes and their replies are handled by the lanai-emx.
In configuration 1, a maximum of 6 traversals of the I/O bus is reduced to a constant 2 traversals in configuration 2.

• Unnecessary main memory copies across the I/O bus are avoided. All remote invocations in configuration 2 use the LANai SRAM.

• There is always only 1 interrupt on the host for any invocation of the location service.

• The reduction in interrupts avoids context switches, which consume a significant amount of CPU time.

Each of the benefits listed above results in lower demands placed on the host CPU, freeing more of its resources for useful work.

Figure 5.5: The Location Service on lanai-emx of Emu

In this evaluation, a maximum of 2 context switches is achieved for any request on dataFinder, even in the worst-case scenario where all remote locators have to be invoked. This constant holds whether the cluster has 3 nodes or 100.

This evaluation can be generalized to the n nodes of the cluster. The following equations estimate the time spent in locating a service in the worst case, where all n locators have to be invoked, for the two configurations:

CT(n) = n(t_cp) + n[2(t_c + t_pci + t_mem) + t_int] + 2(n-1)(t_net)    (1)

NT(n) = n(t_np) + [2(t_c + t_pci + t_mem) + t_int] + 2(n-1)(t_net)    (2)

where CT(n) and NT(n) are the times to locate a service, in the worst case, for Figures 5.4 and 5.5 respectively. The factors involved are:

• n is the number of hosts in the cluster.

• t_c is the time spent in context switching in the kernel.

• t_int is the kernel interrupt time.

• t_pci is the PCI traversal time for DMA between the host and the LANai, which is approximately 15.8 microseconds for 256 bytes.

• t_mem is the time spent in memory copies in the host, which is approximately 1.02 microseconds for 256 bytes, assuming the data is in the cache.

• t_net is the time spent for each DMA between the LANai and the network, which is a little less than 15.8 microseconds for 256 bytes.
• t_cp is the processing time on the host CPU, which is about 10.3 microseconds.

• t_np is the processing time of the network processor, which is approximately 59.4 microseconds (about 5.7 times t_cp).

The times reported for t_pci and t_mem are for messages of size 256 bytes, as the request messages are small. There is only one final response message, of size 6 KB, for any service request, and hence it is not taken into the calculation. Moving the location service to the LANai saves about n(15.8 + 1.02) microseconds in host/LANai DMAs and memory copies on the host. The measured network processing time, however, is on average about 5.7 times slower than the measured host processing time. Thus, the savings in host overhead are accompanied by an increase in the processing time on the LANai.

From the equations it is clear that context switches, memory copies and interrupts grow with the number of nodes in equation (1), but remain constant in equation (2). For a sample run of 256 random service requests in a 6-node cluster, for example, the savings amounted to about 892 context switches and about 446 interrupts, just by moving the location service to the LANai. Thus, when the location service runs on the LANai, the saved processing time can be spent by the n hosts on useful work.

5.3.4 Applications that could Use the Location Service

Most applications built for a distributed environment need some kind of location service if their services are distributed and no central state information is maintained. Even if state is maintained, services could migrate, and correctly updating the state information might be impossible. In such a case, multicasting is often used to locate the requested service.

The location service implemented here is a generic service, demonstrated using integers. However, the Emerald code could easily be modified to use any other type defined by the programmer.
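To see that the design is independent of the service type, the two-class structure can be sketched outside Emerald. The Python below is a hypothetical model, not part of Emu; the class and method names simply mirror Figures 5.2 and 5.3, a built-in set stands in for the monitored Set, and the serial multicast is a plain loop over the known locators:

```python
class Locator:
    """Tracks the services available at one node (cf. Figure 5.2)."""

    def __init__(self):
        self._local = set()

    def add(self, service):
        self._local.add(service)

    def remove(self, service):
        self._local.discard(service)

    def obj_req(self, service):
        """Remote request: hand over the service if present, as objReq does."""
        if service in self._local:
            self._local.remove(service)
            return True
        return False

    def is_it_here(self, service):
        return service in self._local


class DataFinder:
    """Checks the local locator first, then serially multicasts (Figure 5.3)."""

    def __init__(self, locators):
        self.locators = locators  # locators[0] is the local node

    def get(self, service):
        if self.locators[0].is_it_here(service):
            return True
        for loc in self.locators[1:]:
            if loc.obj_req(service):
                # Cache the found service at the local node, as in Figure 5.3.
                self.locators[0].add(service)
                return True
        return False
```

Replacing the string or integer keys with page identifiers or matrix-block descriptors changes nothing in the lookup logic, which is the point of the generic design.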
It could also be optimized to perform in a more intelligent manner. Examples of systems that might use this service are:

• Distributed directories, where the data type would be strings or files.

• A global memory system, which would use pages as the service data type.

• Parallel matrix multiplication, where the data type would be matrices.

5.4 Discussion

The main motivations for Emu, discussed in Chapter 1, were to alleviate the I/O bottleneck and to offload all unnecessary processing from the host. The locator implementation clearly demonstrates that Emu has satisfied the goals stated in Chapter 1: avoiding context switches, interrupts, unnecessary memory copies and I/O traversals, and utilizing the processing power of the network processor for more useful work than basic flow control.

Emu is an attempt to reduce the overhead involved in message transmissions, as suggested by the analysis in [MVCA97]. There, Martin et al. state that applications demonstrate strong sensitivity to the overhead involved in sending or receiving a message and to the gap between transmissions, but are surprisingly tolerant of increased latency and lower per-byte bandwidth. Their results, obtained using a wide variety of chosen applications, verify this claim. The Emu system reduces the overhead by eliminating host processing as much as possible.

The system works well if the right applications are chosen for migration to the network processor. Applications that are computation intensive are not going to benefit from running on the lanai-emx, because the network processor is slower than the host processor by a factor of 8. Emu provides support for Emerald object processing on the network processor; it does not decide whether an object should execute on the central processor or the network processor. It is the responsibility of the programmer to make that decision.
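The worst-case model of equations (1) and (2) in Section 5.3.3 gives one way to reason about that decision. The Python sketch below plugs in the measured constants (t_pci of about 15.8 microseconds, t_mem of about 1.02 microseconds, t_net just under 15.8 microseconds, t_cp of about 10.3 microseconds, t_np of about 59.4 microseconds); the context-switch and interrupt times t_c and t_int were not reported, so the values used here are placeholders:

```python
def ct(n, t_c, t_int, t_cp=10.3, t_pci=15.8, t_mem=1.02, t_net=15.8):
    """Equation (1): worst-case lookup time with the service on the hosts.

    The bracketed host-overhead term is incurred once per node visited.
    """
    return n * t_cp + n * (2 * (t_c + t_pci + t_mem) + t_int) + 2 * (n - 1) * t_net


def nt(n, t_c, t_int, t_np=59.4, t_pci=15.8, t_mem=1.02, t_net=15.8):
    """Equation (2): the same lookup with the service on the LANai.

    The host-overhead term no longer scales with n; only the (slower)
    network-processor time does.
    """
    return n * t_np + (2 * (t_c + t_pci + t_mem) + t_int) + 2 * (n - 1) * t_net


# With placeholder values t_c = 10 us and t_int = 5 us, compare the two
# configurations as the cluster grows (times in microseconds):
for n in (3, 6, 12, 24):
    print(n, round(ct(n, 10.0, 5.0), 2), round(nt(n, 10.0, 5.0), 2))
```

With these placeholder values, the per-node host overhead in equation (1) eventually outweighs the slower network processor in equation (2), so NT(n) undercuts CT(n) once the cluster is large enough; the crossover point depends on the actual t_c and t_int of a given host.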
Components of applications that are communication intensive can benefit from Emu's support for processing on the NIC. Intelligent offloading of communication-intensive components from the host to the LANai saves host processing power for computation-intensive components that are better off executing on host-emx.

Chapter 6

Conclusions and Future Work

6.1 Conclusions

The Emu system has made possible the complete implementation of an interpreter on the network processor. The choice of an object-oriented distributed interpreter, such as Emerald, makes the applications that run on the system more modular and compact because the language takes care of all the gory details. Emu has introduced runtime extension of the functionality on the network processor, and an approach for utilizing idle network processing power.

Emu has satisfied all the thesis goals stated in Chapter 1: first, migration of the Emerald interpreter to the stand-alone embedded environment of the LANai has been completed; second, the interpreter running on the central processor has been modified to communicate via the Myrinet interface; third, the I/O bottleneck problem has been attacked successfully. Unnecessary host interrupts, associated context switches, memory copies and CPU cycles can be avoided with the help of the interpreter running within the network processor.

Emu can improve the throughput of the system if intelligent decisions are made in choosing those components of applications that will run best on the network processor. Emu does not make migration decisions, and leaves that responsibility to the end application. However, it provides all the support necessary for the actual migration process. The performance of the Emu system in satisfying these goals is demonstrated by the locator service.
A formula has been derived which compares the total processing time in locating a service when the processing is done on the central processor to that obtained by moving the processing to the network processor. When the location service executes on the LANai, the Emu system requires only a single host interrupt to locate a service, regardless of the number of nodes in the cluster, even in the worst-case scenario where all the nodes have to be invoked.

Emu has proved that processing on the network processor can increase, or at least maintain, overall system throughput, while at the same time saving host processing for more useful work during heavy computational workload. However, intelligent delegation of processing to the network processor is required, because the network processor is slower than the central processor by a factor of 8. Communication-intensive applications which require low latency and high throughput would benefit from the interpreter on the network adaptor in the Emu system.

Emu has paved the way for other similar systems to be built. It has demonstrated that it is possible to take full advantage of a programmable NIC to increase system throughput, and it has shown that with an intelligent choice of applications, idle network processing power can be utilized without sacrificing performance.

6.2 Effort Required in Creating Emu

Implementing lanai-emx, in the Emu system, meant writing programs for the bare hardware, with little support for programming. This required learning the intricate details of the NIC registers and the Myrinet network. Although the design was simple, the actual implementation was a painful learning experience. This can be attributed to the different environment of the LANai processor and to bugs in the Emerald interpreter code which became more pronounced when the communication interface was modified to use the Myrinet network.
The main limitations of the LANai processor were the lack of an OS, the lack of standard C library support in the lanai3-gcc compiler, the 1 MB of SRAM memory, and the absence of debugging tools. Protocols were built to provide OS and libc support, and debugging tools were written. The 1 MB of SRAM forced a smaller and more compact version of the Emerald interpreter; in fact, the optimized version of the interpreter code, with no debugging support, is only about 209 KB of compiled C code.

The implementation of Emu helped make Emerald more robust. First, it helped find the dormant bugs that became active with the introduction of the new interface; examples of not completely correct code were found in the move and remote invocation implementations. Second, the original version had separate tables for mapping oids (numeric object IDs) to objects and objects to oids. Due to the memory limitations, the implementation was modified to keep both mappings in a single table with a more complex mapping function, but with a 50% savings in memory usage. Third, in the original version, the distribution package was implemented using user-level threads. Due to the lack of a thread utility and space constraints on the LANai 4 processor, the current version of Emerald supports a single-process distribution package for the LANai as well as the original package with threads.

Emu would significantly reduce the system-creation effort for any future research that makes such extensive use of the network processor. It has demonstrated the difficulties that one would encounter while implementing such systems. The debugging tools and the protocols implemented to overcome the limitations of a network adaptor with a different environment provide the steps to take for building a similar system. Thus, Emu not only shows that such a system is possible but also provides useful groundwork for future systems on the NIC.
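The second optimization above, folding the two oid mapping tables into one, can be illustrated with a small sketch. The Python below is hypothetical and much simpler than Emerald's actual mapping function; it only shows how tagging the keys lets a single hash table answer lookups in both directions:

```python
class OidTable:
    """One hash table serving both oid -> object and object -> oid lookups.

    Keys are tagged tuples, so the two key spaces cannot collide.  Compared
    with two separate tables, the two directions share a single table's
    bucket array and bookkeeping.
    """

    def __init__(self):
        self._table = {}

    def insert(self, oid, obj):
        # Store both directions under distinguishable keys in the same table.
        self._table[("oid", oid)] = obj
        self._table[("obj", id(obj))] = oid

    def object_for(self, oid):
        return self._table.get(("oid", oid))

    def oid_for(self, obj):
        return self._table.get(("obj", id(obj)))
```

The sketch only illustrates the direction-tagging idea; the 50% memory saving reported for Emu comes from the details of its own mapping function on the LANai, not from this simplified scheme.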
6.3 Future Work

Emu is a system that provides support for network processing on the LANai processor. It could be improved to perform better with some optimizations and modifications to certain parts of the Emerald interpreter. The following sections discuss some features that would help improve the Emu system and suggest some additional applications that could benefit from it.

6.3.1 Memory Fragmentation Problems

The malloc and free routines, implemented to allocate and free space from the LANai heap, do not provide for the defragmentation of memory when it is freed. The resulting fragmentation creates havoc in the already stringent heap space, which is about 256 KB of SRAM.

Hence, even though pages are freed as expected, malloc reports failure if space is requested for a large number of pages, because no single contiguous block of memory is available to satisfy the request. If some form of defragmentation were implemented for the heap area, more efficient utilization of memory would be possible.

6.3.2 A More Precise Debugger for the LANai

The debugger implemented for the lanai-emx part of Emu takes the simple approach of using printf stubs. It helped solve the memory problems in Emu but was not easy to use. For example, there is no way to read the point of execution in the program, as is possible with the where and backtrace commands in gdb, or to display the values of variables when the program on the LANai dies due to a segmentation fault. A more precise debugging tool would avoid unnecessary printing during the process of debugging memory faults, and could provide only the necessary information when the program fails.

6.3.3 Performing GMS Lookups on the LANai

One of the applications that would benefit from Emu is the global memory system [Fee96]. It has two main operations, getpage() and putpage().
GMS makes extensive use of the communication bandwidth. If these requests are directed to an engine on the LANai which can DMA pages from/to host memory without consuming CPU cycles, the host processor need not be bothered with remote requests. In this way, a host lends its memory to trusted hosts but is not disturbed to process remote messages. If the hashtable of the GMS is moved to the LANai, the getpage() and putpage() requests can be handled entirely by the LANai, which could make use of the location service explained in the previous chapter. Moreover, most of the processing lies in the transferring of pages, which makes these requests communication intensive; hence, the lower speed of the network processor compared to the central processor can be ignored.

6.3.4 Support for Active Networking on the NIC

Kilroy, in Chapter 5, demonstrated the capability of Emu to provide support for Active Networking. Applications could be created, following kilroy's pattern, which run only on the NICs and do useful work as they migrate, such as checking state and modifying it appropriately. For example, an active object migrating among the NICs could be responsible for maintaining up-to-date information in the hash tables of the GMS application described above.

6.3.5 Dynamic Routing Application

The concept of supporting an intelligent interpreter engine on the NIC can be extended to a local area network. In a local area network, however, the unpredictable latency in traversing the network might mean that traditional paging to disk is faster than paging to global memory; other applications might still benefit from the Emu concept. If the engine on the NIC is modified to make intelligent routing decisions based on (i) network partitioning, (ii) heavy traffic on the standard route, and (iii) a new shorter route becoming available, then a dynamic routing scenario is established.
First, the engine on the NIC need not bother the host, thereby avoiding interrupts and context switches, among other things. Second, manual intervention is unnecessary, because the network can automatically reconfigure itself to changes almost instantaneously, which is not the case otherwise.

6.3.6 HP Java Engine as Framework for Emu

The HP Java Virtual Machine [Vij98] is designed for embedded systems with resource constraints, such as the Myrinet NIC with its memory limitation of 1 MB of SRAM. It complies with Sun's Java Virtual Machine specification but, at 0.5 MB of code, provides more flexibility. Using HP Java as the framework for Emu, instead of Emerald, would be interesting future work. With Java being so popular, using it as the basis for Emu would make the latter more easily accepted as a new programming model for a cluster.

6.4 Final Remarks

The Emu system is a model that supports migration of active objects to network processors, thereby offloading the hosts for other useful work. The runtime migration of services to the network processor is made possible by the interpreting engine designed for the NIC, which provides all the features needed to support active networking applications, as stated above. Emu has proved that processing on the network processor is possible. But good throughput in such a system depends on proper delegation of responsibilities to the NIC, keeping in mind that the network processor is much slower than the host processor. Unless network processors are made faster, the performance of such a system will be limited by the speed of the network processor. Emu has provided a model for avoiding the I/O bottleneck by utilizing idle network processing power, but the new network-processing bottleneck it introduces has to be tackled before more complex processing can occur on the network interface cards.

Bibliography

[Ach93] Bruno Achauer. The DOWL Distributed Object Oriented Language.
Communications of the ACM, 36(9):48-55, September 1993.

[AF89] Yeshayahu Artsy and Ralph Finkel. Designing a Process Migration Facility: The Charlotte Experience. IEEE Computer, 22(9):47-56, September 1989.

[BCF+95] Nanette J. Boden, Danny Cohen, Robert E. Felderman, Alan E. Kulawik, Charles L. Seitz, Jakov N. Seizovic, and Wen-King Su. Myrinet: A Gigabit-per-Second Local-Area Network. IEEE Micro, 15(1):29-36, February 1995.

[BCZ96a] Samrat Bhattacharjee, Kenneth L. Calvert, and Ellen W. Zegura. Implementation of an Active Networking Architecture. White paper presented at the Gigabit Switch Technology Workshop, College of Computing, Georgia Institute of Technology, Washington University, St. Louis, July 1996.

[BCZ96b] Samrat Bhattacharjee, Kenneth L. Calvert, and Ellen W. Zegura. On Active Networking and Congestion. Technical Report GIT-CC-96-02, College of Computing, Georgia Institute of Technology, 1996.

[BHJ+87] Andrew Black, Norman C. Hutchinson, Eric Jul, Henry Levy, and Larry Carter. Distribution and Abstract Types in Emerald. IEEE Transactions on Software Engineering, 13(1):65-76, January 1987.

[BHJL86] Andrew Black, Norman Hutchinson, Eric Jul, and Henry Levy. Object Structure in the Emerald System. In Proceedings of the First ACM Conference on Object-Oriented Programming Systems, Languages and Applications, pages 78-86, Portland, Oregon, September 1986.

[Bic95] Lubomir F. Bic. MESSENGERS: Distributed Computing using Autonomous Objects. Technical Report TR-95-19, 1995.

[BJM+96] Greg Buzzard, David Jacobson, Milon Mackey, Scott Marovich, and John Wilkes. An Implementation of the Hamlyn Sender-Managed Interface Architecture. In Proceedings of the USENIX 2nd Symposium on Operating Systems Design and Implementation, Seattle, Washington, October 1996.

[BL85] A. Barak and A. Litman. MOS: A Multi-Computer Distributed Operating System. Software—Practice and Experience, pages 725-737, August 1985.
[BMKK97] A. Baratloo, M. Karaul, H. Karl, and Z. M. Kedem. KnittingFactory: An Infrastructure for Distributed Web Applications. Technical Report TR1997-748, New York University, November 1997.

[Boo96] G. Booch. Object-Oriented Development with Java. Report on Object Analysis and Design, SIGS Publications, 2(6):2-4, March 1996.

[BSW89] Ammon Barak, Ammon Shiloh, and Richard Wheeler. Flood Prevention in the MOSIX Load-Balancing Scheme. IEEE Computer Society Technical Committee on Operating Systems Newsletter, 3(1):23-27, 1989.

[BVW95] Matt Bishop, Mark Valence, and Leonard F. Wisniewski. Process Migration for Heterogeneous Distributed Systems. Technical Report PCS-TR95-264, Department of Computer Science, Dartmouth College, August 1995.

[BWvE97] Anindya Basu, Matt Welsh, and Thorsten von Eicken. Incorporating Memory Management into User-Level Network Interfaces. Technical Report TR97-1620, Department of Computer Science, Cornell University, August 1997. Presented at Hot Interconnects V, August 1997, Stanford University.

[CAL+89] Jeffrey S. Chase, Franz G. Amador, Edward D. Lazowska, Henry M. Levy, and Richard J. Littlefield. The Amber System: Parallel Programming on a Network of Multiprocessors. In Proceedings of the Twelfth ACM Symposium on Operating System Principles, pages 147-158, Litchfield Park, Arizona, December 1989.

[Che88] David R. Cheriton. The V Distributed System. Communications of the ACM, 31(3):314-333, March 1988.

[dKdP88] W. de Kinder and M. de Prycker. ATM: A Logical Evolution from ISDN. In 1st European Conference on Information Technology for Organisational Systems, Elsevier Science Publishers B.V., 1988.

[Dou91] Fred Douglis. Transparent Process Migration: Design Alternatives and the Sprite Implementation. Software—Practice and Experience, 21(8):757-785, August 1991.

[DRS89] F. Brent Dubach, Robert M. Rutherford, and Charles M. Shub.
Process-Originated Migration in a Heterogeneous Environment. In Proceedings of the ACM Conference on Computer Science, New York, 1989.

[ea95] David Culler et al. Generic Active Message Interface Specification. Technical report, Department of Computer Science, Berkeley University, February 1995. Available from URL http://now.cs.berkeley.edu/Papers/Papers/gam_spec.ps.

[FBDM96] Munehiro Fukuda, Lubomir F. Bic, Michael B. Dillencourt, and Fehmina Merchant. Intra- and Inter-Object Coordination with MESSENGERS. In Proceedings of the First International Conference on Coordination Models and Languages, Cesena, Italy, April 1996.

[Fee96] Michael J. Feeley. Global Memory Management for Workstation Networks. PhD thesis, University of Washington, Seattle, Washington, December 1996.

[GHK+97] Carl A. Gunter, Mike Hicks, Pankaj Kakkar, Jonathan Moore, Scott Nettles, and Sam Weber. PLAN: A Programming Language for Active Networks. Technical report, Department of Computer and Information Science, University of Pennsylvania, July 1997. Available from URL http://www.cis.upenn.edu/~switchware/PLAN/.

[HC91] C. Horn and V. Cahill. Supporting Distributed Applications in the Amadeus Environment. Computer Communications, 14(6):358-365, July 1991.

[HDMS97] Gregory Henley, Nathan Doss, Thomas McMahon, and Anthony Skjellum. BDM: A Multiprotocol Myrinet Control Program and Host Application Programmer Interface. Technical Report MSSU-EIRS-ERC-97-3, Integrated Concurrent and Distributed Computation Research Laboratory, Mississippi State University, May 1997. Available from URL http://www.erc.msstate.edu/labs/hpcl/bdm.html.

[HDS97] Gregory Henley, Nathan Doss, and Anthony Skjellum. BDT: A Thread Library for the Myricom LANai 4.x Communications Processor. Technical Report MSSU-EIRS-ERC-97-2, Integrated Concurrent and Distributed Computation Research Laboratory, Mississippi State University, January 1997. Available from URL http://www.erc.msstate.edu/labs/hpcl/bdm.html.

[JLHB88] Eric Jul, Henry M. Levy, Norman C. Hutchinson, and Andrew P. Black. Fine-Grained Mobility in the Emerald System. ACM Transactions on Computer Systems, 6(1):109-133, February 1988.

[Kak97] Pankaj Kakkar. The PLAN Tutorial. Technical report, Department of Computer and Information Science, University of Pennsylvania, July 1997. Available from URL http://www.cis.upenn.edu/~switchware/PLAN/.

[LA89] Luke Lin and Mustaque Ahamad. Checkpointing and Rollback-Recovery in Distributed Object Based Systems. Technical Report GIT-ICS-89/43, School of Information and Computer Science, Georgia Institute of Technology, November 1989.

[LJP93] Rodger Lea, Christian Jacquemot, and Eric Pillevesse. Cool: System Support for Distributed Programming. Communications of the ACM, 36(9):37-46, September 1993.

[LS91] Michael Litzkow and Marvin Solomon. Supporting Checkpointing and Process Migration Outside the UNIX Kernel. In Proceedings of the Usenix Winter 1992 Technical Conference, pages 283-290, Berkeley, CA, USA, January 1991. Usenix Association.

[LTBL97] Michael Litzkow, Todd Tannenbaum, Jim Basney, and Miron Livny. Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System. Technical Report 1346, Computer Science Department, University of Wisconsin-Madison, April 1997. Available from URL http://www.cs.wisc.edu/condor/publications.html.

[MDP+96] Dejan Milojicic, Fred Douglis, Yves Paindaveine, Richard Wheeler, and Songnian Zhou. Process Migration. Technical report, TOG Research Institute, University of Toronto, AT&T Laboratories, and Platform Computing, October 1996. Available from URL http://www.opengroup.org/~dejan/papers/indx.htm.

[MH97] Jonathan T. Moore and Michael W. Hicks. PLAN Programmer's Guide. Technical report, Department of Computer and Information Science, University of Pennsylvania, July 1997. Available from URL http://www.cis.upenn.edu/~switchware/PLAN/.

[Moo97] Jonathan T. Moore. PLAN Grammar. Technical report, Department of Computer and Information Science, University of Pennsylvania, July 1997. Available from URL http://www.cis.upenn.edu/~switchware/PLAN/.

[MV93] Herman Moons and Pierre Verbaeten. Object Migration in a Heterogeneous World - A Multi-Dimensional Affair. In Proceedings of the Third International Workshop on Object Orientation in Operating Systems, pages 62-72, Asheville, North Carolina, December 1993.

[MVCA97] Richard P. Martin, Amin M. Vahdat, David E. Culler, and Thomas E. Anderson. Effects of Communication Latency, Overhead, and Bandwidth in a Cluster Architecture. In Proceedings of ISCA '97, pages 85-97, Denver, CO, 1997. ACM.

[Myr96] Myrinet internal report, Myricom, Inc., November 1996. Available from URL http://www.myri.com/myrinet/cables/m2m-widget.html.

[Myr97a] Myrinet internal report, Myricom, Inc., September 1997. Available from URL http://www.myri.com/myrinet/.

[Myr97b] Myrinet internal report, Myricom, Inc., July 1997. Available from URL http://www.myri.com/myrinet/PCI/m2m-pci32.html.

[Myr97c] Myrinet internal report, Myricom, Inc., September 1997. Available from URL http://www.myri.com/myrinet/pitch.html.

[Myr97d] Myrinet internal report, Myricom, Inc., July 1997. Available from URL http://www.myri.com/myrinet/sw-index.html.

[Myr97e] Myrinet internal report, Myricom, Inc., July 1997. Available from URL http://www.myri.com/myrinet/switches/m2m-dual-sw8.html.

[Nut94] Mark Nuttal. A Brief Survey of Systems Providing Process or Object Migration Facilities. Operating Systems Review, pages 64-80, October 1994.

[Ous93] John K. Ousterhout. Tcl: An Embeddable Command Language.
Technical Report CSD-89-541, University of California, Berkeley, August 1993.

[PCA95] David A. Patterson, David E. Culler, and Thomas E. Anderson. A Case for NOW (Networks of Workstations), abstract. In Proceedings of the Fourteenth Annual ACM Symposium on Principles of Distributed Computing, page 17, Ottawa, Ontario, Canada, August 1995.

[PM83] Michael L. Powell and Barton P. Miller. Process Migration in DEMOS/MP. In Proceedings of the Ninth ACM Symposium on Operating System Principles, pages 110-119, October 1983.

[RAA+88] M. Rozier, V. Abrossimov, F. Armand, I. Boule, M. Gien, M. Guillemont, F. Herrmann, C. Kaiser, S. Langlois, P. Leonard, and
Integrated  Programs Concurrent  Laboratory, Available  Thomas  Missis-  from U R L  http://www.erc.msstate.edu/icdrl/learn_mcp/smp.tar.gz. [Shu90]  Charles M . Shub. Native Code Process-Originated Migration in a Heterogeneous Environment. In Proceedings of the ACM Conference on Computer Science, pages 266-270, New York, 1990. 96  [SJ95]  Bjarne Steensgaard and Eric Jul. Object and Native Code Thread M o bility Among Heterogeneous Computers. In Proceedings of the Fifteenth ACM Symposium on Operating System Principles, pages 68-78, Copper Mountain Resort, Colorado, December 1995.  [SM]  Alexander B . Schill and Markus U . Mock.  D C E + + : Distributed  Object-Oriented System Support on top of O S F D C E . Technical report, Institute of Telematics, University of Karlsruhe, Germany. [Smi97]  Peter W . Smith. The Possibilities and Limitations  of Heterogeneous  Process Migration. P h D thesis, Computer Science Department, University of British Columbia, Vancouver, B C V 6 T 1Z4, October 1997. [tel96a]  Telescript Technology: Mobile Agents. Telescript internal report, General Magic, 1996.  [tel96b]  Telescript Technology: A n Introduction to the Language. Telescript internal report, General Magic, 1996.  [tel96c]  Telescript Technology: The Foundation for the Electronic Marketplace. Telescript internal report, General Magic, 1996.  [tel96d]  Telescript Technology: Scenes from the Electronic Marketplace. Telescript internal report, General Magic, 1996.  [TGSK96]  D . L . Tennenhouse, S.J. Garland, L . Shrira, and M . F . Kaashoek. From Internet to Activenet. Request for comments, Laboratory for Computer Science, M I T , January 1996.  [TH91]  M . M . Theimer and B . Hayes. Heterogeneous Process Migration by Recompilation. In Proceedings of the Eleventh International on Distributed  Computing  Systems, May 1991.  Technical Report CSL-92-3. 97  Conference  Also: Xeroc P A R C  [TLC85]  Marvin M . Theimer, Keith A . Lantz, and David R. Cheriton.  
Pre-  emptable Remote Execution Facilities for the V-System. Proceedings of the Tenth ACM Symposium on Operating System Principles, December 1985. [TSS+97]  David L . . Tennenhouse,  Jonathan  M . Smith, W . David Sincoski,  David J . Wetherall, and Gary J . Minden. A Survey of Active Network Research. IEEE Communications, [TW96]  35, Nr 1:80-86, January 1997.  David L . Tennenhouse and David J . Wetherall. Towards an Active Network Architecture.  Computer Communication  Review, 26, Nr 2,  April 1996. A n earlier version of this paper was delivered during the keynote session of Multimedia Computing and Networking, San Jose, C A , January 96. [vEBBV95] Thorsten von Eicken, Anindya Basu, Vineet Buch, and Werner Vogels.  U-Net:  Distributed  A User-Level Network Interface for Parallel and  Computing.  In  Proceedings  of the Fifteenth  ACM  Symposium on Operating System Principles,  Copper Mountain Re-  sort,  Available from U R L  Colorado,  December  1995.  Also:  http://www.cs.cornell.edu/Info/Projects/U-Net/. [Vij98]  Jaikumar Vijayan.  HP's Java effort raises questions on Java's Fu-  ture.  Technical report, HP, Inc., March 1998.  Available from  URL  http://www.idg.net/idgJrames/english/content.cgi?vc=docid_9-  50365.html. [WPE+83]  B . Walker, G . Popek, R. English, C . Kline, and G . Thiel. The LOCUS Distributed Operating System. In Proceedings of the Ninth ACM Symposium on Operating System Principles, pages 49-70, Bretton Woods, 1983.  98  [WT96]  Dave Wetherall and David Tennenhouse. The A C T I V E IP Option. In 7th ACM SIGOPS European Workshop, Connemara, Ireland, September 1996.  [Zay87]  E . Zayas. Attacking the Process Migration Bottleneck. In Proceedings of the Eleventh ACM Symposium on Operating System Principles, pages 13-24, Austin, T X , November 1987. A C M .  99  

