Service Migration in a Gigabit Network

by

Margaret A.S. Petrus

B.S., Bishop's University, Canada, 1996

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in The Faculty of Graduate Studies (Department of Computer Science).

We accept this thesis as conforming to the required standard.

The University of British Columbia
August 1998
© Margaret A.S. Petrus, 1998

In presenting this thesis in partial fulfillment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Computer Science
The University of British Columbia
2366 Main Mall
Vancouver, BC, Canada V6T 1Z4

Abstract

Migrating services to the network adaptor makes use of the idle processing power in the adaptor that would otherwise be wasted if it was just used for sending and receiving messages. In addition, by moving some functionality to the network adaptor, decisions can be made without involving the host. This would prevent unnecessary memory copies, I/O traversals, interruptions of the host and the associated context switches. Thus the host can use its processing power, without interruption, for more useful work.

The Emu system has made possible the complete implementation of the Emerald interpreter on the network processor. The choice of an object oriented distributed interpreter makes the applications that run on the system more modular and compact because the language takes care of all the gory details. Emu has introduced runtime extension of the functionalities on the network processor and an approach for utilizing idle network processing power. However, intelligent delegation of processing to the network processor is required, because the network processor is slower than the central processor. Communication-intensive applications which require low latency and high throughput would benefit from the interpreter on the network adaptor in the Emu system.

Emu has paved the way for other similar systems to be built. It has demonstrated that it is possible to take full advantage of the programmable NIC to increase system throughput. It has shown that with intelligent choice of applications, idle network processing power can be utilized without sacrificing performance.

Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgements
Dedication

1 Introduction
  1.1 Motivation
  1.2 What is Service Migration?
  1.3 Problem Definition
  1.4 The Network Environment
  1.5 The Emerald Approach
  1.6 Thesis Structure

2 Related Work
  2.1 Existing Migration Systems
    2.1.1 Traditional Migration Systems
    2.1.2 Object Migration Systems/Languages
    2.1.3 Heterogeneous Migration Systems
  2.2 Moving Functionality to the Network Adaptor
    2.2.1 U-Net
    2.2.2 Hamlyn
  2.3 Other Languages Considered for the Interpreter
    2.3.1 Java - Language of the Internet
    2.3.2 TCL
    2.3.3 Existing Active Networking Systems/Languages
  2.4 Contributions of this Thesis

3 Background
  3.1 Myricom Technology and Details
    3.1.1 LANai 4 Processor
    3.1.2 Myrinet API
  3.2 Benchmarks Used for the Choice of the Host Processors
    3.2.1 Results of bw_mem_cp and hswap Benchmarks
    3.2.2 Results of lmbench Test Suite
    3.2.3 Results of netperf Benchmarks
  3.3 Layout of the Myrinet Network Testbed
    3.3.1 M2M-DUAL-SW8 Myrinet Switch
    3.3.2 M2M-PCI32C Myrinet-SAN/PCI Interface
    3.3.3 M2M-CB-05 Cables
  3.4 Language Support
    3.4.1 Emerald Primitives to Support Mobility
    3.4.2 Structure of the Emerald Interpreter
  3.5 Summary

4 Design and Implementation of Emu
  4.1 Key Issues for Moving the Emerald Interpreter to the LANai
    4.1.1 Limited Memory
    4.1.2 Library Support
    4.1.3 Distribution and Mobility
  4.2 Interface to the Host
    4.2.1 OS-Support Protocol
    4.2.2 Print/Debug Protocol
    4.2.3 Shared Memory Structure Between LANai and Host
  4.3 Communication Protocols
    4.3.1 Design of Communication Protocol in lanai-emx
    4.3.2 From LANai to the Host
    4.3.3 From LANai to the Network
    4.3.4 From Host/Network to the LANai
    4.3.5 Reading/Writing from/to the Myrinet Network
  4.4 Message Routing
    4.4.1 Data Structures Used for Routing
    4.4.2 Routing Information Protocol
  4.5 Other Issues
    4.5.1 Emerald over Myrinet vs. Emerald over IP over Myrinet
    4.5.2 Issues Concerning Hardware Resets
    4.5.3 Modifications to the host-emx
  4.6 Summary

5 Performance
  5.1 Micro Benchmarks
    5.1.1 Computation Intensive Micro Benchmarks
    5.1.2 Communication Intensive Micro Benchmarks
    5.1.3 Discussion
  5.2 Application 1: Kilroy
  5.3 Application 2: Location Service
    5.3.1 Design of the Location Service
    5.3.2 Experimental Setup
    5.3.3 Efficiency of Emu
    5.3.4 Applications that could Use the Location Service
  5.4 Discussion

6 Conclusions and Future Work
  6.1 Conclusions
  6.2 Effort Required in Creating Emu
  6.3 Future Work
    6.3.1 Memory Fragmentation Problems
    6.3.2 A More Precise Debugger for the LANai
    6.3.3 Performing GMS Lookups on the LANai
    6.3.4 Support for Active Networking on the NIC
    6.3.5 Dynamic Routing Application
    6.3.6 HPJava Engine as Framework for Emu
  6.4 Final Remarks

Bibliography

List of Tables

3.1 Results of bw_mem_cp and hswap Benchmarks
3.2 Results of lmbench Test Suite
3.3 UDP_STREAM Testing Using netperf
3.4 UDP_STREAM Testing Using netperf (turning off the checksums)
3.5 TCP_STREAM Testing Using netperf
5.1 Computation Intensive Micro Benchmarks
5.2 Communication Intensive Micro Benchmarks

List of Figures

1.1 Emerald Interpreter on the Network Interface Card (NIC)
3.1 Block Diagram of the Myrinet/Host Interface
3.2 Layout of Myrinet Network
3.3 Top View of Dual 8-Port Myrinet-SAN Switch
3.4 Myrinet Cluster Connections
3.5 Emerald Virtual Machine
4.1 Memory Layout on the LANai
4.2 The Host Interface Protocol Memory Structure
4.3 Pseudocode for Handling Requests from the LANai by the Host
4.4 The MCP Shared Memory Structure
4.5 Communication Protocol of the VM on the LANai
4.6 Pseudocode for Sending Packets from LANai to Host
4.7 Pseudocode for Sending Packets from LANai to Network
4.8 Pseudocode for Processing Incoming Packets to the LANai
4.9 Reading Packets from the Myrinet Network
4.10 Writing Packets to the Myrinet Network
4.11 Data Structure to Store Information for Each Switch
4.12 switch-conn Data Structure
4.13 Myrinet Routing Protocol
5.1 Kilroy on Emu
5.2 The Locator Service Class
5.3 The Service Finder Class
5.4 The Location Service on host-emx of Emu
5.5 The Location Service on lanai-emx of Emu

Acknowledgements

A lot of people have influenced my work in various ways during my Masters Program at UBC. It would be impossible to remember all of them, but I will do my best.

First, sincere thanks to my supervisor, Norm Hutchinson, for his invaluable support throughout my work. His constant belief in me and the freedom to explore my own ideas enabled me to gain confidence. Thank you, Dr. Hutchinson, for always listening with such patience when things were not going great.

Thanks to Alistair Veitch for answering my numerous questions while installing the Myrinet network and for reviewing the first draft of my thesis. Thanks are also due to Mike Feeley who was my second reader, and to Peter Smith for his reviews and for being a wonderful friend.

I would like to thank my family for supporting me through school, particularly two very special people, Fatima and Susai Peter. Without their help, it would have been impossible for me to venture off to Canada. I also extend my thanks to Stephen Sinnappan, Celine Stephen, Regina Simon and Pascal Simon.

Other people who have helped either technically or by just being there are Marie-France Beauchesne, Ian Cavers, Christoph Kern, Dwight Makaroff, Holly Mitchell, Gerald Neufeld, Kalpana Ramanujam, Saravanan Sanniyasi, Pradosh Sivadoss, Robert van Hulst, Dimitri Voulioris, Alan Wagner and Marcelo Walter.

Finally I would like to thank my parents and my husband Paul for being there for me through this.

Margaret A.S. Petrus
The University of British Columbia
August 1998

To my Mom and Dad.

Chapter 1

Introduction

1.1 Motivation

The changing technology trends for both local area networks with faster processors, and high speed networks with low latency, make the lagging I/O bus the bottleneck in achieving better throughput in the network systems. With the advent of high speed networks, this I/O bottleneck is a serious concern which limits the full potential of the network. Protocol processing on the network processor, rather than on the host's CPU, can avoid the unnecessary use of the I/O bus which limits the speed of the processing in the network.
The network interface cards (NICs) for high speed networks should have processing capabilities to support such functionality. The new NICs have processors that remain idle except when executing code to control the flow of data between the host and the network. For example, the Myrinet [BCF + 95] NIC contains a general purpose processor which can execute user programs. This is a unique feature because it allows programming the NIC of a high speed gigabit network. This feature could be exploited by moving services to the NIC which would avoid host overhead and I/O transfers. Migrating services to the network adaptor makes use of the idle processing 1 power in the adaptor that would otherwise be wasted if it was just used for sending and receiving messages. In addition, by moving some functionality to the network adaptor, decisions can be made without involving the host. This would prevent unnecessary memory copies, I /O traversals, interruptions of the host and the as-sociated context switches. Thus the host can use its processing power, without interruption, for more useful work. There are several network intensive services that could benefit from this exploitation of the network processor. Two such examples are dynamic routing in an active networking scenario and global memory management in a clustered environment. In the latter case, low latency plays a major role in performance throughputs. Service migration could be extended to building active networking [TSS + 97, WT96] applications. Active Networking is moving executable code in the network rather than just moving passive data. The active code gets executed at routers, which support such services, as it moves in the network. For example, one of the applications that would benefit from this service on the network interface card is dynamic routing, i.e., routing decisions made by code that migrates among the routers. A virtual machine supporting active networking and running within the network adaptor would help the Internet evolve dynamically, without involving the host or constant manual intervention, during failures such as when a router goes down. More important is the introduction of a new routing functionality, where code supporting new routing protocols can be injected into the network dynamically. Another area that will benefit from this approach is the global memory sys-tem (GMS) [Fee96] for a network of workstations. In G M S , global network-wide page replacement decisions are made. When a page-fault occurs, the replaced page is sent to global memory rather than to disk, and the globally least valuable page is discarded from memory, sending it back to disk if necessary. Page faults can thus sometimes be satisfied from the memory of a remote processor. By migrating some 2 code to the network adaptor, the necessary pages can be D M A e d from the remote storing node's memory without OS intervention. This would result both in faster memory access and lower overhead just by migrating a small part of the service to the network interface card. 1.2 What is Service Migration? Service migration involves moving some process, that may be currently active, to some other processor in the network. This technique is adopted for various reasons, the most common of which are to achieve speedup by migrating to faster processors, and for load balancing. Utilizing idle machines or not so overloaded machines leads to the completion of the task faster, while at the same time taking advantage of available idle computational power. 
Another advantage is reducing network band^ width by moving two processes that are communicating a lot but are on different machines to the same machine [Smi97]. Research on service migration has led to many research-oriented or commer-cial systems [Che88, Dou91, B H J + 8 7 ] . Emu, the system implemented as part of this thesis, falls under object migration systems [BHJ + 87, LJP93]. However, unlike earlier object migration systems, objects in Emu may also execute in the network adaptor rather than just on the central processors. This is achieved by running a fully functional interpreter in the network adaptor. It is believed to be the first such system running within a network interface. A preview of the different migration sys-tems is given in the next chapter, with examples from homogeneous, heterogeneous and object migration systems. 1.3 Problem Definition Emu is a system composed of two distributed object oriented virtual machines on a single host, one executing on the host processor and the other on the Myrinet NIC 3 which supports a general purpose programmable processor. Moving an existing interpreter to the Myrinet NIC, to be executable by the bare network processor, is quite unlike porting software for a central processor of a different architecture. There are many challenges involved such as the R A M limit of the network adaptor, the lagging speed of the network processor when compared to the central processor, the programming environment available, handling of system calls due to the lack of an OS on the network adaptor, etc. The interpreter should also handle traffic from both host and network, each of which functions in a different manner in addition to servicing its own processes. The processing of the incoming objects may lead to the creation of new objects to be injected back into the network or to the host. During implementation, each of the above mentioned limitations had to be handled to arrive at a fully functional interpreter. Based on the results, the following thesis statement can be derived: A programmable high speed NIC can service incoming network packets in-dependently without host assistance and can improve aggregate through-put in a network of workstations if non-computation intensive jobs are offloaded to the NIC. Running a distributed object-oriented interpreter on the NIC for the Emu system extends this functionality further by providing dynamic migration of objects back and forth at runtime. This virtual machine on the network interface enables services to be migrated to the network interface as well, as the traditional migration method of moving to other processors. This work has introduced a new meaning to service migration. 1.4 The Network Environment Emu is implemented for the high-speed Myrinet network. Myrinet is a full duplex 1.28+1.28 gigabit-per-second link. For distributed computing or clustered comput-ing, a Myrinet System Area Network (SAN) or Myrinet Local Area Network (LAN) would be the ideal choice for the following reasons [Myr97c]: 1. Speed : connects workstations or PCs through full duplex 1.28+1.28 gigabit per second links and low latency cut-through switches compared to lOMbits/sec or lOOMbits/sec Ethernet or 155Mbits/sec A T M . 2. Convenience and Robustness: allows any network topology, unlike unswitched Ethernet. Being a switched network, it supports source routing and allows dynamic reconfiguration of the network which (a) is a convenience for instal-lation and (b) provides fault-tolerance. 
With the routing tables being stored in the SRAM of the network chip, they can be modified on the fly. 3. Network throughput: being a switched network, Myrinet can carry many packets concurrently, each traversing the network at 1.28 gigabit/second. However, unlike Ethernet and FDDI, which share a common communication medium, the total traffic of Myrinet increases with the number of hosts since the ratio of hosts to switches is constant. 4. Versatility: like Ethernet, Myrinet can carry packets of many types or protocols simultaneously, as its packets can be of varied length and can encapsulate other types of packets such as IP without an adaptation layer, in contrast to, for example, the AAL5 adaptation layer of ATM.

The Myrinet interfaces include a LANai processor, a general purpose custom-VLSI chip which has a 32-bit RISC processor. The programs it executes are called Myrinet Control Programs (MCPs) [Myr97d]. The MCPs supplied by Myricom, Inc., handle interactions with the network and host processors. Any MCP must run within 1MB of RAM, which is what the LANai adaptor supports. Emu's MCP running within the LANai is a complete distributed object oriented interpreter, in addition to controlling the flow of data between the host and the network with the help of the DMA engine on the Myrinet NIC.

The experimental framework consists of the Myrinet SAN network composed of 16 Pentium-II 266MHz, 128MB RAM, 512KB cache, 440FX chipset processors and 16 M2M-PCI32C Myricom boards. These boards are Myrinet-SAN/PCI interfaces, PCI short cards which support 1MB of memory. The benchmarks involved in the selection of the processor are discussed in Chapter 3.

1.5 The Emerald Approach

Emerald [BHJ+87, JLHB88, RTL+91, BHJL86], developed at the University of Washington, is an object oriented programming language that was designed to be used in a distributed environment. Hence, it inherently provides for abstraction, moving objects from one machine to another, locating those objects easily because of their unique global identifiers, and other basic distributed services. It is a complete object-migration system and so provides all the services necessary for migration, and it can be easily extended to provide services for active networking. The language also provides replication and transactions, thereby supporting fault-tolerant and highly available applications.

Emerald provides fine-grained mobility. It supports both small passive data objects and active process objects. In addition to the traditional advantages of migration such as load sharing, fault-tolerance, availability, etc., fine-grained mobility in Emerald also supports the following [JLHB88]:

• It provides a protocol for moving data (objects) from one machine to another without requiring the user to provide marshalling/unmarshalling of the data.

• It improves remote invocation performance by moving parameter objects to the remote node to reduce dependence on the source during the duration of the remote execution.

• It moves objects to sites where references exist in order to simplify distributed garbage collection.

[Figure 1.1: Emerald Interpreter on the Network Interface Card (NIC). A host (CPU and memory) is shown with its Myrinet NIC, which carries the LANai chip, on-board memory and the Emerald VM, connected to a Myrinet switch.]

Emu focussed on building an interpreter for the Emerald byte codes that could execute on the LANai (lanai-emx) and communicate with the one on the host. This interpreter provides features to support service migration and active networking via objects.
It is written in the form of an M C P and is loaded by the host into the memory of the Myrinet interfaces. In Emu, there are two separate Emerald interpreters per host: (a) one on the host (host-emx), and (b) one on the Myrinet interface (lanai-emx) as shown in Figure 1.1. The Emerald emx running within the network interface handles traffic from the network without assistance from the host, lanai-emx is not a simple M C P just processing messages from/to the network and the host. It can actually interpret executable code without involving the host. In other words, a complete virtual machine is running within the L A N a i network interface. The host-emx and lanai-emx communicate via a shared memory structure on the L A N a i . An application to illustrate its functionality was designed and implemented as discussed in Chapter 5. Previous research on migrating functionality, other than basic mux/demux of messages, to the network adapter is discussed in the next chapter. 7 1.6 Thesis Structure The layout bf the following chapters are outlined below: • Chapter 2 reviews prior research on systems that deploy service migration and migrate functionality to the network interface cards. This is followed by a dis-cussion of the languages considered for the framework of Emu. The drawbacks of these languages that initially seemed a better choice than Emerald are also explained. It finally concludes the contributions of Emu that are different from prior work. • Chapter 3 provides the background for understanding the design of Emu. It discusses the capabilities of the L A N a i 4 processor that runs on the Myrinet NIC, and the Myrinet A P I available. It explains the choice of Pentium-II for the Myrinet testbed at U B C and the benchmarks that influenced its usage. It also provides some background about the Emerald programming language that is used as the framework for Emu. The original model of the Emerald interpreter is reviewed to understand the complexity of the changes made in Chapter 4. • Chapter 4 discusses the design and the implementation of Emu. There are two virtual engines, one on the host host-emx and the other on the L A N a i lanai-emx with the latter consuming most of the time spent in the construction of Emu. It explains the protocols designed to circumvent the limitations posed by the L A N a i environment, the communication protocols for the new Myrinet network, the complex mux/demux needed for Emu, the routing protocols, and modifications made to the host-emx to assist the lanai-emx with OS support. It also discusses some miscellaneous issues that arose during the design of Emu. • Chapter 5 discusses the performance of Emu. Micro benchmarks that were 8 run on Emu help analyze the slowdown on the L A N a i for computation and communication intensive operations. The slow down is because the network processor is slower than the host processor by a factor of 8. This is followed by a review of the application designed to show the robustness and usefulness of the Emu model, and the efficiency of such a system. Chapter 6 summarizes the thesis and provides suggestions for future work. 9 Chapter 2 Related Work This chapter discusses prior work related to the thesis. The first section reviews some existing migration systems, focusing on significant features relevant to this thesis. A couple of recent systems that migrate some fixed set of functionality to the network interface card are reviewed in Section 2.2. 
Executing code on the NIC is a considerably new approach and not much research exists in literature. Section 2.3 discusses the languages considered to build the virtual machine, including some active networking systems/languages as these inherently support mobility. It also draws out those features which make them an impossible choice to be used on the L A N a i adaptor. The chapter concludes with a summary of how this thesis is different from prior approaches. 2.1 Existing Migration Systems Process migration research was first undertaken in the late 1970s and a consider-able number of experimental systems have since been implemented and tested. To migrate a process from one machine to another, the following steps are taken: 1. Freeze the process at the source. 10 2. Move all the process' state to the destination host. 3. Resume execution from the point where it was stopped at the source. This section is a brief overview of process migration in homogeneous systems, object migration systems and some examples of heterogeneous migration systems. 2.1.1 Trad i t iona l M i g r a t i o n Systems Homogeneous migration is moving processes among machines of similar architecture. Most of the homogeneous process migration systems follow the general protocol, but with significantly different techniques used to address the following design goals. • How should the virtual memory be transferred from host to destination? • When should the migrant process' execution be stopped? • What degree of transparency should be achieved? • How should residual dependencies on the source, or any other intermediate hosts be reduced? • What are the different migration policies used, i.e., which process to migrate, when to migrate it, and where to migrate it? The V [Che88, TLC85] system developed at Stanford supports migration to allow off-loading of programs to idle workstations in order to increase available com-putational power. The V migration facility considers these three issues as vital to its design: (1) a network-transparent execution environment, (2) minimal interference due to migration, and (3) no residual dependencies on its previous host. In order to not increase execution time due to migration, V does pre-copying of the process' state while continuing execution. This leads to two or three copies of dirty pages before the host is finally migrated to the destination to resume execution. 11 The M O S I X multicomputer operating system [BSW89], developed at the In-stitute of Computer Science, the Hebrew University of Jerusalem, is an extension of Unix. It is designed for a network of workstations (NOW) [PCA95], and supports a comprehensive resource sharing policy. This policy depends on various factors such as load sharing, memory sharing and IPC optimization. Algorithms to calculate these factors are executed independently on each workstation. Later, they exchange results with algorithms from other workstations for information gathering and dis-semination. Migration decisions are made by this policy with partial knowledge about the state of the network, and do not depend on complete information of the state of the N O W , or any particular workstation. Such a migration system where decisions are not completely centralized, especially on a low latency network such as the Myrinet [BCF+95], scales well to large configuration. 
Charlotte [AF89], work done at the University of Wisconsin-Madison, makes distinct separations of policy (when to migrate which process to which destination) from mechanism (how to detach, transfer, and re-attach the migrant process). Their design of keeping the policy (in the user space) separate from mechanism (in the kernel space) allows various efficient and flexible policies like load sharing, load balancing, etc., to be used. This also obviates the need to recompile the kernel each time the policy is changed. While M O S [BL85] and Locus [WPE+83] freeze processes when selected for migration, and V freezes near the end of transfer, Charlotte freezes somewhere in between (only when the context is marshalled and transferred). Process Migration in Demos/MP [PM83] was implemented efficiently with no system changes due to such characteristics as a uniform and location independent communication interface, and the participation of the kernel in sending and receiving operations in the same manner as a normal process. It is a message-based operating system (communication is the most basic mechanism) with messages being sent using links to specify the receiver of a message. So when processes migrate, some level of transparency is achieved using message forwarding and updating of links. Unlike V 12 and Charlotte, execution of processes stop when they are chosen for migration. The bottleneck in process migration is the transfer of the possibly large virtual address space of the process being migrated. To alleviate this problem, Accent [Zay87] uses a lazy transfer copy-on-reference approach. It uses different variations of virtual memory copy (a) Pure-copy (complete physical copy of the address spaces), (b) Pure-IOU (logical transfer of address space, which requires only portions of the address space to be physically transferred), and (c) RS (moving resident set during migration). Although Pure-IOU and RS hold a clear advantage in the address space transfer of migration, they suffer longer remote execution times when compared to Pure-copy. Hence only processes which access a small portion of their address spaces at the new site are best suited to the use of the copy-on-reference technique. Sprite [Dou91], developed at the University of California, Berkeley, migrates processes at two particular times: (a) exec system call when a resource-intensive program is about to be initiated, and (b) eviction of foreign processes when user returns to his workstation. This is one of the ; few migration systems to receive extensive practical use besides Locus [WPE+83] and M O S I X [BSW89]. Sprite's designers considered trade-offs among transparency, residual dependencies, perfor-mance and complexity in the design of their system. Sprite uses a different form of lazy copying from Accent which also makes use of its existing network services. Backing storage for virtual memory is implemented using ordinary files which are stored in the network file system and are accessible throughout the network. When a process is migrated, the source freezes the process, flushes the dirty pages to the backing files and discards its address space. The process then resumes execution at its destination host with no resident pages, and pages are loaded from backing storage as needed using the standard paging mechanism. Condor [LTBL97] is a distributed batch processing system for Unix developed at the University of Wisconsin-Madison. 
Condor makes use of idle workstations by 13 migrating processes via checkpointing of their state [LA89, LS91]. The mechanism used is to write all of the process' state information (data, stack, information about open files, pending signals and C P U state) into a file, and to use this information to restore state on the destination machine at restart time. Sprite and V have been carefully designed and implemented to accommodate migration, which is not the case here. Hence not all processes can be migrated in Condor. For example, as there is no way of saving all the state necessary for every, kind of process, there is no way of migrating a set of communicating processes. A survey of systems providing process or object migration facilities is dis-cussed by Mark Nuttall in [Nut94] and a more detailed survey is presented by Milojicic et al. in [MDP + 96] . Some of the above systems would have been ideal choices if not for the two constraints: (a) the processes are too big to fit within 1MB of memory and (b) they do not support heterogeneity. The 1 Mbyte memory limit of the L A N a i network interface narrows down the choice of the system to use. 2.1.2 Object M i g r a t i o n Systems/Languages This section discusses systems that migrate objects instead of processes. A n object is usually a part of a graph of references. Hence when an object is to be migrated, one can either move just that object, some set of objects in the graph, or the entire graph. Moving only the object in question may not be appropriate. By moving an application specific set of objects, unnecessary remote invocations can be avoided. A common problem of these systems is determining what comprises the persistent state of the object so it can be encapsulated before migration. The most typical solution is to provide a description of each object's interface using an Interface Definition Language (IDL), as used by D C E + + [SM], or use routines to marshall an object's state into a buffer. However, in spite of the various research systems, no object migration system has achieved the commercial acceptance as that of a few process migration systems. 14 Emerald [BHJ+87, JLHB88, RTL+91, BHJL86], discussed in chapter 1, has the advantage of being designed to support mobility unlike other systems which have had migration added to them after the fact. It also has additional benefits over process mobility because the overhead of an object is commensurate with its complexity. Hence mobility in Emerald provides a relatively efficient way to transfer fine-grained data from source to destination. Emerald has been designed for a L A N of not more than 100 nodes. Emerald's designers used templates instead of tag bits (used in Distributed Smalltalk [SB88]) to distinguish data types, such as integers, from object references which make mobility easier. Performance of mobility was traded-off in favor of fundamental operations like local invocations. However, it is a very efficient design which explains the reason for its techniques being adapted in many of the systems discussed below. The D O W L [Ach93] distributed object-oriented language, developed at the University of Karlsruhe, Germany, adopts most of Emerald's mechanisms: • Mobility of objects is restricted by fixing (move an object to another machine and fix it there), and unfixing them (the object is free to move again). 
• Attach objects; by using the reserved word attached in front of the closely related data of the moving object, all the attached objects also move to the destination site of the moving object when it migrates. Attachment is transi-tive but not symmetric. • Besides passing-by-object-reference, support for pass-by-move (move argu-ment object permanently to the invoked object's location) or pass-by-visit (move argument object temporarily to the invoked object's location) reduces the number of remote invocations. If one of the object's methods is activated, it can still migrate by moving the in-vocation context along with the object. Execution resumes at the target node and operation results are transparently delivered to the caller. 15 D C E + + [SM], developed at the University of Karlsruhe, Germany,' is an extended distributed object-oriented environment on top of D C E (The OSF Dis-tributed Computing environment which is becoming an industry standard for open distributed computing). It supports mobility based on concepts introduced by Emer-ald [JLHB88], Amber [CAL+89], Arjuna [SDP91] and Amadeus [HC91]. Migrated objects can be accessed in a uniform way by clients, and concurrent migration and invocation requests are synchronized. A migrating object leaves a proxy at its for-mer location which results in forwarding chains of proxies when it gets invoked. The location of the object gets updated on the proxies upon stepwise return of the call. However, the birth-node of the object is always updated upon every migration and is used in case the forwarding chain breaks. Comet [MV93], developed at Katholieke Universiteit Leuven, Belgium, uses a multi-dimensional migration approach. The designers of Comet consider that location is not the only attribute whose change in value is worth considering during migration. Other attributes such as state representation, access limitations or failure handling semantics also change during the lifetime of an object. Migration of objects in Comet is considered to take place in an N-dimensional space where N is the number of different attributes or flavors, and hence object migration is expressed in terms of coordinate changes along each of the axes. Cool [LJP93], is an object-oriented kernel implemented above the Chorus [RAA + 88] microkernel which provides mechanisms to instantiate, migrate and store objects. The designers of Cool have adopted a layered approach. It is implemented as three layers (COOL-base, generic runtime (GRT) and language specific runtime) above the Chorus microkernel. It was designed with an aim to provide a generic set of abstractions for better support between existing and future object-oriented systems, languages and applications. It also supports passive objects (persistent memory) and activities (threads of execution). Activities can migrate from one object to another, and objects can contain several concurrently executing activities. 16 A survey of object mobility is given in [Nut94]. Many of these systems merge two or more high level language calls into a single feature (for example, the visit primitive in Emerald is a combination of locate and move), thereby reducing the probability of errors (as the underlying system takes care of the details) and the number of remote invocations. 2.1.3 Heterogeneous M i g r a t i o n Systems There are some systems supporting migration of processes among processors of varied architectures or those that run different operating systems. 
In addition to load or resource sharing, migration among different architectures has the advantage that some computations are quite efficient in certain specific architectures. The process-originated migration system [Shu90, DRS89], developed at the University of Colorado at Colorado Springs, is a prototype built on the existing homogeneous migration features of the V system. Their approaches to eliminate many portability problems by having a compiler resolve the intricate migration issues. Since a process requests its own migration, the request occurs at a reasonably clear point in its execution. In addition to modifications to the linker and the compiler (which take care of the translation of text, heap, data, stack, process control block, etc.), the operating system was modified to include a migration server to satisfy the process-initiated migration requests. Their data message is simplified because at the time of migration request, all the needed state information is in memory, and none is in the machine registers of the source machine. The designers of the system have traded complexity of representation, and overhead at migration or page fault time, for pure runtime speed. Heterogeneous Process Migration by recompilation [TH91], introduces the concept of a source level program being automatically created, transferred to the destination machine; and then recompiled. When the program is executed, it re-stores the state of the process at the destination and then continues the migrated 17 process' execution from the correct location. This approach is similar to that of the previous systems in that it deals with compiled programs rather than with inter-preted source programs. However, unlike the previous case, the details of machine and compiler-dependent data translation procedures are hidden in the compilers and in the debuggers by employing recompilation techniques. The main disadvantage is that migration cost is greatly increased because of source code compilation. The Emerald system [BHJ + 87 , JLHB88, SJ95], supports fine-grained object and thread mobility among heterogeneous computers with threads executing native code. In Emerald, threads follow objects as they move around. The. translation of machine-dependent code and data are dealt with efficiently by making changes to the compiler. These changes enable the runtime system to transform the machine- : dependent code or data into a machine-independent format. However, translation of differently optimized versions of the code running on different architectures has: not been dealt with.. This is because the latter requires the runtime system to be able to invoke parts of the compiler at runtime. Migration of active, threads was' implemented with the use of bus stops [SJ95]. A bus stop represents a program counter value that is the same for all processor types, and hence can be considered as a safe migration point. This makes the code generator free to optimize all code between these bus stops. The Heterogeneous Migration Facility (HMF) [BVW95], developed at the Department of Computer Science, Dartmouth College, requires explicit registration of the data to be migrated. The migration library should be linked with the migra-tion program in order to provide procedures for automatically migrating processes and checkpointing data. Their system is based on a graph model for single process migration. 
Migration costs are based on the vertex and edge weights of the graph that require complex analysis of many factors, and this is a serious disadvantage to this model. The Tui system [Smi97], is different from the above systems in that in all the 18 previous cases, the program must be written in a type-safe language or in a type-safe subset of a language. Tui supports heterogeneous migration of common languages that are not type-safe. It also allows an external agent to make a migration request in addition to process-originated migration, which is the case in most prior systems. While most of the above systems incorporate the marshalling code into the migrating process itself, Tui is a completely separate program. 2.2 Moving Functionality to the Network Adaptor Research on migrating functionality to the network adaptor in order to increase processing speed, reduce round-trip latencies, etc. began in 1980s. The network processors supported complete implementations of protocols, such as TCP/IP, that were not provided by the operating system. But as operating systems adopted these protocols, the functionalities were dropped from the NIC. This was because the marginal improvements in reducing host overhead did not justify the high cost of running these protocols on the network processor with lOMbits/sec Ethernet. Most of the work done was commercial or proprietary, and hence not available in literature. But with the birth of research into high speed networks, these ideas were revisited. The following two sections discuss work recently done in this area. 2.2.1 U-Net With the advent of A T M [dKdP88], von Eicken et al., from the Department of Computer Science at Cornell University, developed U-Net [vEBBV95], a user-level network interface for parallel and distributed computing. The goal of U-Net, im-plemented using off-the-shelf A T M hardware, is to remove the kernel from the com-munication path while providing full protection, so that user processes are only limited by network capabilities. It provides a virtual view of the network interface to the user processes so that each process has the feeling that they own the network adaptor. 19 By removing the kernel from the communication, an extra memory copy is avoided when sending or receiving messages from the network. This is because user process memory can be mapped to the network adaptor's memory and direct D M A can access user memory locations. U-Net not only supports traditional inter-networking protocols like T C P and U D P but also novel ones like Active Messages. The T C P and U D P protocols implemented using U-Net have latencies close to the raw minimum, and throughputs close to the raw maximum. U-Net active messages, a prototype of the Generic Active Messages (GAM) [ea95], have round-trip times which are only a little above the absolute minimum. The U-Net model explained above was extended further, by von Eicken et al., into a U - N e t / M M architecture [BWvE97]. This allows memory to be copied from and to any part of the application's address space. In other words, memory is not physically pinned down as in the previous case. By integrating a transla-tion look-aside buffer in the network adaptor, which coordinates with the operating system, network buffer pages are pinned and unpinned dynamically. This archi-tecture has been demonstrated using both A T M (PCA-200/Linux implementation) and Ethernet (DC21140/Windows NT implementation). 
While the former performs both address translation and mux/demux of the messages in the network adaptor, the latter relies on the host for both functions. Their evaluation leads to the conclu-sion that while performing mux/demux of messages on the network interface avoids buffer copies on receive, it is better to perform address translation on the host to reduce communication latency and complexity. 2.2.2 Hamlyn Hamlyn [BJM + 96] , developed at Hewlett-Packard laboratories, is a sender-managed interface architecture optimized for clustered computing using Myrinet. The latency (both hardware and software) is just a few microseconds and provides full protection among the different applications with little or no OS intervention. Being a sender-20 managed design, messages are only sent when the sender knows that there are sufficient buffers at the receiver sites. So it needs to maintain some information about memory space availability at the destination, taking into account buffer over-runs and retransmission under heavy loads. To achieve low latency, direct access to the Myrinet interface is given to the different applications with no OS intervention. So applications can D M A memory between E B U S (External-access Bus) and L B U S (LANai Bus) without any interfer-ence from the kernel. To achieve high bandwidth, Hamlyn supports a scatter-gather direct memory access (DMA) capability that frees the host for other operations dur-ing long transfers. Since Myrinet [BCF +95] has very low (almost negligible) error rates, flow control, and a physically secure network, messages need not be held for retransmission and packet acknowledgements are unnecessary. However, in these systems only a fixed set of functionality has been ported to the network interface, with no techniques to extend them at runtime. Unlike these systems, Emu supports a full-fledged virtual machine running within the network adaptor, thereby allowing extensions at runtime. This is believed to be the first such virtual machine on a network interface that can mux/demux messages, D M A memory, build objects, execute them, etc. within 1MB of the network interface memory with no host intervention. 2.3 Other Languages Considered for the Interpreter Prior to choosing Emerald as the basis of Emu, other languages were studied. Ini-tially, Java seemed to be the ideal choice since it is popular and commercially ac-cepted. Another venue of thought was choosing an active networking language like P L A N . Such a language would support mobility, and thus can be easily modified for service migration. The following sections briefly discuss the advantages and the disadvantages of each of the languages considered: 21 2.3.1 Java - Language of the Internet The features of Java [Boo96, BMKK97] that are advantageous for mobility are (a) its excellent exception handling mechanism, (b) it is object oriented, (c) it provides support for encapsulation of objects, (d) it has no pointer references, and (e) it provides marshalling and un-marshalling routines. Unfortunately Java's interpreter is 50 times slower than C. It does not provide migration primitives or compiler runtime support on different architectures. For active networks, Java is probably much too bulky, as the virtual machine comes with a much larger overhead than C. Hence, the advantage of the gigabit network might be lost. But the main reason for not choosing Java is the massive size of the code, and the high rate of increase of its code size. 
It would have been almost impossible to fit the code including heap and stack size in 1MB of L A N a i memory. Even removing unnecessary classes might not have solved the problem. 2.3.2 T C L T C L [Ous93], a prototyping language, supports many of the features of conven-tional procedural languages, including variable assignment, procedure calls, control structures, and in addition it has very easy access to graphical widgets. T C L is less strict in its handling of data than many conventional languages, and does not provide the data hiding or modularity features of languages with strong support for abstract data types, such as M L , Modula2, Modula3, or Ada. Thus it may not be an appropriate language in which to write large and complex programs, which are required to be highly reliable and robust. 2.3.3 Existing Active Networking Systems/Languages The concept of active networking involves moving executable code around the net-work, unlike the traditional method of moving just passive data. This code can be executed at various routers by running an interpreter on them. Recent work on 22 active networking includes [TSS+97, TW96, BCZ96a, BCZ96b, TGSK96, WT96]. Active networks allow users to inject customized programs into the network. These programs can help in dynamic modification of the network. For example, active programs can assist routers and gateways in making intelligent routing decisions based on the network traffic at a particular time, as well as dealing with failures due to network partitioning without manual intervention. Languages for active networking include Telescript [tel96b, tel96a, tel96c, tel96d], P L A N [GHK+97, Kak97, MH97, Moo97], Wave [Sap95, Sap88], Messengers [Bic95, FBDM96] , and others. Telescript Everything in telescript [tel96b, tel96a, tel96c, tel96d] is an object. It is an inter-preted language and has two levels : high telescript has an object oriented syntax and compiles to low telescript which has a postfix syntax for stack-based interpre-tation. Telescript supports preemptive prioritized multi-tasking of process objects (which is an object with a life of its own), migration, "meeting locations" for agents, persistent objects and authentication. Telescript's source code could not be obtained easily for experimentation owing to the fact that it has become a commercial product. Odyssey, its successor, is composed of Java class libraries to support distributed applications, and hence poses the same problems as mentioned in section 2.3.1. P L A N P L A N [GHK+97, Kak97, MH97, Moo97] is a script-like functional language similar •to M L . It is designed to satisfy the following two functions in an active network: (i) Resource discovery: the location, invocation, or installation of services, and (ii) Diagnostics: analysis of network status or behavior. P L A N consists of two levels. The P L A N level which is the lower level provides services for the above two functions, 23 and consists of P L A N programs which are distributed to all P L A N routers. The S E R V I C E level, which is the higher level, includes services that are available only to certain routers. Therefore, it may require authentication, unlike services from the lower level, for their use. It usually includes a rich collection of protocols which provide functionality for an active network. Programs are carried in the packets and executed at the different routers. 
However, P L A N packets do not communicate with each other directly as in Tele-script, but only through application programs at endpoints or service calls on the routers. Although P L A N was successfully installed, it could not be ported to the Myrinet. The P L A N interpreter grows to over 6 M B (6776KB) of memory to handle simple programs such as ping and traceroute, and would clearly grow with more complicated services. This was because current P L A N implementations are built on top of Java or Caml, and the memory limitations of the L A N a i adaptor (1MB) make such sizes unacceptable. 2.4 Contributions of this Thesis Emu has made possible the running of a heterogeneous object-oriented distributed virtual machine on a NIC. The interpreter running within the L A N a i network adap-tor supports all the features of a regular virtual machine running on the host. In Hamlyn and U-Net, functionality migrated to the network interfaces are statically fixed, unlike Emu which can be extended at runtime. Emerald, an object oriented migration system was selected as the framework due to the limitations posed by the other migration and active networking systems discussed. The L A N a i being a different environment, makes the homogeneous and het-erogeneous systems discussed unsuitable for running on the network processor which is slower than the host processor by a factor of 8. Homogeneous process migration systems have introduced the fundamental concepts of migration, but these systems are course-grained because the processes are.big and their state cannot be easily 24 marshalled as is the case with objects'. It would be impossible to run one of the systems within 1MB of R A M . Also, these systems do not support heterogeneity and the L A N a i is quite unlike regular processors with a very unfriendly programming environment. In the case of the discussed heterogeneous migration systems, the limitation is the very slow recompilation times which would reduce throughput in a gigabit network. The usefulness of such a network would be lost. The active networking systems, which provide the support for migrating 1 functionality to the NIC, have the wrong languages for the L A N a i adaptor. The memory limitations of the .LANai adaptor force the virtual machine to be small and optimized such that its code, stack and heap size can run within 1 M B . For example, P L A N did not satisfy this constraint. Moreover, a case study of P L A N implied that a more efficient active networking scenario can be deployed with a much simpler distributed language. Java's main disadvantage was its ever-increasing code size, which would have otherwise made it the first choice due to its wide usage. C and C++ provide no support for mobility, marshalling or unmarshalling routines, and have to be programmed at the socket level, etc. These fundamental functions were inherently provided by the Emerald language which made it a better choice. The Emerald virtual machine has been successfully ported and it runs within 1MB of memory. It is object-oriented which makes migration easier and it supports heterogeneity. Although Emu has been developed using techniques discussed in the research above, it is believed to be the first system on a network interface card supporting functionality which can expand at runtime. 25 Chapter 3 Background Emu was implemented for the Myrinet network interface cards. 
A review of the network hardware, and the software interface between the L A N a i and the host is required to understand the design issues of Emu. The Emerald language which is used as the framework for Emu is also reviewed. The following is the layout of this chapter. Section 3.1 reviews the L A N a i 4 processor and the Myrinet A P I . Section 3.2 discusses the various host processors considered for the Myrinet network. This is followed by the results of the experimental benchmarks to support the choice of Pentium-II for the Emu system. Section 3.3 explains the Myrinet layout, the Myr ine t /PCI interface card, the Dual Myrinet switches and the cable support. Sec-tion 3.4 discusses the language support for mobility followed by a background on the Emerald interpreter design and implementation, and Section 3.5 summarizes.-3.1 Myricom Technology and Details Myrinet [BCF + 95] is a local-area network (LAN) designed for packet communication and switching within massively parallel processors (MPPs). The characteristics that distinguish Myrinet from other L A N s are (a) high data rates owing to its full-duplex 26 pair of 640Mb/sec channels unlike the uni-directional Ethernet channels, (b) very low error rates (almost negligible), (c) unlike unswitched Ethernet and F D D I LANs , the Myrinet topology is scalable and the aggregate capacity grows with the number of nodes, (d) the regularity of its network topology avoids deadlock because it allows for simple algorithmic routing, (e) it uses flow control and error control on every link, and (f) it uses low-latency cut-through switches. Various parts of the following subsections contain details, about L A N a i processors and host /LANai interface sup-port, which were taken from the Myrinet guide [Myr97a] and the Myricom website ("http://www.myri.com/"). 3.1.1 L A N a i 4 Processor Each network interface card contains a L A N a i 4 processor, which is a programmable communication device that provides the interface to the Myrinet L A N or S A N net-work. As shown in Figure 3.1, it consists of an instruction interpreting proces-sor and a packet interface which together make up the L A N a i core, as well as a DMA/Checksum engine, the Myrinet-link interface and the host/IO interface. The arrows in the figure indicate the flow of data between components. The L A N a i processor [Myr97a, B C F + 9 5 ] is a 32-bit dual-context machine, with 24 general-purpose registers. There are two contexts. The user context can be interrupted by events that cause the processor to switch to the system context. The system context is uninterruptable but it can be explicitly switched to user context by executing a punt (assembly) instruction. The three special registers : (a) Interrupt Status Register (ISR), (b) Interrupt Mask Register (IMR), and (c) External Interrupt Mask Register (EIMR), control interrupt generation. Packet-interface special registers and interrupt-control special registers are all memory-mapped except for IMR. A l l the memory-mapped registers can be accessed by both the L A N a i proces-sor and the external-access bus (EBUS). I M R is an internal on-chip L A N a i register 27 32-bit fast static SRAM DMA/Checksum Engine Processor Data T Bus Myrinet-Link Interface Packet Interface Programmed I/O LBUS LANai chip EBUS ( Host I/O Interface ) -•0= Myrinet Port Timing and Control Signals Logic peculiar to the bus Figure 3.1: Block Diagram of the Myrinet/Host Interface and hence not accessible from the E B U S . 
The L A N a i bus (LBUS) operates at twice the chip-clock speed, i.e., two memory cycles for every clock cycle. The addresses of both L B U S and E B U S are byte addresses and the byte order is big-endian. Words (32-bits) and half-words (16-bits) must be aligned on the L B U S . The L A N a i ignores least-significant bits of the address that would make it non-aligned. The L A N a i chip provides bi-directional access to the Myrinet network. The packet interface handles the injection of packets into the network and their consump-tion from the network. Special memory-mapped registers are to be used to access the packet interface. On the host side, the L A N a i core cannot access the E B U S directly and the processor has to initiate data transfer between L B U S and E B U S . In a typical mode, the L A N a i processor with the memory on its L B U S appears as a synchronous memory block from the E B U S . But, when the D M A engine starts the data transfer between the L B U S and E B U S , the chip becomes the master and 28 E B U S the slave. The E B U S is simple and needs extra hardware to connect to any standard bus such as PCI . The processor can compute partial Internet checksums and has a special ChecKSum (CKS) register which gets modified during E B U S - L B U S D M A trans-fers. It also has two real-time counters: (a) real-time clock which gets incremented every time-reference period (1 microsecond), and (b) interrupt timer which gets decremented on every time-reference period. It also has miscellaneous special reg-isters such as the T I M E O U T register (which specifies the timeout period of the L A N a i watchdog timer), M Y R I N E T register (its bit 0 NRES-ENABLE indicates whether the board is to be reset, and bit 1 CRC-ENABLE indicates whether C R C verification is to be performed on the incoming packets), etc. 3.1.2 Myrinet API The Myrinet A P I provides functions required for programs that run on the L A N a i . The L A N a i libraries are to be initialized by certain function calls prior to their use. The topics that are discussed in the following sections are: • Myrinet Control Program • Channel • Addresses • Resets to the L A N a i boards . • D M A • Sending/Receiving Messages These are discussed with the Myricom standard M C P as an example. The two topics, channels and sending/receiving messages are very specific to the Myricom M C P . The remaining issues are relevant to other M C P s although the details might not exactly conform to them. 29 M C P A program that runs on the L A N a i chip is called a Myrinet Control Program (MCP) . An M C P is not burned into the L A N a i chip. It is loaded from the host into the L A N a i memory using the lload Myrinet tool. The lload program uses the lanaiJoad-and_reset() function from the Myrinet A P I library. The Myrinet A P I must be compiled with the L A N a i code to be loaded into the L A N a i S R A M . The device driver performs the following to load an M C P into the L A N a i using the Myrinet A P I : 1. Reset the L A N a i with lanaLres'et.unit(). 2. Set the clock value with lanaiset-clock^value(). 3. Load zeros into the entire L A N a i memory. 4. Load M C P with myriApiLoadLanaiQ: 5. Set the M C P address with myriApiSetAddress(). 6. Set D M A burst size with myriApiSetBurst(). 7. Release L A N a i from reset mode by calling lanai-reset-unit() again. An M C P can do general purpose computing, but standard M C P s typically: • Receive messages from the network and send it to the local host. 
• Send messages from the local host to the network. There are some standard M C P s that only provide these services. Examples of such M C P s are the standard Myricom M C P , and the "Simple Message Protocol" (SMP) [SHDM96] from Mississippi State University. There are other M C P s which provide more functionality than just the basic service of flow control. One such example is the BullDog M C P [HDMS97, HDS97], 30 also from Mississippi State University, which provides different protocols for or-dered/unordered reliable message delivery. But its functionality is limited to reliable ordering and delivering of messages. Channel A channel is the software interface between the host and the M C P . Each channel consists of three queues: (a) Send queue, (b) Receive queue, and (c) Receive Ack queue. The number of channels in the M C P , which is configurable depending on the kind of L A N a i board, can be obtained using the myriApiGetNumChannels() A P I function. Channels are numbered starting from zero which is always used by the T C P / I P driver. Any channel should be used only by a single host process. Addresses of LANai Boards/MCPs Every L A N a i board has a unique 48-bit address written into its E E P R O M which the host relays to the M C P using the LANaiDevice function calls. The M C P messages use 64-bit addresses which is of the format 0:0:a:b:c:d:e:f where a:b:c:d:e:f is the 48-bit address burned into the L A N a i board. The static routing table stored in the L A N a i converts the 64-bit address into a route of the path to be traversed. If the highest bit of the M C P address is set, then it is a multicast address. Otherwise, it is a regular address. When the highest bit is set, the message is delivered to all L A N a i boards connected to the Myrinet network. If it is a regular address, the Myrinet address and the channel number in the header are to be considered. On reaching the L A N a i board with the specific Myrinet address, the message is delivered to the process on the host that holds the specified destination channel. Resets to LANai Boards To avoid deadlock, the L A N a i boards can be reset at anytime. There are two kinds of reset: (a) Hardware reset - this can either be generated by the host via 31 LANaiDevice library functions, or it can be a forward reset generated by the Myrinet network itself, (b) Software Reset-- these are generated by the host by calling the myriApiReset() function in the A P I . While hardware resets cause the M C P to be restarted, a software reset only resets a single channel. After either reset, the M C P needs to re-establish communication with the host via hand-shaking. The host also needs to restore t h e M C P channel's list of multicast addresses, reclaim any buffers held by the M C P at the time of the reset, and set up the channel queues. DMA There are two D M A s to consider when one host communicates with another host via the Myrinet network. The D M A between host memory and L A N a i memory is achieved with the help of the D M A engine in Figure 3.1. The host D M A pointers must be aligned on a 4-byte boundary, and must be in the memory space specific to the particular operating system they are running on. To be considered as valid,: the host D M A pointers have.different requirements on different platforms and op-erating systems. For example, a 32-bit-aligned kernel virtual address is considered as valid for a Pentium running Linux 1.2.x. Using the same address on a different architecture may lead to problems. 
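Before a host buffer is handed to the DMA engine, the driver can screen it against the alignment rule just mentioned. The one-line check below is only an illustration of the 4-byte requirement; the function name is not part of the Myrinet API.

#include <stdint.h>

/* Illustration of the rule above: host DMA pointers must fall on a
 * 4-byte boundary before the LANai DMA engine may use them. */
static int host_dma_pointer_aligned(const void *p)
{
    return ((uintptr_t)p & 0x3u) == 0;
}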
During the D M A , the memory must be locked so it is otherwise not disturbed. The second D M A between the L A N a i memory and: the network is carried out with the help of the Packet Interface in Figure 3.1. The D M A s are started by the M C P executing on the network processor. Message Sending/Receiving The Send queue is a fixed size circular buffer of length NUMSENDS. When the hosts wants to send a message, it places it on the Send queue. The M C P removes the message and sends it over the Myrinet network. At any time, there can be only NUMSENDS messages in the Send queue. If the Send queue is full, the host has to wait until the M C P has removed a message from the Send queue and has sent it 32 over the network. Receiving messages is more complicated than sending messages since there are two queues involved: Receive queue and Receive Ack queue. When the Myrinet adaptor receives a message from the network, it D M A s the message into a single . buffer (if the entire message can fit in the fixed-size buffer) or into multiple buffers in the Receive queue. Then it increments the Receive Ack queue. The host, on noticing the Receive Ack update, figures which group of Receive queue buffers contain the new message. It D M A s the message from the buffers, and puts the empty buffers back in the Receive queue. If either the Receive queue is empty or the Receive Ack .queue is full, the M C P drops messages from the network. 3.2 Benchmarks Used for the Choice of the Host Pro-cessors Several CPUs were evaluated for the Myrinet testbed. The ideal processor should not starve either the C P U or the : Myrinet D M A engine of the P C I bus, but yet produce optimal throughput. The following four processors were considered for the choice of host machines: 1. 233MHz Pentium-MMX, 128MB R A M , 256KB cache, with Intel 430TX chipset and ASUS motherboard. 2. 200MHz Pentium-Pro, 128MB R A M , 256KB cache, with 440LX chipset and ASUS motherboard. 3. 266MHz Pentium-II, 128MB R A M , 512KB cache, with 440FX chipset and Intel motherboard. 4. 233MHz K 6 - A M D , 128MB R A M , 256KB cache, with 430LX chipset and ASUS motherboard. 33 Four benchmarks were used to choose the most ideal processor to build the high-speed gigabit network. The following sections discuss the relevance of each benchmark run, with the experimental results obtained on the four processors. 3.2.1 Results of 6w_mem_cp and hswap Benchmarks The terminology bwjraem-cp stands for bandwidth memory copy. It measures how fast the C P U can copy memory. It was implemented using bcopy of the lmbench test suite. The lmbench test suite is a system performance tool which can be obtained from the FreeBSD benchmarks ftp site. hswap measures how fast the Myrinet card can read or write memory with its D M A engine. This benchmark comes with the Myrinet software. Table 3.1 shows the memory bandwidth measured on each of the processors by bw-mem^cp, both when they are otherwise idle, and when the network device (Myrinet adaptor) is hammering on the PCI bus. The test runs are explained below: 1. bw_mem_cp, no competition: this is performed by just running bwjmem^cp with the hswap benchmark idle. By running this, the speed of the system's memory can be determined. 2. bw_mem_cp during E2L phase of hswap: bwjmem-cp is first started and later hswap is started. When hswap goes through the E2L (host to LANai) phase, the value of bwjmem-cp is reported. 3. 
bw_mem_cp during L2E phase of hswap: this is similar to the second test, except that the value of bwjmera-cp is reported during the L2E (LANai to host) phase of hswap testing. 4. hswap b/w for 8k transfers, no competition: hswap bandwidth results are reported during E2L and L2E phases without running bwjmera.cp. 34 Test Pentium-MMX Pentium-Pro Pentium-II K6-AMD in MB/sec in MB/sec in MB/sec in MB/sec bw_mem_cp, no competition 49.5 53 51 47 (test 1) bw_mem_cp during E2L phase of hswap (test 2) 5.53 24.34 23.78 8.75 L2E phase of hswap (test 3) 11.65 37.35 36.94 6.54 hswap b/w for 8k transfers, no competition (test 4) E2L phase 112.2 111.2 121.4 111.5 L 2 E phase 114.0 12L1 121.0 122.1 hswap b/w for 8k transfers while competing w / C P U for memory (test 5) E2L phase 112.2 50.7 49.9 111.4 L 2 E phase 111.9 40.7 38.7 113.1 Table 3.1: Results of bwjmem.cp and hswap Benchmarks 5. hswap b/w for 8k transfers while competing w / C P U for memory: hswap band-width results are reported during E2L and L2E phases while also running bw.mem-cp. A l l four processors have approximately the same processor speeds (test 1), but Pentium-II has the highest D M A bandwidth when considering both E2L and L2E phases (test 4). On combining bw_mem_cp/hswap, both Pent ium-MMX and K 6 - A M D maintain the same D M A transfer bandwidth which is around l l l M B / s e c (test 5). But in both the processors, the Myrinet card starves the C P U leading to very low (in the range of 5-UMB/sec) processor bandwidth (tests 2 and 3). On the other hand, both Pentium-Pro and Pentium-II have significant de-crease in their D M A transfer bandwidths, which is around 50MB/sec (test 5), when compared to about 120MB/sec in test 4. This is because the Myrinet card does not starve the C P U of its cycles, unlike the tests with the Pentium-MMX and K 6 - A M D processors. The processor speeds decrease only to about 30MB/sec (test 2) when compared to 52MB/sec in test 1. Pentium-Pro and Pentium-II have both good processor and D M A band-35 Test Pentium-MMX Pentium-Pro Pentium-II K6-AMD in MB/sec in MB/sec in MB/sec in MB/sec bw_mem_rd 8 M B 148.98 228.15 219.86 188.77 bw_mem_wr 8 M B 84.15 82.29 : 74.05 77.84 bw_mem_cp 8 M B unrolled aligned 49.93 47.20 44.30 46.71 libc aligned 46.37 51.39 53.25 47.14 unrolled unaligned 49.91 46.78 44.90 45.24 libc unaligned 33.39 50.70 52.84 45.66 Table 3.2: Results of lmbench Test Suite widths when running independent of each other, and also reasonable bandwidths when competing with each other. Pentium-II has slightly better D M A transfer bandwidth than Pentium-Pro. However, both Pent ium-MMX and K 6 - A M D fail to deliver balanced performance, when the C P U and Myrinet D M A compete with each other. 3.2.2 Results of lmbench Test Suite This system measurement tool was used to measure how fast the cpu can read, write and copy 8MB of memory. Table 3.2 gives the bandwidth in MB/sec for different tests on the processors. bwjmemjrd and bwjmem^wr measure the bandwidth for reading and writing 8MB of memory, bw.mem^cp is used to measure (a) unrolled (simplistic) and aligned data copy, (b) library (general) and aligned data copy, (c) unrolled (simplistic) and unaligned data copy, and (d) library (general) and unaligned data copy. Aligned means that the source and the destination are aligned to page boundaries and not that the pointers are word aligned. Both Pentium-Pro and Pentium-II have the highest read bandwidths. 
A l -though Pentium-II has slightly lower write bandwidths, it has almost the same copy bandwidths in all four cases as the other processors. In fact, in the libc case, it has the highest copy bandwidth (around 53MB/sec) in both the aligned and unaligned 36 cases. 3.2.3 Results of netperf Benchmarks The netperf tool measures how fast a network application can run when stressing both the C P U and the Myrinet driver. It makes use of both memory copies (for example, the uiomove() in the socket layer), and the Myrinet card (via myristartf) or checkjreceiveQ functions in the Myrinet driver). In short, it combines the testing of bw-mem-cp and hswap when sending data over the network. The sender sends data from its memory which is D M A e d to the L A N a i memory where the M C P pushes it to the network. On the receiver side, the data is D M A e d into L A N a i memory following another D M A into host memory, to the particular process. In Tables 3.3, 3.4, and 3.5, N/A implies that particular experimental result is not available because more than one processor of certain processor types were unavailable for testing. U D P - S T R E A M Testing Using netperf Tables 3.3 and 3.4 are both results for a U D P _ S T R E A M (data are sent in blocks over the network without waiting for any acknowledgement) with the latter repeating the same experiment after turning off the checksums. The experimental setup and the terminologies for tables 3.3 and 3.4 follow: • Socket Buffer Size = 41600 bytes. • Message Size = 8192 bytes. • M_Sent: Number of messages that the sender sent to the receiver. • M_Recv: Number of messages that made it all the way up to the user process listening on a socket on the receiver. It is likely that more messages were received into the receiver's kernel than were delivered all the way up to the user process. 37 • MS_BW: Throughput for M_Sent case in Mbits/sec • M R _ B W : Throughput for M_Recv case in Mbits/sec Receiver K6-AMD Pentium-II Pentium-Pro Sender K 6 - A M D M-Sent 31116 31043 M_Recv N / A 31109 31043 M S _ B W 203.89 203.42 M R . B W 203.85 203.42 Pentium-II M_Sent 53196 52728 53176 M_Recv 24789 52706 53124 M S - B W 348.58 345.53 348.17 M R _ B W 162.44 345.38 347.83 Pentium-Pro M_Sent 40335 40260 M_Recv 40279 40219 N / A MS_BW 264.11 263.68 M R . B W 263.74 263.41 Table 3.3: U D P _ S T R E A M Testing Using netperf From Table 3.3, Pentium-II is the fastest sender, about 347Mbits/sec, when compared to the sending speed of K 6 - A M D which is around 203Mbits/sec or Pentium-Pro which is around 264Mbits/sec. When considering the receiving bandwidth, Pentium-II and Pentium-Pro receive at the same speed (around 203Mbits/sec) when K 6 - A M D is the sender. Pentium-Pro and Pentium-II both receive at about 346Mbits/sec, but at twice the rate of K 6 - A M D receive bandwidth when Pentium-II is the sender. Pentium-II and K 6 - A M D receive at the same speed as the sending rate of the Pentium-Pro. On turning off the checksums, in Table 3.4, both the send and receive band-widths increase on some of the processors. However, the relative order of the pro-cessors in Table 3.4 is similar to that in Table 3.3, except that K 6 - A M D receives 38 Receiver K6-AMD Pentium-II Pentium-Pro Sender K 6 - A M D M_Sent 33622 33904 M_Recv N / A . 33615 33904 M S . B W 220.34 222.02 M R - B W 220.29 222.02 Pentium-II M_Sent 52125 52766 52274 M_Recv 52125 52631 52256 M S . B W 341.58 345.79 342.54 M R . 
B W 341.58 344.90 342.42 Pentium-Pro M_Sent 42889 42794 M J l e c v 42877 42781 N / A M S - B W 280.82 280.36 M R _ B W 280.74 280.27 Table 3.4: U D P _ S T R E A M Testing Using netperf (turning off the checksums) better from Pentium-II in the latter case. It is clear that Pentium-II can send data faster than both Pentium-Pro and K 6 - A M D , and receive slightly faster than or at the same rate as the latter two processors. TCP_STREAM Testing Using netperf Table 3.5 are the results for a T C P - S T R E A M where the data is sent over an estab-lished stream between the two ends. The experimental setup for T C P - S T R E A M testing are: • Recv buffer Size = 65536 bytes • Send buffer Size = 65536 bytes • Send Message Size = 65536 bytes 39 Receiver Sender K6-AMD in Mbits/sec Pentium-II in Mbits/sec Pentium-Pro in Mbits/sec K 6 - A M D N / A 193.83 193.13 Pentium-II 175.71 312.41 317.45 Pentium-Pro 178.62 290.57 N / A Table 3.5: T C P - S T R E A M Testing Using netperf Both Pentium-II and Pentium-Pro perform similarly when K 6 - A M D is the sender. Both Pentium-Pro and Pentium-II perform significantly better than K6-A M D (about 80% better) when Pentium-II is the sender, as with U D P J 3 T R E A M testing when checksums are not turned off. Pentium-II also performs about 63%bet-ter than K 6 - A M D with Pentium-Pro being the sender, which is different from the results with U D P _ S T R E A M testing. Evaluation netperf was run focusing on the U D P J 3 T R E A M . Many applications which actually process data are going to have throughput numbers similar to what one gets with a U D P _ S T R E A M . The better the U D P _ S T R E A M number, the better a system is at processing data, and running a Myrinet card at the same time. Both PCs and alphas (using results from Duke University) have problems with optimally executing the U D P _ S T R E A M while running the Myrinet card. A l -phas have limited hswap numbers, PCs have bad bwjmem-cp numbers. Because of this, neither is an ideal choice for a platform on which to run Myrinet. An ideal platform will have excellent hswap, bwjmem^cp and combined hswap/bwjmemory.cp bandwidths. A platform like this is capable of running the Myrinet card and process-ing data at the same time, and will indicate this capability with good U D P _ S T R E A M send and receive numbers. Cost constraints of this project limited the choice to PCs. Both Pentium-II 40 and Pentium-Pro have equivalent bw-merri-cp/hswap results. Pentium-II has slightly better copy bandwidths than the other processors. Although Pentium-Pro receives at the same rate as Pentium-II, the latter sends at a much faster rate than the former, as per the results using the network performance evaluation tool. These results lead Pentium-II to be the best choice. 3.3 Layout of the Myrinet Network Testbed The complete Myrinet layout is shown in Figure 3.2. Each of the 16 Pentium-IIs, with 128MB R A M , 512KB cache and 440FX chipset, numbered 1 through 16 in the figure have a M2M-PCI32C L A N a i interface. The L A N a i boards on the computers access the Myrinet network via M2M-CB-05 S A N cables. These are 5-foot length cables which connect the L A N a i interfaces to M 2 M - D U A L - S W 8 dual 8-port S A N switches via M 2 M - Y splitters. M2M-X M2M-X ! 1 t ! 
M2M-X M2M-X 0 1 2 3 0 1 2 3 M 2 M - D u a l - S W 8 M 2 M - D u a l - S W 8 Dua l 8-port S A N switch Dua l 8-port S A N switch (switch 0) (switch 1) 4 5 6 7 4 5 6 7 M 2 M - Y M 2 M - Y M 2 M - Y M 2 M - Y M 2 M - Y M 2 M - Y M 2 M - Y M 2 M - Y t 1 1 1 y V Y V 0 0 0 0 0 0 0 0 Figure 3.2: Layout of Myrinet Network The specifications of the Myrinet switches, interfaces, cables and splitters are 41 briefly reviewed here, but their details can be found in [Myr97b, Myr96, Myr97e]. 3.3.1 M 2 M - D U A L - S W 8 M y r i n e t Switch This switch is designed for a in-cabinet cluster to be deployed with M2M-PCI32C and Myrinet S A N cables. A single dual 8-port Myrinet S A N switch houses two 8-input/output, full-crossbar, Myrinet-SAN switches as shown in Figure 3.3. Each channel operates at 1.28 Gb/s , providing a total bisection data rate of 10.24 Gb/s for each switch. Port 0 Por t 1 Port 2 Por t 3 A & B A & B P o w e r A & B A & B | I | I 7 <v p " ~ [ I I —i—H n - — i — ^—) — i— i— h—i— Port 4 Port 5 Por t 6 Por t 7 A & B A & B A & B A & B Figure 3.3: Top View of Dual 8-Port Myrinet-SAN Switch The 8 ports support 1.28+1.28 Gb/s data rate on each of the A and B links. The A and B links may be split onto separate cables using the M 2 M - Y splitters as in ports 4)5,6, and 7of Figure 3.2. The switches A and B can also be connected using a M 2 M - X . This is called an A - B link because it is used to create a link, between the A and B switches at the S A N connectors of the M2M-Dual-SW8 switch, as in ports 0, 1 of switch 0 and ports 2,3 of switch 1, as shown in Figure 3.2. 42 In the network shown in Figure 3.2, computers "1,3,5,T are connected to the internal switch A of switch 0, "2,4,6,8" to the internal switch B of switch 0, "10, 12, 14, 16" to the internal switch A of switch 1, and "9, 11, 13, 15" to the internal switch B of switch 1. The cluster connection can be viewed as shown in Figure 3.4. C D G D C D G D Figure 3.4: Myrinet Cluster Connections 3.3.2 M2M-PCI32C Myrinet-SAN/PCI Interface This host interface contains the L A N a i 4 processor which runs the M C P . Its main characteristics are: • 32-bit PCI-bus interface operating at 33MHz. Its D M A performance at 33MHz, with large blocks, exceeds 120MB/s peer mode and with good PCI implemen-43 tations to/from system memory. , • Interface memory is 1MB operating at double the PCI-clock rate. • Interface processor is a LANai-4 RISC, operating at the PCI-clock rate(33MHz), explained in detail in section 3.1.1. 3.3.3 M 2 M - C B - 0 5 Cables These 5-foot cables are used to connect the interfaces to the switches. Each S A N -cable carries two channels which are full-duplex and have 1.28Gb/sec data rate each. 3.4 Language Support This section discusses Emerald's features that support mobility and the original structure of its interpreter before changes were made to its design for running within the L A N a i . 3.4.1 E m e r a l d P r imi t ives to Support M o b i l i t y Mobility in Emerald [JLHB88] is provided by a small set of language features which are discussed below: 1. Move : move X to Y , moves the object X to Y if Y is a node, or moves X to the node where Y is located if Y is an object. 2. Locate: locate X , returns the node where object X is located. 3. Fix: fix X at Y , moves X to Y and fixes it there, i.e., makes it non-mobile. 4. Unfix: unfix X makes a fixed object mobile again. 5. Refix: refix X at Y unfixes the object from where it is currently located, moves it to Y and fixes it there. 
44 These routines are a complete list of functions that one would need to move objects around. A main concern in the move routine is deciding how much of the object's state, in addition to its activation records, to move to the target when the object migrates. If the object's state has an array of other objects, (a) should the entire array be moved, or (b) should the individual objects of the array be moved only when invoked on the remote site. The designers of Emerald have introduced the attached qualifier in the language, to be used by the programmer, in order to make explicit requests of moving certain data objects. For example if the object: object myobject attached var myDirectory: Directory var myOtherDirectory: Directory end object is moved, myDirectory has to be moved as well because it is declared to be attached, as opposed to myOtherDirectory which may or may not be moved depending upon the compiler's decisions. Parameter passing is implemented by the call-by-object-reference technique. Moving an argument object when an object moves depends on (i) the size of the argument object, (ii) whether it will be referenced frequently enough at the remote site, and (iii) how costly it is to move it to the remote site. The Emerald compiler tries to make these decisions. For example, small immutable objects such as strings or integers should be moved rather than being referenced remotely. The application program can assist the compiler in arriving at the appropriate decisions by specifying whether it would like a particular object to be moved to the remote site. The above review, though does not do complete justice to all the features of Emerald, supports the idea that mobility or service migratibility can be easily 45 Initialization Create Threads Pool of Worker Threads Reader >~ Reader > Put incoming packets in the queue to be processed Process queue ( ^ ^ W O T t o ^ ^ ) C^^Worker^^) Figure 3.5: Emerald Virtual Machine achieved using it as a base framework. 3.4.2 Structure of the Emerald Interpreter The host-based Emerald interpreter (named host-emx) is a multi-threaded dis-tributed object-oriented virtual machine. It consists of one big C program which uses the standard libc, as well as other libraries to handle system calls and com-munication over the network via sockets. When it starts, the virtual machine does basic initialization, loads all the builtin types from files, and builds objects of the loaded-in types. It then creates a worker pool of threads and a single listener thread as shown in Figure 3.5. The listener thread listens for new connections. On receiving a request from a new node, it creates a new reader thread which has its own T C P socket stream connection to the new node. The listener creates a new reader for every connection. For example, if there are four nodes in the Emerald world, each node will have three reader threads. 46 The reader threads listen for any incoming packets on their stream. They put the incoming messages in the process queue which uses the F IFO technique to send them to the worker pool. At any time, there should be at least one worker thread so as to avoid deadlock. When the worker thread receives a new message, it checks whether the ob-ject to which the message is to be delivered is resident. If the object has moved, the worker redirects the message to where the local node thinks it has moved to. Otherwise it processes the incoming message, and if needed, also handles sending out new messages. 
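The arrangement just described, one listener, a reader thread per remote node, and a pool of workers draining a FIFO process queue, can be pictured with a small pthreads sketch. This is not the host-emx source: the queue type, the worker-pool size, and the helper functions read_packet(), dispatch(), and accept_connection() are assumptions introduced only to show the shape of the design.

#include <pthread.h>
#include <stdlib.h>

/* Minimal FIFO process queue shared by reader and worker threads. */
typedef struct msg { struct msg *next; /* payload omitted */ } msg_t;

static msg_t *q_head, *q_tail;
static pthread_mutex_t q_lock     = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_nonempty = PTHREAD_COND_INITIALIZER;

static void enqueue(msg_t *m)
{
    pthread_mutex_lock(&q_lock);
    m->next = NULL;
    if (q_tail) q_tail->next = m; else q_head = m;
    q_tail = m;
    pthread_cond_signal(&q_nonempty);
    pthread_mutex_unlock(&q_lock);
}

static msg_t *dequeue(void)
{
    pthread_mutex_lock(&q_lock);
    while (q_head == NULL)
        pthread_cond_wait(&q_nonempty, &q_lock);
    msg_t *m = q_head;
    q_head = m->next;
    if (q_head == NULL) q_tail = NULL;
    pthread_mutex_unlock(&q_lock);
    return m;
}

/* Hypothetical stand-ins for the real I/O and invocation machinery. */
static void *accept_connection(void)    { return NULL; }
static msg_t *read_packet(void *stream) { (void)stream; return calloc(1, sizeof(msg_t)); }
static void   dispatch(msg_t *m)        { free(m); }  /* process or forward the message */

/* One reader per remote node: pull packets off its stream, queue them. */
static void *reader(void *stream)
{
    for (;;) enqueue(read_packet(stream));
    return NULL;
}

/* Workers drain the queue; a real worker would check whether the target
 * object is still resident and forward the message if it has moved. */
static void *worker(void *arg)
{
    (void)arg;
    for (;;) dispatch(dequeue());
    return NULL;
}

/* The listener spawns a dedicated reader thread for every new node. */
static void *listener(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_t t;
        pthread_create(&t, NULL, reader, accept_connection());
    }
    return NULL;
}

#define NWORKERS 4   /* assumed pool size */

/* Start-up: create the worker pool and the single listener thread. */
static void start_threads(void)
{
    pthread_t t;
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&t, NULL, worker, NULL);
    pthread_create(&t, NULL, listener, NULL);
}

Keeping at least one worker alive at all times is what avoids the deadlock mentioned above.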
The virtual machine also has a garbage collector to clean up any stale objects. The virtual machine divides its memory area into three main blocks to handle malloc and garbage collection: • MALLOC memory: This area is used to allocate space for data that is not garbage collected, i.e, for thread control blocks, stacks, tables, sockets, file descriptors, etc. for the different threads running in the virtual machine. • GC memory: Emerald uses the generational garbage collection mechanism. There are two main blocks (a) NEWGC, and (b) OLDGC. The NEWGCmem-ory area is further divided into two blocks NEWGCl and NEWGC2. When objects are first created, they live in one of these blocks. When the garbage collector is called, it handles either NEWGCl or NEWGC2. It cleans up un-referenced objects and moves the recently created referenced objects to the other block. After a few garbage collections the old objects which have moved back and forth between NEWGCl and NEWGC2, and are still being refer-enced are promoted to the OLDGC block. Objects moved to OLDGC may be garbage collected if they remain unreferenced, but at a less frequent rate than NEWGCl or NEWGC2 because they are less likely to die. 47 3.5 Summary This chapter reviewed the Myricom details necessary for understanding the design and implementation of lanai-emx (running on the LANai ) , with particular focus on the L A N a i 4 processor and the Myrinet A P I . This was followed by the results of the experimental benchmarks to support the choice of Pentium-II for the Myrinet cluster. The Myrinet layout illustrates the network that was constructed and the different connectors and switches involved. The overall structure and the high level implementation of the host-emx was discussed to explain the changes necessary for porting it to the L A N a i . The discussion of the interpreter will help understand the changes necessary for execution on the L A N a i . The next chapter discusses the design and implemen-tation changes needed for running within the 1MB of memory on the L A N a i . The main changes include the multi-threadeddistribution package of the interpreter, communication protocols and the host /LANai interface due to the lack of an'OS on the L A N a i among other intricate implementation details. 48 Chapter 4 Design and Implementation of Emu Emu consists of two implementations of the Emerald virtual machine: lanai-emx, running on the LANai ; and host-emx, running on the host. The lanai-emx on the network adaptor can communicate with other interpreters, either on the local host via E B U S <-> L B U S transfer, or on some other host via the Myrinet network. In other words, the lanai-emx running on the network adaptor can do anything that the host-emx can. For example, it can process messages, build objects, execute the active objects, and garbage collect. The layout of the chapter follows. Section 4.1 discusses the key issues to be considered for running the Emerald interpreter within the L A N a i . This is followed by Section 4.2 which discusses the data structures and protocols designed and implemented in lanai-emx to get the support of the host due to lack of an OS. Sections 4.3 and 4.4 discuss the communi-cation protocols and new routing information protocol in the lanai-emx respectively. Section 4.5 discuss various miscellaneous issues in the design of Emu and Section 4.6 concludes. 
49 4.1 Key Issues for Moving the Emerald Interpreter to the LANai This section discusses the main issues involved in running the Emerald virtual ma-chine within 1MB of S R A M on the L A N a i . The first two issues could be handled by implementation changes. The distribution package had to be completely redesigned, because the environment of the L A N a i is different from what one would expect of a regular processor. Each of the issues are concerned with resolving incompatibilities with the new environment. 4.1.1 L i m i t e d M e m o r y The Emerald interpreter on the L A N a i must execute within 1MB of S R A M . This is possible only if the Emerald code, shared memory between the host and the L A N a i , stack, heap and garbage collection memory space fit in this 1MB. The memory layout of the lanai-emx on 1MB of R A M on the L A N a i is shown in Figure 4.1. The problems faced in the mapping of the various parts of the interpreter to the L A N a i S R A M , due to the limited memory, will now be described. The L A N a i environment does not provide any debugger, such as gdb, or even printing to standard output such as simple printf stubs. Hence, initial mapping of the different parts of the interpreter to L A N a i S R A M was done by trial and error. When the host loads the M C P to the L A N a i , handshaking between the host and the L A N a i must successfully complete. This is necessary for the L A N a i and host to communicate via shared memory or for any sort of debugging protocol to be built. A considerable amount of time was spent in successfully completing the startup calibration between the L A N a i and the host, which could have been saved if some debugging support was provided. The lanai-emx running within the L A N a i is a highly optimized version, with code size of about 209 K B without debugging options (emx), and about 285 K B 50 with debugging help (emx.debug), as shown in Figure 4.1. Shared memory starts at 308 K B to provide for proper execution of either versions of the lanai-emx M C P . With the help of a short program, it was found that the stack grows to about 3 K B to 4 K B and the shared memory between the host and the L A N a i is about 56 K B . An additional 20 K B was added to provide for nested activation records, hence 76 K B was allocated in total for the shared memory and the stack. The remaining space was partitioned to about 256 K B for the heap, 64 K B for the new generation garbage collection, and 320 K B for the old generation garbage collection, as shown in Figure 4.1. 0x100000 I M e m o r y for o l d generation garbage co l l ec t ion on lanai -emx OxbOOOO OxaOOOO 0x60000 0x4d000 0x47280 0x34300 0x00000 M e m o r y for new generation garbage co l l ec t i on o n lana i -emx M e m o r y for heap/mal loc on lanai -emx ^ Stack grows d o w n Shared m e m o r y b /w host and f L A N a i ' L A N a i M C P , i.e., l ana i -emx code emx_debug emx Figure 4.1: Memory Layout on the L A N a i 4.1.2 L i b r a r y Support Emerald makes extensive use of the standard C library, libc. The L A N a i , being a different environment with no OS or debugging support, is an unfriendly program-ming environment. It does not support the standard libc routines. The original 51 host-emx was implemented using many functions from libc. The simple approach to solve this problem is to include all libc source files along with the V M code without modification. However this could not be done for two reasons. 
First, the entire V M including its code, heap and stack must run within 1MB of R A M , and including unnecessary libc routines would make this impossible. Second, many of the /defiles use system calls for a regular OS. Hence, including the entire libc unmodified will not work. The major libc issues to be considered for the lanai-emx were: • Malloc routines such as mallocQ, reallocf), free(), etc. had to be rewritten to work in the L A N a i environment. The malloc routines work in the S R A M area that is allocated for the heap/malloc, as shown in Figure 4.1. • Stdout calls were handled by using the print/debug protocol discussed in Sec-tion 4.2.2. • Stdin, system calls and file opens, reads, etc. were handled using the host interface protocol discussed in Section 4.2.1. • Although some libc routines could be worked around, sources for frequently occurring routines such as strcmp(), memmove(), strcpy(), etc. had to be provided. • Basic structures, defines and typedefs had to be provided in addition to rou-tines from libraries. 4.1.3 D i s t r i bu t ion and M o b i l i t y In the original Emerald interpreter, distribution implementation depends on user-level threads. But there is no builtin thread utility for the L A N a i . It seemed better to reimplement the multi-threaded package as a single thread in the L A N a i considering the memory limit of the network adaptor. Redesigning distribution from a multi-threaded model to a single processor model was the most difficult feature of porting the interpreter to the L A N a i . 52 The original multi-threaded Emerald interpreter, explained in Section 3.4.2, provides a conceptual understanding which facilitated coding. However, restructur-ing this multi-threaded system as a single process required a completely different understanding of the interpreter in order to make the different calls at the proper times. Achieving mobility and distribution for the V M on the L A N a i completed the implementation and porting of the lanai-emx. 4.2 Interface to the Host In order to execute on the network processor, the lanai-emx needs assistance from the host. The protocols implemented provide OS services, due to the lack of an OS, and debugging support on the L A N a i . This is followed by a discussion of the shared memory structure for communication between the host and L A N a i and the different services that rely on it. Each of the issues discusses the support provided by the L A N a i with the help of the host. 4.2.1 OS-Support Protocol The Emerald interpreter was designed to run in a Unix environment. It uses files and system calls. On initialization, Emerald interpreter loads a huge (about 120 K B ) Builtins fi\e to build the standard objects (note that everything in Emerald is an object). It also uses system calls to get the ipaddress of the local host which is a part of the object identifier of all objects resident on the local machine. The L A N a i does not support inodes or vnodes as in a Unix environment. There is no file system on the L A N a i . So the OS-Support protocol was designed and implemented to handle opening/reading/closing of files with the help of the host. But the Builtins file is too big to be loaded all at once into the L A N a i memory. Therefore, parts of the file are read and objects are built at every stage before reading the next part of the file. 53 The OS-Support protocol was extended to handle system calls such as geth-ostname(), gethostbyname(), etc., due to the lack of OS on the L A N a i . 
But the M C P running on the L A N a i has the illusion that the calls are satisfied by the L A N a i processor. It was also modified to handle basic information exchange between the L A N a i and the host. In order to make it possible for the host to assist the L A N a i in satisfying the above mentioned calls, the OS-Support protocol on the lanai-emx and the host-emx of the local host communicate via a shared structure shown in Figure 4.2. In short, this protocol can be extended as necessary to enlist the host's help in any future needs where the L A N a i is unable to do so by itself. typedef v o l a t i l e struct _MCP_E2L_info { int req_flag; /* query to be s a t i s f i e d */ int f d ; /* f i l e descriptor f o r f i l e c a l l s */ unsigned int nbytes; •/* number of bytes */ int r e s u l t ; /* r e s u l t of the c a l l */ char name[50]; /* name of f i l e or hostname, etc. */ char buffer[MAX-SIZE] ;/* buffer to return the reply */ } MCP_E2L_info; Figure 4.2: The Host Interface Protocol Memory Structure The routine in Figure 4.3, which is added to host-emx, assists the lanai-emx in handling various calls due to the lack of OS support on the L A N a i . 1. File requests: Cases 1, 2 and 3 handle opening, reading and closing of files respectively for the lanai-emx. 2. gethostname()/gethostbyname() requests: When an Emerald machine comes up, it tries to get its hostname and other details to set or define its node ad-dress. Due to the inability of the L A N a i interpreter to satisfy system calls, such as gethostnameQ / gethostbyname(), it enlists the help of the host interpreter in order to successfully complete those requests. The host V M periodically calls the function whose pseudocode is shown in Figure 4.3 to check whether 54 void handleReqFromLANaiO { access the KCPJEZLAnjo shared structure of the LANai; switch (req_flag of the shared structure) { case 0 no request to s a t i s f y ; case 1 open f i l e s p e c i f i e d and returns i t s f i l e descriptor case 2 read from f i l e as s p e c i f i e d ; case 3 close f i l e ; case 4 s a t i s f y gethostname() request; case 5 s a t i s f y gethostbyname() request; case 6 s a t i s f y cases 4 and 5 i n one go; case 7 get rootnodeid; } update req_flag; } Figure 4.3: Pseudocode for Handling Requests from the L A N a i by the Host these system requests (cases 4, 5 and 6) have been issued from the L A N a i . 3. Information exchange: Basic information about rootnodeid, etc. for the lanai-emx are handled by case 7. 4.2.2 P r i n t / D e b u g Protocol The L A N a i does not provide any support for printing to standard output. Debug-ging support was essential to complete porting of the stand-alone interpreter to the L A N a i . A new data structure MCP-L2EJnfo was introduced in the shared memory/ between the L A N a i and the host, to assist with writing to the standard output. It is in the form of a circular array. No locking is needed to access this shared block of memory. This is because the L A N a i only writes to the buffer and updates the tail of the circular queue, and the host only reads from the buffer and updates the head of the queue. 55 4.2.3 Shared Memory Structure Between LANai and Host The structure shown in Figure 4.4 is the memory on the L A N a i which can be directly accessed by the host. It plays the major role in handling any type of communication such as sending/receiving messages, handshaking, etc. between the L A N a i and the host. 
It consists of data structures for different services such as debugging, host interface support, etc. typedef v o l a t i l e struct _MCP.Sharedjnem_ { int handshake; int address; MCPJloute route; MCP_Send_info send_info; MCP_Recv_info recv_info; MCP_L2E_info debug_info; MCP_E2L_info host_interf a c e j n f o; int dma_burst; int staging_buf f er; } MCP JShared_mem; Figure 4.4: The M C P Shared Memory Structure The different services that this shared memory structure helps support are discussed below: • Handshaking between the host and the L A N a i is achieved by setting the hand-shake attribute. • Routing is done using MCP-Route which consists of the hop length and switch-ing information in an array. • MCPSendJnfo is used to send messages to the L A N a i from the host. It consists of a pointer to the buffer, the buffer length and a ready_bit. • MCPJlecvJnfo is used by the host to receive messages from the L A N a i . Its structure is similar to that of MCPSendJnfo. 56 • Debugging is done with the help of MCP-L2EJnfo. This structure handles messages from the L A N a i to the host, hence the name. This shared structure played an important role in porting the interpreter as there was no other means of debugging in the L A N a i . • MCP-E2LJnfo is the structure that handles file requests, gethostname(), and gethostbyname() requests, etc. The host does the appropriate work based on the req-flag value as explained in Section 4.2.1. 4.3 Communication Protocols Communication protocols of the original Emerald interpreter were redesigned in order to properly execute within the L A N a i . lanai-emx listens for messages not only from the network but also from the host-emx. Also, the original Emerald interpreter was designed for T C P / I P using sockets. This was redesigned for the Myrinet network due to the lack of socket support in the L A N a i environment. The following topics are covered to give an understanding of the different factors involved in building the communication protocol for lanai-emx. • Design of communication protocol in lanai-emx • Communication from L A N a i to Host • Communication from L A N a i to Network • Communication from Host/Network to L A N a i • Reading/Writing data to the Myrinet network 4.3.1 Design of Communication Protocol in lanai-emx The design of the communication protocol for the lanai-emx is portrayed in Figure 4.5. The lanai-emx supports message exchanges between (a) the L A N a i and the 57 network, (b) the host and network, and (c) the host and the L A N a i . This adds to the complexity of the lanai-emx when compared to the original Emerald interpreter. Me; sages to objects i 1 its own address space H O S T Messages to objects in its own space lanai-emx Myrinet Network Adaptor (^Switjch Figure 4.5: Communication Protocol of the V M on the L A N a i The communication control support, provided by lanai-emx has a more com-plex mux/demux routine, for sending and receiving messages either from the network or the host, when compared to a standard M C P . lanai-emx cannot just assume that messages from the network are for the host. This is because the messages can also be for some object in the L A N a i address space. A similar problem exists for mes-sages from the host. This difficulty was handled by modifying the packet header of Emerald packets, and all the marshalling and unmarshalling routines, in order for the mux/demux function to route packets properly. 
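To make that decision possible, every Emerald packet now carries an explicit destination in its header (the added field is described further in Section 4.5.1). The C sketch below shows the general shape of the check; the header layout, field names, and node-id representation are assumptions made for illustration and are not the actual lanai-emx structures.

#include <stdint.h>

/* Illustrative Emerald-over-Myrinet packet header.  The only detail taken
 * from the text is that a destination field was added so that the LANai
 * can demultiplex; everything else here is an assumption. */
typedef struct {
    uint32_t dest_node;   /* node id of the destination VM */
    uint32_t msg_type;    /* Emerald message kind          */
    uint32_t length;      /* payload length in bytes       */
} em_pkt_hdr;

enum pkt_target { FOR_LANAI_VM, FOR_LOCAL_HOST, FOR_NETWORK };

/* Decide where an incoming packet should go: my_node is the node id of
 * the VM on this LANai, host_node that of the local host VM. */
static enum pkt_target demux(const em_pkt_hdr *h,
                             uint32_t my_node, uint32_t host_node)
{
    if (h->dest_node == my_node)
        return FOR_LANAI_VM;    /* an object in the LANai address space */
    if (h->dest_node == host_node)
        return FOR_LOCAL_HOST;  /* DMA it up to host-emx */
    return FOR_NETWORK;         /* forward it onto the Myrinet */
}

A check of this form sits at the front of both receive paths in processIncomingMessages() (Section 4.3.4); it is what lets the lanai-emx keep packets destined for its own objects instead of blindly relaying everything between the host and the network.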
4.3.2 From LANai to the Host

The function sendToLocalhost() handles writing out messages from the LANai to the host. It sends messages using the MCP_Recv_info field of the shared memory structure discussed in Section 4.2.3. Its pseudocode is shown in Figure 4.6.

void sendToLocalhost(void *buffer, int message_len)
{
    wait till send_bit is ready;
    set up DMA of buffer to the host;
    wait for DMA to complete;
    inform host that there is a message waiting for it;
}

Figure 4.6: Pseudocode for Sending Packets from LANai to Host

4.3.3 From LANai to the Network

The function sendToNetwork() is called by processIncomingMessages(), discussed in Section 4.3.4, to route packets from the host to the network. It is also used by vmThreadSend(), which handles outgoing messages for lanai-emx from the LANai address space. vmThreadSend() of lanai-emx has been modified to write messages out to the Myrinet interface (via sendToNetwork()) instead of sending objects over the Ethernet interface as it originally did. The pseudocode of sendToNetwork() is given in Figure 4.7.

void sendToNetwork(void *buffer, int message_len, in_addr dest_ip)
{
    /* Note that the ready_bit of LANai->network need not be checked,
       as this function returns only when the DMA from LANai->network
       completes. */
    find the route from me to dest_ip;
    if (no route exists)
        return error;
    else {
        write the route to dest_ip to the network;
        write the packet header type to the network;
        write message_len to the network;
        set up the DMA of buffer to the network;
        wait for the DMA to finish;
        return;
    }
}

Figure 4.7: Pseudocode for Sending Packets from LANai to Network

4.3.4 From Host/Network to the LANai

The function processIncomingMessages() handles all incoming messages from both the network and the host. It is called periodically by the main loop of the lanai-emx. Its pseudocode is outlined in Figure 4.8.

void processIncomingMessages()
{
    /* From host to the LANai */
    if (the ready_bit of host->LANai is set) {
        receive message from host;
        extract its header;
        if (the message is for me)
            store it in the receive buffer for processing;
        else  /* it is for another VM in the network */
            DMA message to the network (Figure 4.7);
        clear the ready_bit of the host->LANai transfer;
    }

    /* From network to the LANai */
    if (the ready_bit of network->LANai is set) {
        DMA message from network;
        extract its header;
        if (the message is for me)
            store it in the receive buffer for processing;
        else  /* it is for the host */
            DMA the message to the host (Figure 4.6);
        clear the ready_bit of the network->LANai transfer;
    }
}

Figure 4.8: Pseudocode for Processing Incoming Packets to the LANai

4.3.5 Reading/Writing from/to the Myrinet Network

The pseudocode for reading a packet from the Myrinet network is shown in Figure 4.9, and the pseudocode for writing a packet to the Myrinet is shown in Figure 4.10.
Understanding of the L A N a i registers and flag were necessary in writing code for accessing the Myrinet network. However, low level details are beyond the scope of this section. 60 v o i d r e a d P a c k e t F r o m M y r i O { / * t h e f o l l o w i n g i s e x e c u t e d o n l y a f t e r c h e c k i n g i f a p a c k e t has a r r i v e d * / . r e a d RH r e g i s t e r f o r p a c k e t h e a d e r ; i f ( i t i s n o t my h e a d e r t y p e ) { consume p a c k e t and throw i t away; r e t u r n ; } r e a d RW r e g i s t e r f o r t h e p a c k e t l e n g t h ; . / * a p a c k e t s h o u l d have a t l e a s t 1 word * / i f ( l e n g t h <=0) { consume i t and throw i t away; r e t u r n ; } use RMP and RML r e g i s t e r s t o s e t u p DMA f r o m ne twork t o L A N a i memory; w a i t f o r DMA t o c o m p l e t e ; r e a d the. CRC; / * r e a d t h e t a i l o f t h e p a c k e t * / r e p o r t CRC e r r o r s i f any; i f ( t a i l b i t was n o t r e c e i v e d ) consume e x t r a b y t e s and c o n t i n u e ; Figure 4 .9 : Reading Packets from the Myrinet Network The function readPacketFromMyri() is used by processIncomingMessages() discussed in Section 4 .3 .4 . The function sendToNetwork, discussed in Section 4 .3 .3 , uses writePacketToMyri() to write packets out to the network. 4.4 Message Routing A standard M C P , or a program executing on a network adaptor, sends messages from the host to the network, or from the network to the host. The lanai-emx performs processing of the messages, in addition to just routing them between the host and the network. Hence, the routing function of the lanai-emx is required to be more intelligent than that of the original Emerald interpreter, because messages from the network or from the host could be for the objects in its own address space. As discussed in Section 4 .3 .4 , the router part of the lanai-emx extracts head-61 v o i d w r i t e P a c k e t T o M y r i ( ) { w a i t u n t i l t h e p a c k e t i n t e r f a c e says we can s e n d ; send t h e p a c k e t h e a d e r w h i c h c o n s i s t s o f t h e r o u t e d e t e r m i n e d p l u s a t a g ; { use t h e SB r e g i s t e r s t o w r i t e t h e route->path c a l c u l a t e d one hop a t a t i m e ; / * t h i s i s t h e t a g * / w r i t e out t h e p a c k e t h e a d e r i n t h e SH r e g i s t e r ; } w r i t e out t h e p a c k e t l e n g t h i n t h e SW r e g i s t e r ; s e t u p L A N a i - > m y r i n e t DMA u s i n g SMP and SMLT r e g i s t e r s ; w a i t f o r DMA t o c o m p l e t e ; } Figure 4.10: Writing Packets to the Myrinet Network ers from incoming packets, either from the host or from the network, in order to properly route the message to the indicated destination object. The destination ob-ject might be in its address space in which case the message is placed in its incoming buffer. Otherwise, the message is routed to the network or the host appropriately. Similarly, handling of outgoing packets from the L A N a i is handled appropriately, depending on the route, as discussed in Sections 4.3.2 and 4.3.3. The routing protocol implemented is a modified version of the routing code obtained from the B D M package [HDMS97] at Mississippi University. The original version would read the connections of the computers to the switch ports from a static file. As the OS-Support protocol implemented requires the assistance of the host, using it would require host assistance for routing every single packet. 
The purpose of the entire porting of the interpreter to the L A N a i , to offload the host and save C P U cycles, would be lost. The OS-Support protocol had to be designed to support file handling, in order to load the Builtins file initially. However, other protocols which use files frequently, and hence need constant host assistance were redesigned. The following sections discuss the data structures and routing function implemented to route packets by the shortest path into the network to the appropriate destination. 62 4.4.1 D a t a Structures Used for R o u t i n g The static file information, describing the network connections, were stored in a static array. The modified switchJnfo data structure which was used to store the information for each switch is shown in Figure 4.11 and the switch.conn structure it uses, follows in Figure 4.12. t y p e d e f s t r u c t { i n t num_ports ; u n s i g n e d i n t h o s t s [MAX_P0RTS] ; swi tch_conn s w i t c h e s [MAX_PORTS]; i n t v i s i t e d ; } s w i t c h _ i n f o ; Figure 4.11: Data Structure to Store Information for Each Switch t y p e d e f s t r u c t { i n t s w i t c h j i u m ; i n t port_num; } s w i t c h _ c o n n ; Figure 4.12: switch-conn Data Structure 4.4.2 R o u t i n g Information P ro toco l The lanai-emx calls the function in Figure 4.13 to find the routing path to the des-tination node. On return, route->path stores shortest routing path from hostJp to destJp, and route-> length stores the number of hops to traverse to reach the desti-nation. The routing function uses some basic routing code from the B D M package, but a lot of functions had to be modified for the lanai-emx. 63 v o i d Map (MCP_Route * r o u t e , u n s i g n e d i n t h o s t _ i p , u n s i g n e d i n t d e s t _ i p ) { i n t m y . s w i t c h , m y _ e n t r y _ p o r t , h o p s , i ; i n t t m p j r o u t e [MAXJIOUTE]; use t h e s t a t i c i n f o r m a t i o n a r r a y t o f i n d t h e s w i t c h and and t h e p o r t t o w h i c h t h e l o c a l h o s t i s c o n n e c t e d ; i f ( ( m y . s w i t c h == - 1 ) o r (my_entry jpor t == - 1 ) ) { my l o c a l h o s t _ i p i s n o t i n r o u t i n g t a b l e ; r e p o r t e r r o r ; } e l s e { hops = F i n d _ r o u t e ( t m p _ r o u t e , my_swi t ch , my . e n t r y j p o r t , d e s t _ i p ) ; i f ( hops == (MAXJIOUTE + 1) ) { c o u l d n o t f i n d c o n n e c t i o n f r o m h o s t _ i p t o d e s t _ i p ; r e p o r t e r r o r ; } / * t m p j r o u t e i s b a c k w a r d s , so a c c e s s i t f r o m end t o f r o n t * / f o r ( i=0 ; i<hops ; i++) r o u t e - > p a t h [ i ] = ( ( tmpjroute [ h o p s - ( i+1)] &0x3f ) I 0x80) ; r o u t e - > l e n g t h = h o p s ; . } } Figure 4.13: Myrinet Routing Protocol 4.5 Other Issues The following are miscellaneous issues encountered while designing Emu, and the changes made to the host-emx, both to assist the lanai-emx by providing OS support, and to communicate with other machines via the Myrinet network. 4.5.1 E m e r a l d over M y r i n e t Vs. E m e r a l d over IP over M y r i n e t Normally, an Emerald machine communicates with another machine via a T C P stream, using sockets for sending, receiving or listening for messages. But in a Myrinet cluster, the messages need to be packaged as Myrinet packets. The original Emerald packet headers did not need to specify the destinations since the T C P / I P protocol took care of copying the destination address in their packet headers. 
Emerald over IP over Myrinet would result in three levels of traversal when compared to the two levels in the case of Emerald over Myrinet. To satisfy the 64 latter case, the Emerald packet headers had to accommodate an extra field to store the destination node. Changing the Emerald packet headers resulted in changing different routines like startMsgQ, sendMsg(), forwardMsgf) which handle outgoing packets. Marshalling routines for assembling an outgoing packet and unmarshalling routines on an incoming packet had to be modified to support the added destination attribute. 4.5.2 Issues Concern ing Hardware Resets If the L A N a i is being bombarded with a lot of messages from the network, the board is sometimes hard reset by the network. The two least significant bits of the T I M E O U T register (refer to Section 3.1.1) determine the timeout period by which the L A N a i chip must consume packets from the network. If the L A N a i chip fails to do so within this period, the NRES signal is asserted. This causes the N R E S - E N A B L E of the M Y R I N E T register to the reset the L A N a i chip. So far, in our application, this has not occurred as high priority is given to messages from the network. 4.5.3 Modif ica t ions to the host-emx The Emerald interpreter on the host had to be modified to communicate with the lanai-emx. The host-emx is a modified version of the model described in Section 3.4.2. The following items discuss the changes to the original interpreter running on the host: • Communication medium support: The host-emx, of the Emu system, was modified to listen for messages on the Myrinet interface rather than the Ethernet interface. It checks whether the ready.bit of the MCP-RecvJnfo shared structure is set for it to receive messages. If so, it copies the message to its incoming buffer and clears the ready.bit. For sending messages, vmThreadSend() of the host-emx is modified 65 to set the ready-bit of the MCPSendJnfo shared structure, in order to inform the interpreter on the L A N a i that a new message has to be D M A e d from the host. Functionality to support lanai-emx: Support for lanai-emx with files, debugging, system calls and other basic in-formation exchanges were discussed in Section 4.2. Modification to Packet Headers: As in the case of the lanai-emx, the headers had to be changed to accommo-date the destination field for communicating with the Myrinet interface. The routines that use the packet headers had to be modified to support the new format. Byte-order: L A N a i uses the big-eridian format whereas the PCs use the little-endian format for the representation of integers. This initially caused a lot of handshaking problems between the L A N a i and the host when trying to read/write the handshaking variable. It was however difficult to pinpoint this problem as there was no debugging tool at the commencement of the project. Heterogeneous Processor Environment: Initially, the Emerald interpreter, on the host, only had to process packets from other nodes which are also running in a similar environment (Unix), while it now has to assist the L A N a i as well. Coordinating all these interactions required proper interleaving of function calls. For example, the host-emx has to assist with handling file support, debugging support, etc. for the lanai-emx in addition to coordinating with the L A N a i in sending and receiving messages via the Myrinet interface. 66 4.6 Summary This chapter discussed the various issues involved in the design and implementation of Emu. 
4.6 Summary

This chapter discussed the various issues involved in the design and implementation of Emu. First, it discussed the problems with the LANai environment in running the Emerald interpreter and the protocols implemented to resolve the conflicts between the different processing environments. Second, the communication issues for the new network and the complexity of the mux/demux function were explained. Third, the new routing functionality introduced by the lanai-emx was discussed: the new routing information protocol was described at a high level, explaining the routing technique used by the Myrinet interfaces and switches. Finally, miscellaneous issues that arose during the design of Emu were discussed, including the changes needed for the host-emx to communicate with the lanai-emx.

The Myrinet interfaces can now run an Emerald interpreter, thereby executing Emerald packets. Emerald objects circulate within the Myrinet network and can be processed either by the lanai-emx or by the host-emx. Thus, an interpreter on a network adaptor can do much more than the simple routing program that traditionally runs there.

Chapter 5

Performance

This chapter is concerned with the performance of the Emu system and the applications that benefit from the interpreter running on the LANai. Section 5.1 discusses the micro benchmarks run on Emu to measure the slowdown of certain operations on the network processor compared to the central processor of the computer. The results of the micro benchmarks support the expected claim that the LANai processor is better suited to communication intensive processing, while the host processor is better suited to computation intensive processing. Two applications were used to test the Emu system or, to be more explicit, the usefulness of migrating functionality to the LANai. The first application, kilroy, is a standard Emerald program, and the second application, locator, is designed to locate services among the workstations in the cluster. The final section concludes with a discussion of the usefulness of Emu.

5.1 Micro Benchmarks

Two categories of micro benchmarks were used to compare the performance of the lanai-emx and the host-emx. The first category dealt with computation-intensive operations and the second concentrated on communication-intensive functions. The experimental framework of the system is described in detail in Section 3.3.

    Operation           host-emx runtime   lanai-emx runtime   LANai slow down
                        (microsec)         (microsec)          factor
    add (2)                   2.7                15.8               5.9
    multiply (7)              2.7                24.5               9.1
    divide (436 / 17)         2.9                25.2               8.7
    invoke                    3.6                20.6               5.7
    create                    3.4                21.5               6.3

Table 5.1: Computation Intensive Micro Benchmarks

5.1.1 Computation Intensive Micro Benchmarks

The micro benchmarks in the first category are listed in Table 5.1. Each of the operations was run one million times and the averages are reported for accuracy. The LANai timings are measured using the host's clock because the LANai's clock does not report time accurately. As each of the operations is repeated one million times, the slight overhead of contacting the host for the runtimes on the LANai can be ignored.
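The text does not spell out the exact instrumentation, so the fragment below is only a plausible host-side harness consistent with the description above: the host clock brackets one million repetitions of an operation and the average is taken. The benchmark_op callback stands in for whichever Emerald operation (add, multiply, invoke, and so on) is being driven.

    /* Plausible timing harness (an assumption, not the thesis code). */
    #include <sys/time.h>

    #define ITERATIONS 1000000L

    double time_operation(void (*benchmark_op)(void))
    {
        struct timeval start, end;
        long i;

        gettimeofday(&start, NULL);
        for (i = 0; i < ITERATIONS; i++)
            benchmark_op();                       /* e.g. drive one "add (2)" */
        gettimeofday(&end, NULL);

        /* elapsed microseconds divided by the iteration count */
        return ((end.tv_sec - start.tv_sec) * 1e6 +
                (end.tv_usec - start.tv_usec)) / (double)ITERATIONS;
    }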
The second and third columns of Table 5.1 report the runtimes of each operation on the host processor and the network processor respectively. The fourth column is the ratio of the runtime on the LANai to that on the host. The operations in the first column are described below:

1. add (2): adding 2 to an integer object.
2. multiply (7): multiplying an integer object by 7.
3. divide (436 / 17): dividing object 436 by object 17.
4. invoke: invoking another object.
5. create: creating an object.

Note that the arguments 7 and 436 / 17 were chosen for the multiply and divide operations respectively to avoid any hardware support provided for arguments such as 1 or 2.

The slowdown in the last column of Table 5.1 is almost as expected. The clock rate of the host (Pentium-II) processor is 266 MHz, and that of the LANai processor is 33 MHz [BCF+95], so the network processor is slower than the host processor by a little more than a factor of 8. Even though the clock-rate ratio of the host to the LANai is 8, the measured slowdown ratio is in the range of 5.7 to 9.1 for the different operations. The slowdown on the network processor cannot be avoided, given the lagging speed of the LANai processor compared to the host processor.

5.1.2 Communication Intensive Micro Benchmarks

Table 5.2 contains the results of the communication intensive operations. Each of the operations was run one thousand times and the averages are reported. In the table, the acronym RTT stands for Round Trip Time. The communication intensive operations and the reported results are explained below.

Communication times along the four paths: (i) from one host to another host via the Myrinet network, (ii) from a host to the local LANai, (iii) from a host to a remote LANai, and (iv) from one LANai to another LANai, are reported for each of the operations in the first four rows. The fifth row reports the slowdown on the LANai, which is the ratio of the communication time on the LANai-to-LANai path to that on the host-to-host path. The operations executed on all four paths are:

1. Move Obj RTT: move an object back and forth.
2. Move Self RTT: move the executing process object back and forth.
3. Remote Invoke: invoke a method within a remote object.

    Path                    Move Obj RTT   Move Self RTT   Remote Invoke
                            (microsec)     (microsec)      (microsec)
    Host to Host (HH)           1854.7          2020.5          1041.2
    Host to locLANai             936.8          5076.9           965.9
    Host to remLANai            2418.2          6855.8          1721.3
    LANai to LANai (LL)         3064.2         11827.4          2336.9
    Slow down (LL/HH)              1.7             5.9             2.2

Table 5.2: Communication Intensive Micro Benchmarks

It is clear from the results that, as expected, communication intensive operations fare better than computation intensive operations. The slowdowns reported in the last row are all less than a factor of 6. In fact, in the cases of moving another object or of remote invocations, the slowdowns are hardly noticeable considering that the measurements are in microseconds. In such cases, avoiding host interrupts can be beneficial if the host is busy with a heavy computation intensive workload. It is difficult to give exact reasons for the almost negligible slowdown in the Move Obj RTT and Remote Invoke cases; for a network processor that is slower than the host processor by a factor of 8, a slowdown in the range of 1.7 to 5.9 was quite unexpected. A reasonable explanation for the Move Self RTT slowdown being greater than in the other two cases is that it is more computation intensive: when the currently executing process object is moved, all its activation records also need to be moved to the remote node. This extra computation, of figuring out the state of the executing object to be migrated, is not required in the other two cases.
5.1.3 Discussion

The micro benchmarks support the thesis statement in Chapter 1 that the communication intensive components of an application are better off on the network processor when extensive computation intensive processing is also needed, since that processing can then proceed on the host processor. The lagging clock rate of the LANai processor makes it unsuitable for computation intensive processing.

5.2 Application 1: Kilroy

kilroy is a simple Emerald program which tests whether an object that is started on one node can migrate to all other nodes in the Emerald world and return to the originator successfully. This application is useful for testing whether each node has state about every other node in the Emerald world. The scenario shown in Figure 5.1 is just one particular path taken by kilroy in traversing the Emu world; for clarity, it is limited to three nodes in the figure. There were two reasons why kilroy was run in the Emu system.

Figure 5.1: Kilroy on Emu

First, kilroy tests the robustness of Emu. In the Emu world, there are two interpreters on each host, one on the host and the other on the NIC, so kilroy on Emu runs on 32 Emerald virtual machines in the 16-node Pentium-II cluster. This process object successfully visited all nodes in the Emu system and validated the communication protocols used.

Second, a process object which migrates from one node to another provides the framework needed for active networking applications. The application can easily be modified to visit only the lanai-emxs of Emu. As the Myrinet network is dynamically reconfigurable, kilroy can be modified to do dynamic routing in the LANai without host intervention. As the network increases in size and complexity, kilroy can modify the routing tables in the SRAM. In this way, it extends the static routing function in the LANai to dynamic routing. Emu makes this possible because it is designed to be extended at runtime, unlike systems such as Hamlyn [BJM+96] or U-Net [vEBBV95].

5.3 Application 2: Location Service

This application was designed and implemented to provide a generic location service to different types of applications. The following sections discuss the design of the service, the experimental setup, the efficiency of migrating the locator to the LANai, and examples of applications that might benefit from it.

5.3.1 Design of the Location Service

The location service provides an API for applications that need to locate services in some part of the distributed environment. The location service is provided by two classes, locator and dataFinder, listed in Figures 5.2 and 5.3 respectively. Locator is responsible for maintaining information about the services available at the local node. dataFinder is the class that is invoked by applications to find the location of particular services, either on the local node or on remote nodes. This requires dataFinder to maintain a list of all of the locators existing in the cluster.

The locator is a monitored class which provides mutual exclusion to protect its shared state when different dataFinders try to access or modify the state of a particular locator at the same time. It provides the basic operations to:
    const locator <- monitor class locator
      attached const local <- Set.of[Service].create[320]

      export operation add[ind : Service]
        local.add[ind]
      end add

      export operation remove[ind : Service]
        local.remove[ind]
      end remove

      export operation objReq[ind : Service] -> [result : Boolean]
        result <- local.contains[ind]
        if result then
          local.remove[ind]
        end if
      end objReq

      export operation isitHere[ind : Service] -> [result : Boolean]
        result <- local.contains[ind]
      end isitHere
    end locator

Figure 5.2: The Locator Service Class

• add a service
• remove a service
• check whether a particular service is available at the local node
• be remotely invoked by other objects when requesting a remote service

The dataFinder is the class that is called by applications wishing to locate services. When requested to find a particular service, it checks whether the service is available locally. If it is locally available, no remote operations are required. If the requested service is not locally resident, dataFinder serially multicasts requests to the other locators in the cluster until it finds the service. It reports failure to the application if none of the nodes provides the requested service.

The dataFinder class, which exports one operation, could easily have been merged with the locator class, as in the initial design of the location service. The reason for separating the location service into two classes is to avoid deadlock if one monitored locator tries to access another locator which is in turn trying to access the former at the same time. In the final version, no monitored operation invokes another monitored operation, which ensures the absence of deadlock in this service.

    const dataFinder <- class dataFinder
      attached const field locators : Vector.of[locator] <-
          Vector.of[locator].create[16]
      attached field count : Integer <- 0

      export operation get[ind : Service] -> [result : Boolean]
        result <- locators[0].isitHere[Service]
        if !result then
          for i : Integer <- 1 while i < count by i <- i + 1
            result <- locators[i].objReq[ind]
            if result then
              locators[0].add[ind]
              exit
            end if
          end for
        end if
      end get
    end dataFinder

Figure 5.3: The Service Finder Class

The ease of designing locator clearly exemplifies the advantage of using Emerald for a distributed system such as Emu. Each of the consistency, remote invocation and migration services required by the location service is inherently provided by Emerald. This saved considerable time in implementing locator, whose naive version is less than 100 lines of Emerald code.

The real difficulty in successfully completing the application was in determining the data structures that had to be used, because of the memory limitations of the LANai. For example, using an array to keep track of locators in the Emu world failed when the location service was migrated to the LANai: the lanai-emx was unable to malloc memory for the array data structure. Hence, a vector was used instead, as shown in Figure 5.3. Vector is the most primitive mechanism in Emerald for indexable storage; this builtin object is used in the implementation of the array in Emerald. Vector is a mutable object and supports the basic subscripting operations. Using arrays would have been more appropriate, because they can grow dynamically, unlike vectors, and hence can be extended at runtime as the number of nodes in the cluster increases.

5.3.2 Experimental Setup

The experimental framework is described in Section 3.3. A dataFinder and a locator are created for each application on the host. The application invokes the dataFinder to locate a service. The dataFinder maintains state about the other locators in the Emu world by invoking the getlocationServer operation on all the remote host nodes. The dataFinder checks its local locator and, if it receives a negative reply, invokes the remote locators. Experiments were performed for two configurations.

• Configuration 1: locator, dataFinder and the application execute on the host-emx of Emu, as shown in Figure 5.4. In this situation, the lanai-emx of Emu is involved only in the routing of messages.

• Configuration 2: the locator and dataFinder are migrated to run on the lanai-emx, with the application running on the host-emx, as shown in Figure 5.5.

The two configurations can be compared with the help of Figures 5.4 and 5.5. In these figures, only three hosts in the cluster are depicted for clarity. The number of I/O crossings, interrupts and context switches avoided in Figure 5.5, when compared to Figure 5.4, suggests that there are significant advantages in migrating the location service to the LANai. The next section discusses the benefits achieved, and the overhead avoided, by running the location service on the LANai.

5.3.3 Efficiency of Emu

Figure 5.4: The Location Service on host-emx of Emu

Migrating the location service to the LANai alleviates the I/O bottleneck discussed in Chapter 1. The advantages of the Emu system are clearly demonstrated in the scenario where only 3 hosts are considered.

• Traversal of the I/O bus is reduced, as all the remote invokes and their replies are handled by the lanai-emx. A maximum of 6 traversals of the I/O bus in configuration 1 is reduced to a constant 2 traversals in configuration 2.

• Unnecessary main memory copies across the I/O bus are avoided. All remote invocations in configuration 2 use the LANai SRAM.

• There is always only 1 interrupt on the host for any invocation of the location service.

• The reduction of interrupts avoids context switches, which consume a significant amount of CPU time.

Figure 5.5: The Location Service on lanai-emx of Emu

Each of the benefits listed above results in lower demands on the host CPU, freeing more of its resources for useful work. In this evaluation, a maximum of 2 context switches is incurred for any request on dataFinder, even in the worst case scenario where all remote locators have to be invoked. This constant is valid whether the number of nodes is 3 or 100.

This evaluation can be generalized to n nodes of the cluster. The following equations estimate the time spent in locating a service in the worst case, where all n locators have to be invoked, for the two configurations:

    CT(n) = n(t_CP) + n[2(t_c + t_pci + t_mem) + t_int] + 2(n-1)(t_net)    (1)

    NT(n) = n(t_NP) + [2(t_c + t_pci + t_mem) + t_int] + 2(n-1)(t_net)     (2)

where CT(n) and NT(n) are the worst-case times to locate a service for Figures 5.4 and 5.5 respectively. The factors involved are:

• n is the number of hosts in the cluster.
• t_c is the time spent in context switching in the kernel.
• t_int is the kernel interrupt time.
• t_pci is the PCI traversal time for DMA between the host and the LANai, which is approximately 15.8 microseconds for 256 bytes.
• t_mem is the time spent in memory copies on the host, which is approximately 1.02 microseconds for 256 bytes, assuming the data is in the cache.
• t_net is the time spent for each DMA between the LANai and the network, which is a little less than 15.8 microseconds for 256 bytes.
• t_CP is the processing time on the CPU, which is about 10.3 microseconds.
• t_NP is the processing time on the network processor, which is approximately 59.4 microseconds (about 5.7 times t_CP).

The times reported for t_pci and t_mem are for messages of size 256 bytes, as the request messages are small. There is only one final response message of size 6 KB for any service request, and hence it is not taken into the calculation.

Moving the location service to the LANai saves about n(15.8 + 1.02) microseconds in host/LANai DMAs and memory copies on the host. The measured network processing time, however, is on average about 5.7 times slower than the measured host processing time, so the host overhead savings is accompanied by an increase in processing time on the LANai. From the equations it is clear that context switches, memory copies and interrupts grow with the number of nodes in equation (1), but remain constant in equation (2). For a sample run of 256 random service requests in a 6-node cluster, for example, the savings amounted to about 892 context switches and about 446 interrupts, just by moving the location service to the LANai. When the location service runs on the LANai, this saved processing time can be spent by the n hosts on useful work.

5.3.4 Applications that could Use the Location Service

Most applications that are built for a distributed environment need some kind of location service if the services are distributed and no central state information is maintained. Even if state is maintained, services could migrate, making correct updates of the state information impossible. In such cases, multicasting is often used to locate the requested service. The location service implemented here is a generic service, demonstrated using integers; the Emerald code could easily be modified to use any other type defined by the programmer. It could also be optimized to perform in a more intelligent manner. Examples of systems that might use this service are:

• Distributed directories, where the data type would be strings or files.
• A global memory system, which would use pages as the service data type.
• Parallel matrix multiplication, where the data type would be matrices.

5.4 Discussion

The main motivations for Emu, discussed in Chapter 1, were to alleviate the I/O bottleneck and to offload all unnecessary processing from the host. The locator implemented here clearly demonstrates that Emu has satisfied the goals stated in Chapter 1: avoiding context switches, interrupts, unnecessary memory copies and I/O traversals, and utilizing the processing power of the network processor for more useful work than basic flow control.

Emu is an attempt to reduce the overhead involved in message transmissions, as suggested by the analysis in [MVCA97]. In [MVCA97], Martin et al. state that applications demonstrate strong sensitivity to the overhead involved in sending or receiving a message and to the gap between transmissions, but are surprisingly tolerant of increased latency and lower per-byte bandwidth.
Their 80 results obtained using a wide variety of chosen applications verify their claim. The Emu system reduces the overhead by eliminating host processing as much as possible. The system would work well if the right applications are chosen for migration to the network processor. Applications that are computation intensive are not going to benefit running on the lanai-emx because the network processor is slower than the host processor by a factor of 8. Emu provides support for Emerald object processing on the network processor. It does not decide whether to execute on the central processor or the network processor. It is the responsibility of the programmer to make that deci-sion. Components of applications that are communication intensive can benefit from Emu's support of processing on the NIC. Intelligent offloading of communication-intensive components, from the host to the L A N a i , will save host processing power for computation-intensive components that are better off executing on the host-emx. 81 Chapter 6 Conclusions and Future Work 6.1 Conclusions The Emu system has made possible the complete implementation of an interpreter on the network processor. The choice of an object oriented distributed interpreter, such as Emerald, makes the applications that run on the system more modular and compact because the language takes care of all the gory details. Emu has introduced runtime extension of the functionalities on the network processor, and an approach for utilizing idle network processing power. Emu has satisfied all the thesis goals stated in Chapter 1: first, migration of the Emerald interpreter to the stand-alone embedded environment of the L A N a i has been completed; second, the interpreter running on the central processor has been modified to communicate via the Myrinet interface; third, the I/O bottleneck problem has been attacked successfully. Unnecessary host interrupts, associated context switches, memory copies and C P U cycles can be avoided with the help of the interpreter running within the network processor. Emu can improve the throughput of the system if intelligent decisions are made in choosing those components of applications that will be run (optimally) on the network processor. Emu does not make migration decisions, and leaves that responsibility to the end application. 82 However, it provides all the support necessary for the actual migration process; Performance of the Emu system in satisfying these goals is supported by the locator service. A formula has been derived which compares the total processing time in locating a service, when processing is done on the central processor, to that obtained by moving the processing to the network processor. The Emu system only requires a single interrupt, when the location service executes on the L A N a i , in locating a service in a cluster with increasing number of nodes, even in the worst case scenario when all the nodes have to be invoked. Emu has proved that processing on the network processor can increase or at least maintain similar overall system throughput, while at the same time sav-ing host processing, for more useful work, during heavy computational workload. However, intelligent delegation of processing to the network processor is required, because the network processor is slower than the central processor by a factor of 8. Communication-intensive applications which require low latency and high through-put would benefit from the interpreter on the network adaptor in the Emu system. 
Emu has paved the way for other similar systems to be built. It has demon-strated that it is possible to take full advantage of the programmable NIC to increase system throughput. It has shown that with intelligent choice of applications, idle network processing power can be utilized without sacrificing performance. 6.2 Effort Required in Creating Emu Implementing lanai-emx, in the Emu system, was writing programs for the bare hardware, with not much support for programming. This required learning the intricate details of the NIC registers and Myrinet network. Although the design was simple, the actual implementation was a painful learning experience. This can be attributed to the different environment of the L A N a i processor and the existence of bugs in the Emerald interpreter code which became more pronounced when the communication interface was modified to use the Myrinet network. 83 The main limitations of the L A N a i processor were the lack of an OS, lack of standard C library support in the lanai3-gcc compiler, 1MB of S R A M memory and no debugging tools. Protocols were built to provide OS and libc support, and debugging tools. The 1MB of S R A M forced a smaller and a more compact version of the Emerald interpreter. In fact, the optimized version of the interpreter code, with no debugging support, is only about 209 K B of compiled C code. The implementation of Emu helped make Emerald more robust. First, it helped find the dormant bugs that became active with the introduction of the new interface. Examples of not completely correct code were found in move and re-mote invocation implementations. Second, in the original version there were tables for mapping oids (these are numeric object IDs) to objects and objects to oids. Due to the memory limitations, the implementation was modified to have both the mappings in a single table with a more complex mapping function but with a 50% savings in memory usage. Third, in the original version, the distribution package was implemented using user level threads. Due to lack of a thread utility, and space constraints in the L A N a i 4 processor, the current version of Emerald supports a single process distribution package for the L A N a i as well the original package with threads. Emu would significantly reduce system creation for any future research that would make such extensive use of the network processor. It has demonstrated the difficulties that one would encounter while implementing such systems. The debug-ging tools and the protocols implemented, to overcome the limitations of a network adaptor with a different environment, would help in providing the steps to take for building a similar system. Thus, Emu not only shows that such a system is possible but also would be useful research for future systems on the NIC. 6.3 F u t u r e W o r k Emu is a system that provides support for network processing on the L A N a i pro-84 cessor. But it could, be improved to perform better with some optimizations and modifications to certain parts of the Emerald interpreter. The following sections discuss some features to help improve the Emu system and suggests some additional applications that can benefit from this system. 6.3.1 M e m o r y Fragmentat ion Problems The malloc and free routines, implemented to allocate and free space from the L A N a i heap, do not provide for the defragmentation of memory when it is freed. 
This causes fragmentation of memory which creates havoc in the already stringent heap space allowed which is about 256 K B of S R A M . ' Hence, even though pages are freed as expected, malloc reports failure if space is requested for a large number of pages. This is because there is no one contiguous block of memory available for the request to be satisfied. If some sort of defragmentation is implemented for the heap area, more efficient utilization of memory will be possible. 6.3.2 A M o r e Precise Debugger for the L A N a i The debugger implemented for the lanai-emx part of Emu is the simple approach using printf stubs. It has helped solve the memory problems in Emu but was not easy to use. For example, there is no way to read the point of execution in the program, such as is possible with the where and stacktrace commands in gdb, or to display the values of the variables when the program on the L A N a i dies due to segmentation faults. A more precise debugging tool would avoid unnecessary printing during the process of debugging memory faults. The debugger could help with providing only the necessary information when the program fails. 85 6.3.3 Performing G M S lookups on the L A N a i One of the applications that would benefit from Emu is the global memory system [Fee96]. It has two main operations getpage() and putpage(). G M S makes extensive use of the communication bandwidth. If these requests are directed to an engine on the L A N a i which can D M A pages from/to host memory without consuming C P U cycles, the host processor need not be bothered with remote requests. In this way, it lends its memory to trusted hosts but is not disturbed in order to process any remote messages. If the hashtable of the G M S is moved to the L A N a i , the getpage() and put-page() requests can be successfully handled by the L A N a i . It could make use of the location service which was explained in the previous chapter. Moreover, most of the processing is in the transferring of pages which makes these requests communication intensive. Hence, the lower speed of the network processor when compared to the central processor can be ignored. 6.3.4 Support for A c t i v e Ne twork ing on the N I C Kilroy, in chapter 5, demonstrated the capabilities of Emu to provide support for Active Networking. Applications such as this could be created, following kilrotfs pattern, which run only on the NICs and do useful work as they migrate, such as checking state and modifying it appropriately. For example, an active object migrating among the NICs can be responsible for maintaining up to date information in the hash tables of the G M S application described above. 6.3.5 D y n a m i c R o u t i n g A p p l i c a t i o n The concept of supporting an intelligent interpreter engine on the NIC, can be extended to a local area network. But in a local area network, the unpredictable latency in traversing the network might mean that traditional paging to disk is faster than paging to global memory. However, other applications might benefit from the 86 Emu concept. If the engine on the NIC is modified to make intelligent routing decisions based on (i) network partitioning, (ii) heavy traffic on the standard route, and (iii) a new shorter route becoming available, then a dynamic routing scenario is established. First, the engine on the NIC need not bother the host, thereby avoiding interrupts and context switches among other things. 
Second, manual intervention is unnecessary because the network can automatically reconfigure itself to changes, almost instantaneously, which is not the case otherwise. 6.3.6 H P J a v a Engine as Framework for Emu The HP Java Virtual Machine [Vij98] is designed for embedded systems with re-source constraints such as the Myrinet NIC, with memory limitations of 1MB of S R A M . It complies with Sun's Java Virtual Machine specifications, but with its 0.5MB of code, provides more flexibility. Using HP Java as the framework for Emu instead of Emerald would be an interesting future work. With Java being so popu-lar, using it as the basis for Emu would make the latter more easily accepted as a new programming model for a cluster. 6.4 F i n a l R e m a r k s The Emu system is a model that supports migration of active objects to the network processors, thereby offloading the hosts for other useful work. The runtime migra-tion of services to the network processor is made possible by the interpreting engine designed for the NIC. It provides all the features to support active networking appli-cations, as stated above. Emu has proved that processing on the network processor is possible. But good throughput in such a system depends on proper delegation of responsibilities to the NIC, keeping in mind that the network processor is much slower than the host processor. Unless the network processors are made faster, the performance in such a system would be limited by the speed of the network proces-87 sor. Emu has provided a model to avoid the I/O bottleneck by utilizing idle network processing power, but the new network processing bottleneck introduced has to be tackled for more complex processing to occur on the network interface cards. 88 Bibl iography [Ach93] Bruno Achauer. The D O W L Distributed Object Oriented Language. Communications of the ACM, 36, Nr 9:48-55, September 1993. [AF89] Yeshayahu Artsy and Ralph Finkel. Designing a Process Migration Facility: The Charlotte Experience. IEEE Computer, 22, Nr 9:47-56, September 1989. [BCF+95] Nanette J . Boden, Danny Cohen, Robert E . Felderman, Alan E . K u -lawid, Charles L . Seitz, Jakov N . Seizovic, and Wen-King Su. Myrinet -A Gigabit-per-Second Local-Area Network. IEEE-Micro, 15(l):29-36, February 1995. [BCZ96a] Samrat Bhattacharjee, Kenneth L. Calvert, and Ellen W . Zegura. Im-plementation of an Active Networking Architecture. White paper pre-sented at gigabit switch technology workshop, College of Computing, Georgia Institute of Technology, Washington University, St. Louis, July 1996. [BCZ96b] Samrat Bhattacharjee, Kenneth L. Calvert, and Ellen W . Zegura. On Active Networking and Congestion. Technical Report GIT-CC-96-02, College of Computing, Georgia Institute of Technology, 1996. 89 [BHJ + 87] Andrew Black, Norman C. Hutchinson, Eric Jul, Henry Levy, and Larry -Carter. Distribution and Abstract Types in Emerald. IEEE Transac-tions on Software Engineering, 13(l):65-76, January 1987. [BHJL86] Andrew Black, Norman Hutchinson, Eric Jul, and Henry Levy. Object Structure in the Emerald System. In Proceedings of the First ACM Conference on Object-Oriented Programming Systems, Languages and Applications, pages 78-86, Portland, Oregon, September 1986. [Bic95] Lubomir F . Bic. M E S S E N G E R S : Distributed Computing using Au-tonomous Objects. Technical Report TR-95-19, 1995. [BJM+96] Greg Buzzard, David Jacobson, Milon Mackey, Scott Marovich, and" John Wilkes. An Implementation of the Hamlyn Sender-Managed In-terface Architecture. 
In Proceedings of the USENIX 2nd Symposium on Operating Systems Design and Implementation, Seattle, Washington, October 1996. [BL85] A . Barak and A . Litman. M O S : A Multi-Computer Distributed Operat-ing System. Software—Practice and Experience, pages 725-737, August 1985. [BMKK97] A . Baratloo, Karaul. M . , H . Kar l , and Z. M . Kedem. KnittingFactory: An Infrastructure for Distributed Web Applications. Technical Report TR1997-748, New York University, November 1997. [Boo96] G . Booch. Object-Oriented Development with Java. Report on Object Analysisand Design, SIGS Publications, 2(6):2-4, March 1996. [BSW89] Ammon Barak, Ammon Shiloh, and Richard Wheeler. Flood Preven-tion in the M O S I X Load-Balancing Scheme. IEEE Computer Soci-ety Technical Committee on Operating Systems Newsletter, 3(l):23-27, 1989. 90 [BVW95] Matt Bishop, Mark Valence, and Leonard F . Wisniewski. Process M i -gration for Heterogeneous Distributed Systems. Technical Report PCS-TR95-264, Department of Computer Science, Dartmouth College, Au-gust 1995. [BWvE97] Anindya Basu, Matt Welsh, and Thorsten von Eicken. Incorporating. Memory Management into User-Level Network Interfaces. Technical Report TR97-1620, Department of Computer Science, Cornell Univer-sity, August 1997. Presented at Hot Interconnects V , Aug 1997, Stan-ford University. [CAL + 89] Jeffrey S. Chase, Franz G . Amador, Edward D. Lazowska, Henry M . , Levy, and Richard J . Littlefield. The Amber System: Parallel Pro-gramming on a Network of Multiprocessors. In Proceedings of the Twelfth ACM Symposium on Operating System Principles, pages 147— 158, Litchfield Park, Arizona, December 1989. [Che88] David R. Cheriton. The V Distibuted System. Communications of the ACM, 31, Nr 3:314-333, March 1988. [dKdP88] W . de Kinder and M . de Prycker. A T M : A Logical Evolution from ISDN. 1st European Conf. on Inf. Techno, for OrganisationalSystems, Elsevier Science Publishers B.V., 1988. [Dou91] Fred Douglis. Transparent Process Migration: Design Alternatives and the Sprite Implementation. Software—Practice and Experience, 21, Nr 8:757-785, August 1991. . [DRS89] F . Brent Dubach, Robert M . Rutherford, and Charles M . Shub. Process-Originated Migration in a Heterogeneous Environment. In Proceedings of the ACM Conference on Computer Science, New York, 1989. 91 [ea95] David Culler et al. Generic Active Message Interface Spec-ification. Technical report, Department of Computer Sci-ence, Berkeley University, February 1995. Available from U R L http://now.cs. berkeley.edu/Papers/Papers/gam_spec.ps. [FBDM96] Munehiro Fukuda, Lubomir F . Bic, Michael B . Dillencourt, and Fehmina Merchant. Intra- and Inter-Object Coordination with M E S -S E N G E R S . In Proceedings of the First International Conference on Coordination Models and Languages, Cesena, Italy, April 1996. [Fee96] Michael J . Feeley. Global Memory Management for Workstation Net-works. PhD thesis, University of Washington, Seattle, Washington, December 1996. [GHK+97] Carl A . Gunter, Mike Hicks, Pankaj Kakkar, Jonathan Moore, Scott Nettles, and Sam Weber. P L A N : A Programming Language for Active Networks. Technical report, Department of Computer and Information Science, University of Pennsylvania, July 1997. Available from U R L http://www.cis.upenn.edu/ switchware/PLAN/. [HC91] C. Horn and V . Cahill. Supporting Distributed Applications in the Amadeus Environment. Computer. Communications, 14, Nr 6:358-365, July 1991. [HDMS97] Gregory Henley, Nathan Doss, Thomas McMahon, and Anthony Skjel-lum. 
B D M : A Multiprotocol Myrinet Control Program and Host Appli-cation Programmer Interface. Technical Report MSSU-EIRS-EIRC-97-3, Integrated Concurrent and Distributed Computation Research Lab-oratory, Mississippi State University, May 1997. Available from U R L http: / / www.erc.msstate.edu/labs/hpcl/bdm.html. 92 [HDS97] Gregory Henley, Nathan Doss, and Anthony Skjellum. B D T : A Thread Library for the Myricom L A N a i 4.x Communications Pro-cessor. Technical Report MSSU-EIRS-ERC-97-2, Integrated Con-current and Distributed Computation Research Laboratory, Mis-sissippi State University, January 1997. Available from U R L http://www.ere.msstate.edu/labs/hpcl/bdm.html. [JLHB88] Eric Jul, Henry M . Levy, Norman C. Hutchinson, and Andrew P. Black. Fine-Grained Mobility in the Emerald System. ^4CM Transactions on Computer Systems, 6(1):109-133, February 1988. [Kak97] Pankaj Kakkar. The P L A N Tutorial. Technical report, Department of Computer and Information Science, University of Pennsylvania, July 1997. Available from U R L http://www.cis.upenn.edu/ switch-w a r e / P L A N / . [LA89] Luke Lin and Mustaque Ahamad. . Checkpointing and Rollback-Recovery in Distributed Object Based Systems. Technical Report GIT-ICS-89/43, School of Information and Computer Science, Georgia In-stitute of Technology, November 1989. [LJP93] Rodger Lea, Christian Jacquemot, and Eric Pillevesse. Cool: System Support for Distributed Programming. Communications of the ACM, 36, Nr 9:37-46, September 1993. [LS91] Michael Litzkow and Marvin Solomon. Supporting Checkpointing and Process Migration Outside the U N I X Kernel. In Proceedings of the Usenix Winter 1992 Technical Conference, pages 283-290, Berkeley, C A , USA, January 1991. Usenix Association. [LTBL97] Michael Litzkow, Todd Tannenbaum, Jim Basney, and Miron Livny. Checkpoint and Migration of U N I X Processes in the Condor Distributed 93 Processing System. Technical Report 1346, Computer Science Depart-ment, University of Wisconsin-Madison, Apri l 1997. Available from U R L http://www.cs.wisc.edu/condor/publications.html. [MDP+96] Dejan Milojicic, Fred Douglis, Yves Paindaveine, Richard Wheeler, and Songnian Zhou. Process Migration. Technical report, T O G Research Institute, A T & T Laboratories, University of Toronto and Platform Computing, October 1996. Available from U R L http://www.opengroup.org/ dejan/papers/indx.htm. [MH97] Jonathan T. Moore and Michael W . Hicks. P L A N Programmer's Guide. Technical report, Department of Computer and Information Science, University of Pennsylvania, July 1997. Available from U R L http://www.cis.upenn.edu/ switchware/PLAN/. [Moo97] Jonathan T. Moore. P L A N Grammar. Technical report, Depart-ment of Computer and Information Science, University of Pennsylvania, July 1997. Available from U R L http://www.cis.upenn.edu/ switch-w a r e / P L A N / . [MV93] Herman Moons and Pierre Verbaeten. Object Migration in a Heteroge-neous World - A Multi-Dimensional Affair. In Proceedings of the Third International Workshop on Object Orientation in Operating Systems, pages 62-72, Asheville, Noth Carolina, December 1993. [MVCA97] Richard P. Martin, Amin M . Vahdat, David E . Culler, and Thomas E . . Anderson. Effects of Communication Latency, Overhead, and Band-width in a Cluster Architecture. In Proceedings of the ISCA '97, pages 85-97, Denver, C O , 1997. A C M . [Myr96] Myrinet internal report, Myricom, Inc., November 1996. Available from U R L http://www.myri.com/myrinet/cables/m2m-widget.html. 
94 [Myr97a] Myrinet internal report, Myricom, Inc., September 1997. Available from U R L http://www.myri.com/myrinet/. [Myr97b] Myrinet internal report, Myricom, Inc., July 1997. http://www.myri.com/myrinet/PCI/m2m-pci32.html. [Myr97c] Myrinet internal report, Myricom, Inc., September 1997. Available from U R L http://www.myri.com/myrinet/pitch.html. [Myr97d] Myrinet internal report, Myricom, Inc., July 1997. Available from U R L http://www.myri.com/myrinet/sw-index.html. [Myr97e] Myrinet internal report, Myricom, Inc., July 1997. Available from U R L http://www.myri.com / myrinet/switches/m2m-dual-sw8.html. [Nut94] Mark Nuttal. A Brief Survey of Systems Providing Process or Object Migration Facilities. Operating Systems Review, pages 64-80, October 1994. [Ous93] John K . Ousterhout. Tel: A n Embeddable Command Language. Tech-nical Report CSD-89-541, University of California, Berkeley, August 1993. [PCA95] David A . Patterson, David E . Culler, and Thomas E . Anderson. A Case for N O W (Networks of Workstations)—abstract. In Proceedings of the Fourteenth Annual ACM Symposium on Principles of Distributed Computing, page 17, Ottawa, Ontario, Canada, 2-23 August 1995. [PM83] Michael L. Powell and Barton P. Miller. Process Migration in D E -M O S / M P . Proceedings of the Ninth ACM Symposium on Operating System Principles, pages 110-119, October 1983. [RAA+88] M . Rozier, V . Abrossimov, F . Armand, I. Boule, M . Gien, M . Guille-mont, F . Herrmann, C. Kaiser, S. Langlois, P. Leonard, and 95 W . Neuhauser. Chorus Distributed Operating Systems. Computing Systems Journal, l(4):305-370, December 1988. [RTL + 91] Rajendra K . Raj, Ewan D. Tempero, Henry M . Levy, Andrew P. Black, Norman C. Hutchinson, and Eric Jul. Emerald: A General-Purpose Programming Language. Software—Practice and Experience, 21 (1) :91— 118, January 1991. [Sap88] Peter S. Sapathy. W A V E - 1 : A New Ideology of Parallel Processing on Graphs and Networks. Proceedings of Future Generations Computer Systems, 4, 1988. [Sap95] Peter S. Sapathy. Mobile Wave Technology for Distributed Knowledge Processing in Open Networks. Technical report, June 1995. [SB88] Marcel Schelvis and Eddy Bledoeg. The Implementation of a Dis-tributed Smalltalk. In ECOOP'88 European Conference on Object-Oriented Programming, pages 212-232, Oslo, Norway, August 1988. [SDP91] Santosh K . Shrivastava, Graeme N . Dixon, and Graham D . Parrington. An Overview of the Arjuna Distributed Programming System. IEEE Software, pages 66-73, January 1991. [SHDM96] Anthony Skjellum, Gregory Henley, Nathan Doss, and Thomas McMahon. A Guide to Writing Myrinet Control Programs for L A N a i 3.x*. Technical report, Integrated Concurrent and Distributed Computation Research Laboratory, Missis-sippi State University, February 1996. Available from U R L http://www.erc.msstate.edu/icdrl/learn_mcp/smp.tar.gz. [Shu90] Charles M . Shub. Native Code Process-Originated Migration in a Het-erogeneous Environment. In Proceedings of the ACM Conference on Computer Science, pages 266-270, New York, 1990. 96 [SJ95] Bjarne Steensgaard and Eric Jul. Object and Native Code Thread Mo-bility Among Heterogeneous Computers. In Proceedings of the Fifteenth ACM Symposium on Operating System Principles, pages 68-78, Copper Mountain Resort, Colorado, December 1995. [SM] Alexander B . Schill and Markus U . Mock. D C E + + : Distributed Object-Oriented System Support on top of OSF D C E . Technical re-port, Institute of Telematics, University of Karlsruhe, Germany. [Smi97] Peter W . Smith. 
The Possibilities and Limitations of Heterogeneous Process Migration. PhD thesis, Computer Science Department, Univer-sity of British Columbia, Vancouver, B C V 6 T 1Z4, October 1997. [tel96a] Telescript Technology: Mobile Agents. Telescript internal report, Gen-eral Magic, 1996. [tel96b] Telescript Technology: An Introduction to the Language. Telescript internal report, General Magic, 1996. [tel96c] Telescript Technology: The Foundation for the Electronic Marketplace. Telescript internal report, General Magic, 1996. [tel96d] Telescript Technology: Scenes from the Electronic Marketplace. Tele-script internal report, General Magic, 1996. [TGSK96] D . L. Tennenhouse, S.J. Garland, L. Shrira, and M . F . Kaashoek. From Internet to Activenet. Request for comments, Laboratory for Computer Science, M I T , January 1996. [TH91] M . M . Theimer and B . Hayes. Heterogeneous Process Migration by Recompilation. In Proceedings of the Eleventh International Conference on Distributed Computing Systems, May 1991. Also: Xeroc P A R C Technical Report CSL-92-3. 97 [TLC85] Marvin M . Theimer, Keith A . Lantz, and David R. Cheriton. Pre-emptable Remote Execution Facilities for the V-System. Proceedings of the Tenth ACM Symposium on Operating System Principles, December 1985. [TSS+97] David L. . Tennenhouse, Jonathan M . Smith, W . David Sincoski, David J . Wetherall, and Gary J . Minden. A Survey of Active Network Research. IEEE Communications, 35, Nr 1:80-86, January 1997. [TW96] David L . Tennenhouse and David J . Wetherall. Towards an Active Network Architecture. Computer Communication Review, 26, Nr 2, April 1996. An earlier version of this paper was delivered during the keynote session of Multimedia Computing and Networking, San Jose, C A , January 96. [vEBBV95] Thorsten von Eicken, Anindya Basu, Vineet Buch, and Werner Vo-gels. U-Net: A User-Level Network Interface for Parallel and Distributed Computing. In Proceedings of the Fifteenth ACM Symposium on Operating System Principles, Copper Mountain Re-sort, Colorado, December 1995. Also: Available from U R L http://www.cs.cornell.edu/Info/Projects/U-Net/. [Vij98] Jaikumar Vijayan. HP's Java effort raises questions on Java's Fu-ture. Technical report, HP, Inc., March 1998. Available from U R L http://www.idg.net/idgJrames/english/content.cgi?vc=docid_9-50365.html. [WPE+83] B . Walker, G . Popek, R. English, C. Kline, and G . Thiel. The LOCUS Distributed Operating System. In Proceedings of the Ninth ACM Sym-posium on Operating System Principles, pages 49-70, Bretton Woods, 1983. 98 [WT96] Dave Wetherall and David Tennenhouse. The A C T I V E IP Option. In 7th ACM SIGOPS European Workshop, Connemara, Ireland, Septem-ber 1996. [Zay87] E . Zayas. Attacking the Process Migration Bottleneck. In Proceedings of the Eleventh ACM Symposium on Operating System Principles, pages 13-24, Austin, T X , November 1987. A C M . 99 
