"Science, Faculty of"@en . "Computer Science, Department of"@en . "DSpace"@en . "UBCV"@en . "Mulye, Sameer S."@en . "2009-01-11T00:00:00"@en . "1994"@en . "Master of Science - MSc"@en . "University of British Columbia"@en . "Multicomputer (distributed memory MIMD machines) have emerged as inexpensive,\r\nyet powerful parallel computers that promise high computational performance. A parallel\r\napplication running on these machines consists of a set of user processes running on\r\ndifferent processors. The only way processes share data in this environment is by message\r\npassing which can become the bottleneck to obtaining high performance.\r\nThis thesis examines the issues influencing communication in multicomputers. Two\r\nstatic analysis tools are designed: a mapping tool and runtime system for communication\r\nand synchronization. An efficient mapping of a parallel application onto a parallel\r\ncomputer is necessary to minimize the total communication cost. Delegating this task\r\nto the user makes application development a complex process as the user must know the\r\nunderlying hardware. An automated hardware-independent mapping tool is designed to\r\nmap arbitrary process graphs on a transputer-based multicomputer.\r\nA native runtime system on these multicomputers consists of message-passing primitives\r\nwhich are efficient but only support simple send and receive operations. A high-level\r\nruntime system provides functionalities such as flow control, buffering, and group communication\r\nusing a wide range of primitives. However, these primitives are difficult to\r\nuse and it is up to the user to select the optimal primitives for efficiency. Hence a\r\nmessage-passing library is designed for a transputer system. It is built on top of Trollius,\r\nan operating system for multicomputers, which offers a variety of functions, each with\r\ndifferent combination of performance and functionality. The library includes a few communication\r\nprimitives which provide a common interface for all the Trollius functions\r\nwhile maintaining the efficiency of the low-level functions. The mapping tool statically\r\ncompiles a program structure info the configuration information and based on this information,\r\nthe message-passing library selects the optimal Trollius functions.\r\nParsec (the Parallel System for Efficient Computation), a parallel programming environment\r\nfor multicomputers, is being developed at UBC. Our tools are integrated with\r\nParsec and become the backbone in achieving high performance. Heterogeneity is one of\r\nthe important design issues in Parsec. Currently these tools are available on two different\r\nhardware platforms: a transputer-based multicomputer and a cluster of Sun Sparc\r\nworkstations. Thus the user can take advantage of either environment to write portable\r\nand efficient parallel applications."@en . "https://circle.library.ubc.ca/rest/handle/2429/3549?expand=metadata"@en . "5358386 bytes"@en . "application/pdf"@en . "C O M M U N I C A T I O N O P T I M I Z A T I O N IN P A R S E C : A P A R A L L E L P R O G R A M M I N G E N V I R O N M E N T F O R M U L T I C O M P U T E R S by SAMEER S. MULYE B.E. 
B.E. (Computer Engineering), University of Bombay, India, 1990

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

in

THE FACULTY OF GRADUATE STUDIES
DEPARTMENT OF COMPUTER SCIENCE

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
November, 1994

© Sameer S. Mulye, 1994

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Computer Science
The University of British Columbia
Vancouver, Canada

DE-6 (2/88)

Abstract

Multicomputers (distributed memory MIMD machines) have emerged as inexpensive, yet powerful parallel computers that promise high computational performance. A parallel application running on these machines consists of a set of user processes running on different processors. The only way processes share data in this environment is by message passing, which can become the bottleneck to obtaining high performance.

This thesis examines the issues influencing communication in multicomputers. Two static analysis tools are designed: a mapping tool and a runtime system for communication and synchronization. An efficient mapping of a parallel application onto a parallel computer is necessary to minimize the total communication cost. Delegating this task to the user makes application development a complex process, as the user must know the underlying hardware. An automated, hardware-independent mapping tool is designed to map arbitrary process graphs onto a transputer-based multicomputer.

A native runtime system on these multicomputers consists of message-passing primitives which are efficient but only support simple send and receive operations. A high-level runtime system provides functionalities such as flow control, buffering, and group communication using a wide range of primitives. However, these primitives are difficult to use, and it is up to the user to select the optimal primitives for efficiency. Hence a message-passing library is designed for a transputer system. It is built on top of Trollius, an operating system for multicomputers, which offers a variety of functions, each with a different combination of performance and functionality. The library includes a few communication primitives which provide a common interface for all the Trollius functions while maintaining the efficiency of the low-level functions. The mapping tool statically compiles a program structure into the configuration information and, based on this information, the message-passing library selects the optimal Trollius functions.

Parsec (the Parallel System for Efficient Computation), a parallel programming environment for multicomputers, is being developed at UBC. Our tools are integrated with Parsec and become the backbone in achieving high performance. Heterogeneity is one of the important design issues in Parsec.
Currently these tools are available on two different hardware platforms: a transputer-based multicomputer and a cluster of Sun Sparc workstations. Thus the user can take advantage of either environment to write portable and efficient parallel applications.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgment
1 Introduction
  1.1 Problems
  1.2 Motivations
  1.3 Approach
  1.4 Thesis Context
  1.5 Thesis Organization
2 Related Work and Background
  2.1 Parallel Programming Environment
  2.2 Parallel Languages and Compilers
  2.3 Mapping tools
  2.4 Message-passing Libraries
  2.5 Background
3 System Design
  3.1 System Overview
  3.2 The Mapping Tool
  3.3 Runtime System
    3.3.1 Parsec Process Model
    3.3.2 Message Passing Primitives
4 Mapping Strategy on Multicomputers
  4.1 Hardware System
  4.2 Structural Overview
  4.3 Mapping Strategy
    4.3.1 Mathematical Formulation
    4.3.2 Mapping Constraints
    4.3.3 Algorithm
    4.3.4 Experimentation
5 Runtime System for Transputers
  5.1 Transputer Programming Model
  5.2 Low Level Software Tools
    5.2.1 Logical C
    5.2.2 Trollius
  5.3 Runtime System
    5.3.1 Implementation Details
  5.4 Performance Analysis
6 Programming A Network of Workstations
  6.1 Workstation Environment
  6.2 Mapping Tool
  6.3 Runtime System
    6.3.1 Internal Features
  6.4 Performance Analysis
  6.5 Program Development in Parsec
7 Conclusions
  7.1 Future Work
Bibliography
Appendices
A Transputer System Layout

List of Tables

1.1 Message Passing Timings
4.1 Wire Structures
4.2 Mapping Statistics
5.1 Trollius Function Invocation
5.2 Parsec: Message Passing Timings
6.1 Parsec: TCP communication
6.2 PVM3: daemon-mediated UDP communication
6.3 PVM3: TCP communication
6.4 Spring: Performance

List of Figures

1.1 Process Graph
1.2 Mapping
2.1 Parsec Objects
3.1 System Overview
4.1 Transputer Network (part of Component 2)
4.2 Structural Overview
4.3 Resources (part of Component 0)
4.4 System graph
4.5 Conflict graph for Figure 4.3
4.6 User Process Graph
4.7 Guest Graph
4.8 (a) Guest graph, (b) Host graph for case 1, (c) Host graph for case 2
5.1 Decision Tree
6.1 Process Graph for Spring in Parsec
6.2 Graphical Interface for Mapping Tool in Parsec
6.3 Graphical Interface for Loading Tool in Parsec
A.1 Component 1 (Box 1)
A.2 Component 0 (Box 2)
A.3 Component 3 (Box 3)
A.4 Component 2 (Box 4)
A.5 Component 2 (Box 5)
A.6 Transputer System as an 8x9 mesh

Acknowledgment

I would like to thank both my supervisors, Dr. Samuel Chanson and Dr. Alan Wagner, for their support and guidance. I thank David Feldcamp, Mandeep Dhami, and H. V. Sreekantaswamy for their valuable assistance during my research work. My thanks go to Bill Gates for proofreading this thesis, Vishwa Ranjan for his continuous encouragement, and Raza Khan and Chris Romanzin for being there on those countless nights. I am grateful to Dr. Chanson and to the Department of Computer Science at the University of British Columbia for providing the financial support. Finally, I would like to express my acknowledgments to my parents, my constant source of inspiration.

Chapter 1

Introduction

Parallelism is one way to achieve computational speedup for a large class of applications.
A wide range of parallel computer architectures [59] are available to exploit parallelism. The most basic distinction between these architectures is whether the processors synchronously execute the same instruction on multiple data (SIMD) or asynchronously execute multiple instructions on multiple data (MIMD). MIMD machines are grouped into two categories according to their address-space organization [2]: message-passing architectures and shared-address-space architectures. In a message-passing architecture, each processing element (PE) consists of a processor and its local memory, and the PEs are interconnected by a high-speed network. MIMD message-passing computers are commonly referred to as multicomputers. In a shared-address-space architecture, all processors have read and write access to a shared address space. Processors interact by modifying data objects stored in the shared space. These computers are referred to as multiprocessors.

There are several factors that motivate the use of multicomputers for parallel computing. A multicomputer can be built out of inexpensive and powerful microprocessors and can scale easily to accommodate a large number of PEs. Performance of these machines has improved over the years with increasing CPU speed and decreasing message-passing time. For these reasons, multicomputers are viewed as low-cost, scalable, and high-performance parallel computers [35]. Examples of multicomputers include the Intel iPSC, Transputers, the CM-5, and the Intel Paragon. In recent years, networked workstations (workstations in a local area network) have emerged as an effective alternative to dedicated parallel machines [37].

1.1 Problems

Unfortunately, it is difficult for a naive user to program a multicomputer. Many multicomputers provide low-level and hardware-dependent programming environments (for example, NX/2 for the iPSC and Logical C for the transputer). Programmers in these environments must know the details of the machine architecture such as the number of processors, the network connection, the memory organization, the I/O processing, the operating system, and the communication and synchronization facilities. These programs guarantee high performance but at the cost of a complex design and development process. The hardware dependencies inherent in these programs also make them difficult to port to a different architecture. On the other hand, high-level programming environments [40, 42, 41] support common interfaces across different hardware platforms. These environments provide high-level abstractions to simplify program development, but their performance is poor. Thus there arises a trade-off between high-level and low-level programming environments. Developing a high-performance, high-level programming environment is a challenging task. The primary focus of this thesis is on two performance-related issues associated with high-level programming environments for multicomputers: the mapping problem and a runtime system for communication and synchronization.

Mapping problem

The first step in executing a parallel application on a multicomputer is to assign each of its processes to an individual processor. The problem of assigning these processes in order to minimize the communication cost is known as the mapping problem [17, 29]. The process assignment can be static or dynamic.
In static mapping, processes are assigned to processors prior to any program execution; in dynamic mapping, processes are dynamically reassigned to processors in response to changing load on the processors and the network. The way in which a mapping algorithm allocates the processes directly affects the program execution time. For example, if two communicating processes are assigned to two non-adjacent processors, then messages are delayed because of extra copying at the intermediate nodes. Similarly, assigning the processes to the same processor serializes the communication and hence slows down the execution. The problem of finding an optimal task assignment is known to be NP-complete [3]. This problem is even more difficult for reconfigurable multicomputers. In these multicomputers, the interconnection network can be changed using control switches that are configured by software. A good mapping requires not only an efficient algorithm but an elegant way to formulate the problem. The difficulty lies in formulating the dynamic characteristics of the interconnection network.

Runtime system

Multicomputers support task-level parallelism based on a message-passing paradigm. A native runtime system provides message-passing primitives for communication and synchronization which ensure high performance. Usually these primitives support only point-to-point send and receive operations. Higher-level message-passing systems, built on top of the native runtime system, provide flow control, buffer management, routing, and group communication. However, with these added functionalities, the high-level primitives are difficult to use. Given a wide range of primitives, it is up to the user to select the optimal message-passing primitives.

Ideally one would like to tailor the message passing to the type of communication, so that neighbour-to-neighbour, non-multiplexed processes can use the lowest-level, fastest mechanism, whereas higher-level multiplexing and routing is reserved only for the messages that need it. However, this information is not available until after the mapping is done. Such an optimization cannot be done in the application because the user code would need to be changed whenever the mapping changes. A runtime system should provide this type of optimization.

Apart from these performance-related problems, we also address portability. The tools which are hardware-dependent should be designed for all the target architectures and should present a generic user interface for application development across multiple platforms. Once an application is developed for a particular MIMD architecture, it cannot be run on other MIMD architectures because the tools and techniques used to parallelize the code are specific to a particular hardware platform. As a result, considerable effort is necessary to port code to other platforms. This will certainly discourage the user from developing applications for different parallel architectures.

1.2 Motivations

We first identify the relationships between the aforementioned problems and the target machine architecture and its native runtime system.

Target Hardware and Software Systems

Both the mapping problem and the runtime system are discussed in the context of a reconfigurable transputer-based multicomputer. The multicomputer consists of 75 INMOS T800 transputers and 10 C004 programmable crossbar switches. Each crossbar switch has 32 data links and a control link.
Any pair of data links can be connected by sending an appropriate software command to the control link. The system can be reconfigured into a variety of interconnection networks. There are two software tools available on the transputer system at UBC: Trollius [9] and Logical C [7, 8]. Logical C provides hardware-level message-passing primitives. The Trollius message-passing primitives resemble the OSI model of protocol architecture. Different primitives are available at the physical level, data-link level, network level, and transport level, in order of increasing functionality and decreasing efficiency.

The mapping problem is characterized by various factors such as the parallel program structure, the computational cost of a task, the communication cost among the tasks, the number of processors, the processors' connectivity, the speed of an individual processor, and the communication cost between the processors. It is independent of the machine architecture and the native runtime system. The objective of the message-passing library, built on top of the native runtime system, is to provide high-level, easy-to-use primitives without compromising performance. The optimization techniques employed by such a system depend on the native runtime system as well as on the machine architecture.

There are several reasons for having a mapping tool to allocate tasks in order to minimize the communication overhead.

1. Assignment of the processes to the processors should be simplified. Many programming environments like PVM [42], TOPSYS [44], and Trollius, and languages like Occam and Logical C, leave this up to the programmer. Although this is not difficult when there are only a few processes, it becomes more tedious as the number of processors and processes grows.

2. Different programming paradigms require different process topologies. In order to map these different topologies by hand on a particular parallel architecture, the user must know the underlying hardware interconnection. Such topology independence is easier when the mapping is performed automatically by the software system.

3. A reconfigurable network makes the mapping problem more difficult. One can enhance performance by using a mapping tool that can do better than a naive user.

4. A variety of runtime systems are available to support different functionalities for different types of communication. We sought to take advantage of these functionalities while at the same time providing an easy-to-use programming interface. Different communication primitives affect the performance differently. The mapping should be done in such a way that the total communication cost is minimized. This allows one to take advantage of the lowest-level (and most efficient) communication primitives. However, such optimization by hand is not desirable for the programmer because every time a program structure changes, the mapping information changes, and so does the optimization code. Hence both the mapping and the optimization should be done automatically.

Example

In order to motivate the problems associated with optimizing communication, we present an example which shows how mapping affects the optimization and what overheads are associated with the Trollius communication primitives and the Logical C channel primitives. A process graph is mapped onto the transputer system, and the message-passing time between every pair of processes is measured.

Figure 1.1 shows a five-node complete graph.
This graph is mapped on component 0 of our transputer system (Figure A.2), which has 25 transputers, where transputer T0 is directly connected to the host machine (a Sun Sparc machine). Component 0 has two rings of 8 transputers each (T0-T7 and T16-T23) and one chain of 9 transputers (T8-T15, Ta).

[Figure 1.1: Process Graph. S: host process; Pi: transputer processes.]

The transputers within a ring or a chain are connected on hard links, whereas two transputers across a ring and a chain can be connected over the crossbars. The mapping of the process graph onto component 0 is shown in Figure 1.2. The dilation of an edge in the process graph is defined as the number of physical links the edge is mapped onto. Edges eSP1, eSP2, eSP3, eSP4, and eP0P4 have dilation greater than one. Logical routes include all the physical links between every pair of communicating processes.

Depending on the mapping done (which determines the dilation of the edges in the process graph and the physical links that are multiplexed), a specific Trollius function can be used. For example, the two processes P0 and P1 are mapped on the adjacent transputers Ta and T15 respectively. However, the physical link between Ta and T15 is multiplexed over four different edges in the process graph. Hence the physical-level or data-link-level Trollius primitives or the Logical C channel primitives cannot be used for communication between P0 and P1. One is then limited to the network-level or transport-level Trollius primitives.

Table 1.1 shows the message-passing time between every pair of processes for the given mapping. In this experiment 1000 messages, each of 4K bytes, were sent from the source node to the destination node. The round-trip time is measured for message passing between every process pair. Two communication levels are used: network level and hardware level. At the hardware level, Logical C primitives are used instead of physical-level Trollius primitives to minimize the delay. (To use Logical C primitives from within Trollius, it is required to set up a virtual circuit using data-link-level or network-level Trollius primitives. This is described in detail in Chapter 5.)

[Figure 1.2: Mapping. The figure shows the physical links, the edges in the process graph with dilation > 1, and the logical routes for the dilated edges.]

    Source node   Dest node   Communication level   Time (in sec)   Transmission rate (in MBytes/sec)
    P0            P1          Network                6.060          0.675
    P1            P2          Hardware               4.864          0.842
    P2            P4          Hardware               7.025          0.582
    P1            P4          Network                7.633          0.536
    P0            P4          Network               13.318          0.307

Table 1.1: Message Passing Timings

Process pairs P0-P1 and P1-P2 communicate over the hard links (the two transputers are directly connected to each other), whereas P2-P4 and P1-P4 communicate over the crossbars (in this case, the transputer links are connected to a common crossbar switch). The transmission rate for P1-P2 is higher than that for P0-P1 because the hardware-level primitives add little overhead compared to the network-level primitives (network-level primitives incur the cost of buffering, routing, and packetization). Also, communication on a crossbar is slower than on a direct link (note the times for P0-P1 versus P1-P4, or P1-P2 versus P2-P4). Edge eP0P4 is dilated and hence a message travels over multiple links, which results in poor performance (i.e., a higher message-passing time between P0 and P4).
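As a check on Table 1.1, the transmission rate appears to be the one-way payload divided by the measured round-trip time; assuming 4K = 4096 bytes and 1 MByte = 10^6 bytes, the P0-P1 entry works out as

\[ \text{rate} = \frac{1000 \times 4096~\text{bytes}}{6.060~\text{s}} \approx 0.676~\text{MBytes/sec}, \]

which agrees with the 0.675 MBytes/sec reported in the table to within rounding; the other rows agree similarly.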
It is possible to use only the network-level functions in the above example without any need to know the mapping details. However, to achieve better performance it is necessary to select the optimal communication primitives based on the configuration information (which includes the dilation of the edges and the multiplexing over the physical links) available from the mapping process.

Workstation Environment

A network of workstations has evolved into a viable programming environment for parallel applications [37]. A variety of systems are available (§2.1) to program in a networked environment. Some of the motivations for using networked workstations for parallel and distributed computing are:

1. They provide inexpensive computing resources to develop parallel applications and are capable of delivering the performance of a parallel computer [47].

2. They are easy to use, and many software tools are available in a familiar environment for program development, such as optimizing compilers, debuggers, and performance monitors.
The runtime system supports point-to-point and collective communication (scatter/gather operation). The mapping tool statically compiles the program structure into the configuration infor-mation. Based on this information, the runtime system selects the optimal Trollius prim-itives for communication. The optimization is entirely transparent to the user. However, the semantics of the message passing such as synchronous vs. buffered communication are provided by the user. In the workstation environment, the mapping tool and the runtime system were ported to a network of Sun Sparc workstations. A static load balancing algorithm was developed to assign processes to the processors and a runtime system was designed that uses T C P communication services. Both environments (the transputer system and the networked workstations) share the same user interface and program code and thus location trans-parency is achieved (i.e., the user can develop an application in either environment and then transparently move the code). Chapter 1. Introduction 12 1.4 Thesis Context Developing a parallel application is a complex process. It includes several steps: program structuring, specifying parallel control and communication, mapping, program loading, runtime support, monitoring, visualization, and debugging. Ideally all of the above should be provided by one integrated environment. Parsec [32] is one such environment for programming multicomputers which is being developed at U B C . It provides program modules (i.e., skeletons, templates) that are partially implemented virtual machines corresponding to a particular paradigm. Parsec has various constructs to define the structural information required by our optimization tools. It supports a graphical interface to create process graphs and provides three objects, process, port, and channel, to specify explicit parallelism in a program. The mapping tool and the runtime system for message passing are integrated with Parsec to optimize the communication within the environment. Parsec provides an X-window based user interface to the mapping tool and also supports a hardware database which stores information such as the number of processors, processor name, processor speed, and connectivity. In the remainder of the thesis, these tools are discussed in the context of Parsec. 1.5 Thesis Organization The remainder of this thesis is organized as follows. In Chapter 2, the relevant literature is reviewed. We briefly survey different programming environments, message-passing systems, and mapping tools. Parsec objects which provide structural information to the mapping tool and the runtime system are also described in this chapter. The mapping tool and the runtime system are integrated with Parsec. Although the tools differ in their implementation on each architecture, they possess the same Chapter 1. Introduction 13 functionality and user interface. This integration along with the Parsec communication primitives are described in Chapter 3. The Parsec mapping tool for a transputer system is presented in Chapter 4. The Parsec runtime system for a transputer system is described in Chapter 5. Native runtime systems (Trollius and Logical C) and optimization techniques employed by the Parsec runtime system are described in this chapter. Chapter 6 discusses the tool with respect to the workstation environment. In order to demonstrate portability, we also present a parallel application that is developed in the Parsec environment and run on both the architectures. 
Conclusions are drawn in Chapter 7, followed by a brief discussion of future work.

Chapter 2

Related Work and Background

There are two distinct approaches to developing parallel programming systems. The first approach is to develop a programming environment that provides tools for the different phases of program development, such as defining parallel code and assigning processes to a parallel architecture. Usually in these systems, a standard sequential language is extended by providing message-passing libraries; examples include PVM [42], Topsys [44], Express [46], and Frameworks [47]. A second approach is to develop new parallel languages and compilers based on some high-level programming paradigms. Compilers are responsible for detecting and managing parallelism; examples include Linda [48], Jade [51], and DINO [50].

In Parsec, the parallel programming environment developed at UBC, our goal is to obtain better performance and simplify code development. A mapping tool and runtime system were developed and integrated with Parsec to achieve these goals. Like other programming environments, Parsec helps the user to specify parallel program structures, binds processes to processors, and provides primitives for communication and synchronization. However, the Parsec approach differs from many tools in that the mapping tool statically compiles the program structure into information which can be used by the runtime system to select the optimal communication primitives supported by the native runtime system.

In this chapter we briefly survey the related work. In §2.1, we present various parallel programming environments with the main emphasis on communication-related issues. A compiler-based approach is described in §2.2. The scope of this thesis is limited to a mapping tool and runtime system. In §2.3, we discuss transputer mapping tools (namely, Marc [28], Oregami [31], and Prep-P [29]), followed by message-passing systems for multicomputers such as Zipcode [53] and Via [54] in §2.4. Finally, in §2.5, we describe the three Parsec objects, process, port, and channel, which are used to define the structural information required by our tools.

2.1 Parallel Programming Environment

PVM3

Parallel Virtual Machine (PVM) [42] is a software system that allows a heterogeneous network of parallel and serial computers to appear as a single computational resource. It consists of a set of user-interface primitives, a daemon that runs on every participating machine, and a library for process creation, management, communication, and synchronization, along with other housekeeping facilities. PVM was primarily designed for networked workstations; however, it has also been ported to the Intel iPSC/860 multicomputer. In the latter case, the native message-passing routines are used for inter-node communication. In PVM, a programmer is responsible for configuring a virtual machine. As yet, it does not provide load balancing, partitioning, or mapping. PVM supports a dynamically reconfigurable runtime system. It includes a few simple message-passing primitives for send/receive, broadcast, and multicast operations. It is the programmer's job to structure and implement parallelism and communication types such as point-to-point communication and group communication. The user is also responsible for buffer management and message packing/unpacking. In terms of communication cost, a daemon on each machine adds to the cost, as a message destined for a remote machine is routed through
In terms of communication cost, a daemon on each machine adds to the cost as the message (destined for remote machine) is routed through Chapter 2. Related Work and Background 16 the daemons. In general, no communication optimization is provided by the runtime system. In the most recent version of P V M (PVM3.x), the runtime system supports two types of communications: daemon mediated U D P communication and direct T C P com-munication. Unlike P V M , Parsec runtime system utilizes statically configured structural information to perform optimization. In the workstation environment, Parsec supports only T C P communication. In Chapter 6, the performance of our system is compared to that of P V M 3 T O P S Y S Tools for Parallel Systems (TOPSYS) [44] is an integrated, hierarchical tool environment for distributed multiprocessors. The main design goals of Topsys are ease of use and portability. It was originally implemented on an iPSC/860 and later ported to Sun Sparc workstations. Parsec followed similar design approach in that it was originally implemented on a transputer system and later ported to networked workstations. Topsys consists of a parallel debugger, a performance analyzer, a graphical animation tool, and an interface for all of the above tools. It supports M M K (multiprocessor multitasking kernel), a parallel library for process creation and communication. Topsys provides three objects for communication and synchronization: process, mailbox and events. Once the programs are written, the user must specify the connectivity of the objects and the mapping of these objects onto the processors. This information is supplied in a special text file called the mapping file. M M K provides only point-to-point communication and no communication optimization is done in the system as the communication objects are defined statically by the user. Good monitoring support is provided with back-end tools like a debugger and a performance analyzer. However, there are no front-end tools (tools to implement parallel programming paradigm, specify parallelism, or perform task allocation and communication optimization) to help during program development. Chapter 2. Related Work and Background 17 Enterprise Enterprise [40] is a graphical programming environment for distributed systems. It pro-vides tools to design, code, debug, monitor, and execute the parallel applications. Enter-prise programs run on a network of workstations. A user attaches assets to the sequential code to specify parallelism in a program. The assets are similar to the templates or pro-gramming paradigms that represent high-level techniques for exploiting parallelism. The template-based approach was proposed in Framework [47], a precursor to Enterprise. To specify parallelism, Enterprise uses an analogy between the structure of a parallel pro-gram and the structure of a business organization. This analogy provides different types of assets (e.g., contracts, departments, individuals), each defining a different way of par-titioning an operation across multiple processes. The system automatically inserts the code to handle communication and synchronization into a module based on its attached asset. It also loads the processes on machines for execution and provides limited process migration and dynamic load balancing. P I E The Programming and Instrumentation Environment (PIE) [45], developed at C M U , is directed towards generation of performance efficient parallel programs. 
Like Parsec, PIE provides a level of abstraction between the algorithm and the parallel architecture. It is called an Implementation Machine (IM) and corresponds to a particular parallel programming paradigm like master-slave or divide and conquer. The pre-coded structures (which encapsulate communication, synchronization, and data partitioning) are made available to the user in the form of templates. The user is required to provide only the sequential code. The main emphasis of PIE is on performance monitoring and debugging. PIE does not support tools for mapping and loading, and a programmer is expected to know the details of the hardware architecture and the underlying operating system. Message passing is supported by the IPC mechanism of the operating system. Templates in Parsec are similar to the assets in Enterprise and the Implementation Machines in PIE. However, the main difference is that Parsec also supports the mapping and loading tools and the runtime system for optimized communication.

2.2 Parallel Languages and Compilers

High-level parallel languages provide programming paradigms that hide the complexity associated with explicitly parallelizing the code. Such a language is built around a set of abstractions that express a parallel algorithm. The implementation of such a high-level language hides low-level details such as mapping and presents an operating-system- and hardware-independent interface to the user. Some parallel languages, such as Fortran 90 [49] and Linda [48], are extensions to existing sequential languages, while others, such as Jade [51] and DINO [50], are completely new.

Like parallel languages, parallelizing compilers also free programmers from the difficult task of writing low-level parallel code. They are responsible for extracting parallelism from the code and for effectively scheduling it on a target parallel machine. These compilers perform static optimization to generate efficient machine code. However, the compiler-based approach is restricted to low-level optimization in that only basic send and receive operations are supported. Another problem is that the compilers are not portable across different machines.

2.3 Mapping tools

Many good mapping algorithms have been proposed in the theoretical literature to map processes onto various fixed interconnection networks. They use a variety of search heuristics. For example, the pairwise interchange algorithm by Bokhari [17] used probabilistic jumps to solve the mapping problem on a specific array of processors (i.e., the finite element machine). A greedy algorithm by Chen and Gehringer [21] and a mapping scheme (initial assignment followed by pairwise exchange) by Lee and Aggarwal [20] are targeted at hypercube architectures. Simulated annealing [22] and a recursive divide-and-conquer strategy [23] based on the Kernighan-Lin graph bisection heuristic for hypercubes were proposed by Ramanujam et al. In all these approaches, the topology of the target architecture was fixed, and the algorithms took advantage of this fact. There is another approach where the mapping tools are designed to accommodate one or more mapping algorithms, each suitable for a particular topology type. Three such tools are described below.

MARC

MARC [28] is an automatic configuration tool for mapping parallel programs onto a transputer network.
It includes methods for load balancing and communication-optimized process distribution onto arbitrary topologies, as well as efficient and secure routing strategies. The tool accepts a program written in Occam [13] and transforms it into a configured program for a fixed network of transputers. It constructs a software graph from the Occam program and a hardware graph from the transputer interconnection and attempts to optimize the mapping of the software graph onto the hardware graph. The actual mapping takes place on the entire transputer network defined by the hardware graph. The mapping algorithm uses a distributed method based on diffusion, where neighbouring processors exchange information about their load and communication cost, and processes are moved based on some optimization condition. In the case of a reconfigurable transputer network, the desired topology should already be switched, and thus a user is expected to know the hardware configuration. The input program is required to be in the Occam language, and this limits the scope of MARC to transputer networks.

Oregami

Oregami [31] provides automatic and guided mapping of parallel programs onto parallel architectures in order to achieve portability and maximal performance. The system has two key features: (1) an ability to take advantage of the regularity present in both the computation structure and the interconnection network, and (2) a desire to balance the user's knowledge and heuristics with efficient mapping algorithms. Oregami consists of three components:

1. LaRCS, a language that allows the user to express the input graph. The input graph specifies regularity in the communication topology (i.e., which tasks communicate with each other) and regularity in the execution and communication patterns over time (i.e., the phase behaviour of an algorithm).

2. Mapper, a library of mapping algorithms that allows canned mapping and the mapping of regular (like tree, mesh, and ring) and irregular topologies.

3. Metrics, an interactive graphics tool that provides a way to perform manual mapping, where a user can specify a particular process-to-processor binding.

In Oregami, the hardware is assumed to be fixed and connected by some regular network topology. In contrast, our mapping strategy is aimed at a reconfigurable network (which itself is not regular (§4.1)) to map a variety of topologies.

Prep-P

Preprocessor for Poker (Prep-P) [30] is a mapping preprocessor that takes as input a set of user processes whose communication graph is described in a Graph Description Language (GDL). The mapping takes place in three phases: contraction, placement, and routing. GDL and a compiler (GraphCom) facilitate representation of the input graph in a layout-independent form. The algorithm used for contraction is a local neighbourhood search, and the algorithm for placement is a modified version of the Kernighan and Lin circuit placement algorithm [24]. Once contraction and placement take place, Prep-P provides a mechanism for multiplexing mapped processes over a single processor. In Prep-P, the target hardware is the CHiP machine, a configurable message-passing parallel architecture which can simulate a variety of network topologies. The dependency on a compiler makes Prep-P difficult to port to other hardware architectures. Tmap [36], a programming environment developed at UBC, uses Prep-P as a front-end tool for mapping on our transputer system.
It also includes a runtime system for communication built on top of Trollius [12]. Like Prep-P, mapping in Parsec takes place in the same three phases. But unlike Prep-P and Tmap, a new technique is used to generate the mapping (§4.3), and the mapping tool is more flexibly incorporated into the system.

2.4 Message-passing Libraries

Most multicomputers have hardware support for message passing. However, it often differs from machine to machine and can sometimes be complex to use. Message-passing libraries have been designed to provide performance and convenient functionality. Given the proliferation of message-passing standards, there has been an attempt to standardize the libraries. MPI [52] (Message Passing Interface) is a standard that includes a large number of functions to support various communication types and semantics.

It is necessary to design a platform-specific message-passing library which will support a few easy-to-use primitives for optimized communication. Zipcode [53] is one such system, which also influenced the MPI standard. It is a dynamically configurable runtime system which provides optimized communication on a variety of platforms including the nCUBE/2, iPSC/860, Delta, the CM-5 scalar machine, and a network of Sun Sparc and IBM RS/6000 machines. It introduces an object called a mailer that encapsulates groups, contexts, and virtual topologies and provides the appropriate scope for all communication operations in Zipcode. Parsec differs from Zipcode in that optimization is done statically and encoded in the form of load-time parameters to the message-passing primitives. Thus it avoids the overheads associated with dynamic optimization.

Via [54], a structured message-passing system, has a hierarchy of objects: task class, task set, and task. This hierarchy is useful in many ways. First, it allows the general properties of a task to be defined in libraries. Second, a group of tasks can be renamed to avoid name collisions, and third, it allows programs to contain groups of tasks which have the same properties but are addressed differently. Each task has a number of named ports. The task and its ports together provide the communication context. To optimize the communication, Via makes use of configuration information provided by the user. Parsec has a similar hierarchy of objects but, unlike Via, the configuration information is automatically generated by the mapping tool and used by the runtime system to optimize communication. In the following section, we describe the Parsec objects used to specify a parallel program structure.

2.5 Background

In Parsec, there is a clear distinction between sequential program code and the specification of parallelism. The specification captures two notions of communication: communication structure and communication semantics. Communication structure encapsulates declarative information such as which processes communicate with each other, their connectivity, and the communication intensity (number of messages transferred). The semantics specify whether communication is synchronous or buffered. The mapping tool converts the communication structure into the detailed configuration information needed by the runtime system. The runtime system, based on this information and the user-specified semantics, optimizes communication (this is explained in Chapter 5). Parsec provides a graphical user interface to specify the communication structure of a parallel program.
It uses three primitive types of objects: process, port, and channel. Process, a global object, is the smallest decomposable unit of computation in a parallel application. Port, a local object bound to a process, is used for communication on a channel, which is uniquely defined by a pair of process-port bindings. Figure 2.1 shows the relationship between these objects. Processes A and B each have two ports (Next and Prev) associated with them. The two processes communicate on a channel defined by the {A, Next; B, Prev} association. The port Prev of process A and the port Next of process B are unbound and hence cannot be used for communication.

[Figure 2.1: Parsec Objects]

Process

Process is the most widely used type of object that specifies the structure of a parallel program. It is an instance of a process type, another object provided by Parsec for process reuse. A process type has two types of information: construction information and process structure information. Construction information includes the source files, libraries, and runtime parameters that are needed to build an executable image of a process. Process structure information specifies the port names and the process weight or priority. A new process is created by specifying its type and by binding its named ports to channels. This logical division makes the high-level program structure independent of the actual process instances.

Port

In a parallel program, processes are often replicated (multiple instances of a process type) to form a program structure. As with non-replicated processes, these replicas must be bound with explicit connections to other processes. Changing the statically configured program structure requires that these connections often be changed. However, if the process is to remain replicable, its internal specification must remain independent of its external connections. The port object is used to facilitate this isolation. A port is a named variable whose value specifies a data structure that includes the configuration information generated by the mapping tool. Thus, the port hides the complexities of the underlying communication medium. Ports also encapsulate communication scope. A single port can be bound to several other ports and thus provides a simple way to perform collective (one-to-many or many-to-one) communication.

Channel

Channels define the communication paths in a program. In Parsec, a channel itself has no name, but a channel end is defined by a process-port binding. Thus a channel is an association of one process-port pair connected to another process-port pair. Channels possess weights which denote the traffic density (the number of messages passing through the channel). These weights are used during mapping to determine the best possible way to map the channels onto the hardware architecture. A channel implementation is hardware-dependent. On multicomputers, channels are implemented over the physical links; on workstations, they are built on top of some layered protocol and use the service access points of that protocol (e.g., sockets on the TCP/IP protocol).

The process-port-channel model has a close resemblance to the graph-theoretic approach used in the mapping tool. In a process graph, a node denotes the process-port bindings and an edge denotes the channel. Similarly, the port internals provide the information required by the message-passing primitives for communication optimization.
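To make the model concrete, the following is a minimal C sketch of how the process-port-channel structure might be represented; all type and field names here are illustrative assumptions, not Parsec's actual definitions.

    /* Illustrative sketch only: hypothetical names, not Parsec's types. */
    #define MAX_PORTS 8

    typedef struct Channel {
        int event;        /* globally unique synchronization event id    */
        int weight;       /* traffic density: messages over this channel */
        int dilation;     /* number of physical links the channel spans  */
    } Channel;

    typedef struct Port {
        const char *name; /* e.g. "Next" or "Prev"                       */
        Channel    *chan; /* NULL while the port is unbound              */
    } Port;

    typedef struct Process {
        const char *type_name;        /* process type, enabling reuse    */
        int         weight;           /* process weight or priority      */
        Port        ports[MAX_PORTS]; /* named ports local to a process  */
        int         nports;
    } Process;

Under this reading, a channel end is simply a (process, port) pair, so the {A, Next; B, Prev} association of Figure 2.1 amounts to two bound ports pointing at one shared Channel record.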
These objects are further discussed in that context in Chapters 4 and 5. Apart from these primitive objects, the Parsec interface has process groups, parameterized process graphs, and modules to support reusability, scalability, and ease of use [32] [38].

Chapter 3

System Design

The various programming systems discussed in the last chapter offer different tools and techniques for developing efficient parallel programs. However, the user is responsible for binding a parallel program structure to a parallel architecture. Also, the runtime systems provided by these programming environments do not support automatic communication optimization. In Parsec, the tools are designed to support these features. As described in §2.5, Parsec supports the process-port-channel model for specifying a parallel program. In this chapter, we present a tool that automatically maps this program structure onto a parallel architecture. We also introduce the runtime system that provides a high-level message-passing library and uses the mapping information to optimize the communication.

In §3.1, we describe the integration of our tools with Parsec. This system presents a conceptual model that defines a generic tool interface and functionality for any type of multicomputer. The mapping tool is described in §3.2, and the different phases of the mapping process are defined in that section. In §3.3, the Parsec process model is defined and its message-passing primitives are described. Some of the features of the runtime system are directly influenced by the underlying operating system, the network management, and the hardware support for communication. However, the primitives described here are generic and hide platform-specific dependencies.

3.1 System Overview

The mapping tool and the runtime system are integrated with Parsec. Figure 3.1 shows their interaction with the Parsec environment. A database is used for data integration between the various tools. In Parsec, a parallel program is specified in the form of a process graph using a graphical interface. In a process graph, each node denotes a process, each edge denotes a channel, and the connection between the edge and the node specifies a port. This process graph, along with the hardware resources such as the number of processors, the processor type, the number of links, and the interconnection of the network, is stored in the database.

[Figure 3.1: System Overview]

The mapping tool takes as input the process graph and the hardware resources and performs the mapping in three distinct steps: contraction, assignment, and routing. Its output is the process-processor bindings and platform-dependent information. The platform-dependent information includes the communication paths between every pair of processes. For the transputer system, these paths specify transputer links, crossbar switches, and crossbar links. For networked workstations, the paths may specify bridges or routers. The mapping tool can be viewed as a static compilation tool that converts structural information into detailed platform-specific configuration information. This configuration information is stored in the database for further use.

A loader tool provides the database interface between the mapping tool, the runtime system, and the program execution. It passes the optimized configuration information in the form of load-time parameters to the runtime system.
The loader also generates the script to boot the network that is configured by the mapping tool and to load the application program onto the multicomputer.

The Parsec runtime system has a few simple message-passing primitives that are built on top of the native runtime system. These primitives hide the complexity associated with low-level communication while maintaining the efficiency of the low-level primitives. For the transputer system, the Parsec primitives select one of the many Trollius functions provided at the different levels (physical, data-link, network, and transport). In the case of workstations, they use an IPC mechanism supported by the operating system. The runtime system has two distinct phases: an initialization phase and an optimization phase. During initialization, the configuration information is converted into the data structures corresponding to the ports (§2.5) so that the optimal primitives can communicate on those ports. During the optimization phase, the port data structure is analyzed and the most efficient low-level function is selected. As we will see in Chapter 5, this requires that physical links be reserved for the lowest level of message passing. As shown in Figure 3.1, the user can also specify communication semantics such as synchronous and buffered communication. A timeout mechanism is also provided that allows the user to control the blocking of every send and receive operation on a per-port basis. Finally, the optimized message-passing primitives are compiled and linked with the application program to form an executable image.

3.2 The Mapping Tool

Methodology

The mapping problem is related to two basic problems: task allocation and routing. Stone [25] proposed a network flow algorithm for task assignment in order to minimize interprocess communication. Various other approaches pursued are: simulated annealing [22], heuristics [26], the recursive bipartitioning method [23], and initial assignment followed by pairwise exchange [20] or boundary refinement [27]. Since obtaining optimal solutions is intractable, these algorithms are not guaranteed to produce the optimal solution. We have proposed a graph-oriented mapping strategy that uses a greedy algorithm combined with heuristics to find a solution. Greedy algorithms were used by systems like Prep-P and in the theoretical literature on mapping [17, 21].

The mapping problem is represented as one that relates a process graph, corresponding to a parallel program structure, to a processor graph, corresponding to a parallel architecture. A node in a process graph indicates a user process and an edge indicates a communication path between the two adjacent processes. The cost of communication between two processes is indicated by the weight associated with every edge of the process graph. In the case of non-adjacent nodes, the communication path is a sequence of such edges. Similarly, in a processor graph, a node represents a PE and an edge is the physical link between two PEs. Each link of a processor graph is bidirectional, so the underlying processor graph is undirected.
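In the standard graph-theoretic notation (a sketch in our own notation; the thesis's precise cost function is given in §4.3.1), this reads as follows. Let $G_p = (V_p, E_p)$ be the process graph with a weight $w(u,v)$ on each edge, and let $G_h = (V_h, E_h)$ be the processor graph. A mapping is a function $\varphi : V_p \rightarrow V_h$, and one common objective is to minimize the total weighted communication distance

\[ \min_{\varphi} \sum_{(u,v) \in E_p} w(u,v) \cdot d_{G_h}\bigl(\varphi(u), \varphi(v)\bigr), \]

where $d_{G_h}$ is the shortest-path (hop-count) distance in the processor graph, so an edge mapped with dilation one contributes only its own weight.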
Functional Overview

The mapping process has the following phases: resource allocation, contraction, assignment, and routing. Dividing the mapping problem into different phases allows one to provide a different solution to each individual phase. Thus it is possible to implement different algorithms, each best suited for a particular task, for the different phases of the mapping process. The same approach is taken by the Prep-P [30] mapping tool. The difference with respect to the Parsec mapping tool, however, lies in the tasks performed in each of these phases. For example, the routing phase in the Parsec mapping tool analyzes the assignment and generates the configuration information which is required by the runtime system for communication optimization. The basic inputs to the mapping tool are the hardware resources and a process multigraph (a multigraph is an undirected graph that can have multiple edges between vertices, as well as self-loops).

1. Resource Allocation
This module determines which ports (hardware ports, e.g., the port that connects a transputer to the host machine) and processors to use, and which processors are connected to the host machine (i.e., root processors). The following information is required by this module: the different logical components of the system and how they are connected, the number of processors, the machine type of the processors, and the number of processes and their machine types. A hardware graph is built from this information and is used during assignment and routing.

2. Contraction
If the number of processes is more than the number of processors available, then more than one process will need to run on each processor. This task is performed by contraction. The process graph is contracted in such a way that the number of processes is less than or equal to the number of processors in the target machine. This is similar to partitioning a process graph of n nodes into m groups, m < n, such that a given cost function is minimized. An example cost function is the communication cost over the edges. Several algorithms are available to perform the contraction, including simulated annealing, local neighbourhood search, and branch-and-bound. However, to keep the mapping problem simple, we ask the user to specify which processes need to be assigned to the same processor. The input to this module is a process multigraph and the output is an abstract process graph where each node is a virtual process which represents one or more user processes. This is a many-to-one mapping from the user multigraph to the abstract process graph.

3. Assignment
This phase assigns a single virtual process to a single processor of the target machine. This is a one-to-one assignment. The assignment strategy is driven by the type of algorithm used for mapping, and its main objective is to minimize the total cost of communication in the process graph.

4. Routing
Once the node assignment is done, there are two levels at which routing takes place: physical routing and logical routing.

physical routing: in a dynamically reconfigurable system, the interconnection network is defined by setting the control switches between the processor links. All the physical links between every pair of processors are statically configured. This results in a fixed interconnection graph.

logical routing: from a fixed interconnection graph, the communication paths between all the non-local processes are determined. The channels of the abstract process graph are then assigned to the physical links in the interconnection graph.
5. Dumping Information
This phase gathers all the information generated in the previous phases and stores it into the Parsec database so that the loader tool and the runtime system can access it. This information includes physical wires, routes, the type of communication, and objects for synchronization. The contracted user processes within a virtual process are invisible to the assignment and routing phases; hence their routes are generated separately and stored into the database.

The mapping strategy is platform-dependent. We present a mapping strategy for the transputer system in Chapter 4 and for the workstation environment in Chapter 6.

3.3 Runtime System

The Parsec runtime system is built on top of the native runtime system. The implementation details are platform-dependent but the message-passing primitives are generic. We first present the Parsec process model, which specifies the communication and synchronization characteristics, and then describe the Parsec primitives.

3.3.1 Parsec Process Model

As described in §2.5, a parallel program in Parsec consists of a set of processes. A process is an entity that performs internal, unobservable actions and also interacts with other processes. An interaction implies synchronization among multiple processes. Such a synchronization is an observable action and may involve data exchange. The Parsec process model allows multiple processes to synchronize and thus supports one-to-many (scatter) and many-to-one (gather) communication. The communication can be either synchronous or buffered. In the case of multi-process synchronization, where a single process can receive data from more than one process, the synchronization behaviour is non-deterministic in that the sender's address cannot be known in advance.

The runtime system implements this model using processes, ports, and events, which correspond to the three static objects defined in Parsec: processes, ports, and channels. The approach is similar to that of Zipcode [53] and MPI [52], where it is characterized in terms of contexts of communication and communicators. A process was defined earlier in this section. A process group is a collection of processes and defines the scope of collective (scatter/gather) communication. It also provides an efficient mechanism to define and alter the characteristics of the processes, such as port bindings and process weights. We now describe the other two objects: events and ports.

Events

Events are the identifiers used for synchronization. They are globally unique integers, generated by the mapping tool. There is at least one event associated with every channel. Events are introduced in Parsec because they are required by the Trollius functions (described in Chapter 5). The Logical C functions do not use events; they access the communication channels directly. In the workstation environment, events are mapped to the port addresses used for communication over the sockets (described in Chapter 6). The user never manipulates these events but only observes the synchronization behaviour implied by them.

Ports

Ports are similar to mailers in Zipcode and communicators in MPI and define the context of communication. The ports encapsulate the configuration information generated by the mapping tool and make it available to the runtime system for optimization.
The information includes the physical addresses of the processors that embed all the members of a process group, and the events for every channel accessible through the port. The physical addresses and the events together provide the scope for all the communication operations in Parsec.

3.3.2 Message Passing Primitives

The runtime system supports two message-passing primitives: port_send and port_recv. They are used to select the optimal communication primitives provided by the native runtime system. All ports are declared and bound to a process through Parsec's graphical interface. Before any message is sent or received, every user process must initialize all its ports. This is accomplished by the following function:

    init_ports(int argc, char **argv, Port *p1, Port *p2, ..., NULL);

where p1, p2, ... are the names of the ports that are bound to a process. The names can be either defined by the user or defined by Parsec. Function init_ports reads the configuration information passed by the loader tool and stores it into the data structures corresponding to the ports. Other initialization tasks are specific to the native runtime system. For example, in our transputer system init_ports registers a process with the Trollius kernel and allocates soft channels (described in Chapter 5) for local communication. Once the communication is over, a process is exited with the following function call, which essentially frees all the port structures (and, in the case of the transputer system, also de-registers the user process):

    exit_ports(Port *p1, Port *p2, ..., NULL);

Processes send or receive data on a specific port by invoking two functions: port_send and port_recv, respectively. Timeout parameters are used to control the blocking of send and receive operations. Two parameters, one for the send operation and one for the receive operation, are defined for each port. The user can also specify whether the communication is buffered or synchronous. The default options are a zero timeout (i.e., blocking until the send or receive returns successfully) and buffered communication. Like the configuration information (generated automatically by the mapping tool), these user-controlled parameters are encapsulated in the port structures. The syntax of the port functions is as follows:

    port_send(Port *port, void *data, int size, int msgtype);
    port_recv(Port *port, void *data, int *size, int msgtype);

where port is a pointer to the port to (or from) which the process is sending (or receiving) the data. The parameter msgtype is either the predefined token RAW or a user-defined token for XDR data structures. The data structures can be primitive data types or user-defined structures. Parsec automatically generates XDR code to pack and unpack the data structures and performs the byte-ordering when needed. If port_send and port_recv are called with a user-defined message type, then the XDR code does the appropriate data conversion. For all XDR messages, the size argument in both functions should be NULL. The data argument is a pointer to an instance of the data structure for the send function and a pointer to a variable of the message type for the receive function. If the data structures contain elements that are accessed through pointers, they are allocated by the XDR code if the size pointer is NULL. In the case of the RAW message type, the XDR code is bypassed. For these messages, the data argument is a pointer to an existing buffer for both the send and the receive function. The size argument is the actual size of the buffer in the case of send; in the case of receive, it is set to the number of bytes that are received.
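To make the calling conventions concrete, the fragment below sketches a process that receives a RAW buffer on one port and forwards it on another. It is a minimal sketch: the header name parsec.h and the buffer size are assumptions, while init_ports, port_send, port_recv, exit_ports, and the RAW token are as described above.

    #include "parsec.h"     /* assumed header declaring Port, RAW, and the port primitives */

    Port Prev, Next;        /* ports bound to this process through the Parsec GUI */

    int main(int argc, char **argv)
    {
        char buf[4096];     /* illustrative buffer size */
        int  size;

        /* Read the loader-supplied configuration and set up both ports. */
        init_ports(argc, argv, &Prev, &Next, NULL);

        /* Receive a RAW message; size is set to the number of bytes received. */
        port_recv(&Prev, buf, &size, RAW);

        /* Forward it unchanged; for RAW sends, size is the actual buffer length. */
        port_send(&Next, buf, size, RAW);

        exit_ports(&Prev, &Next, NULL);
        return 0;
    }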
Features of the runtime system are summarized as follows:

1. The communication primitives are easy to use and portable across different hardware platforms.

2. They select the optimal communication primitives depending on the configuration information generated by the mapping tool. This type of optimization is platform-dependent and is described in §5.3.

3. They support point-to-point communication as well as scatter and gather communication.

4. They support data conversion for heterogeneous networks. The basic data types as well as user-defined complex data structures can be used.

5. Messages of arbitrary length are allowed, and the packetization is done internally.

In Chapter 5, the runtime system for the transputer system is described. This includes the underlying communication primitives and the optimization techniques employed by the runtime system. Finally, the Unix implementation of the runtime system is discussed in Chapter 6.

Chapter 4

Mapping Strategy on Multicomputers

Following the general design of the tool given in §3.2, we describe our mapping strategy on multicomputers. The mapping algorithm has been implemented on our configurable transputer system. A brief overview of the transputer system in the Department of Computer Science at the University of British Columbia is presented in §4.1. In §4.2, we describe the structural design of the mapping tool. The structural design shows the different components of the mapping tool. These components provide a layer of abstraction on top of the hardware architecture and the parallel program structure. This allows the mapping to be done in a hardware-independent way. The mapping strategy is described in §4.3. We discuss various cost metrics, an objective function, a mapping algorithm, and the experimental results of our mapping tool.

4.1 Hardware System

The hardware system consists of 75 INMOS T800 transputers and 10 C004 programmable crossbar switches. Each transputer has a 32-bit, 10 MIPS CPU, a 64-bit FPU, 4 KB of on-chip RAM, and four bit-serial communication links. Each node is a single transputer with external (off-chip) memory: 64 nodes have one MByte, 10 nodes have two MBytes, and there is one node with 16 MBytes. Each crossbar switch has 32 data links and a control link. Any pair of data links can be connected by sending the appropriate software command to the control link. Crossbars can thus have up to 16 bidirectional links.

The system is reconfigurable and supports a variety of interconnection networks. The transputer system is accessed by a Sun Sparc2 workstation through an S-bus card with four data ports. The nodes are partitioned into four reset components. Each component has one transputer directly connected to the host through a data port. Trollius, a single-user, multitasking operating system, runs on each node. This allows only one user to access any node at a given time. However, the logical partitioning allows up to four users to use the system concurrently. Once a particular component is allocated, the user has complete control over all the transputers in it. The entire transputer network resides in five physical boxes¹. Links 0 and 1 of most transputers are hardwired to one another, whereas links 2 and 3 are connected to the crossbar links.
Some transputers in box 4 have all their links connected to the crossbars. There are six 8-node rings and one 8-node chain. The remaining transputers are either connected to the host machine or to the crossbars. Thus, the network is not regular. Figure 4.1 shows the part of our transputer network which is configured as described above.

[Figure 4.1: Transputer Network (part of Component 2)]

¹There are four components partitioned into five physical boxes, as shown in Appendix A.

4.2 Structural Overview

The different components of the mapping tool are shown in Figure 4.2. The components are connected by directed edges that show the dependencies between them. In the rest of the chapter we will use the terminology defined in this section to describe the various graph structures used in the mapping process.

[Figure 4.2: Structural Overview]

1. Hardware Database (HWDB)
The hardware database includes the machine-dependent information on the nodes and their connectivity. A node is either a processor, a crossbar, or a host machine. The node information includes the logical name of the node, its architecture type (transputer, crossbar, or host machine), the component to which it belongs, the number of communication links, the memory size, and the processor speed. A transputer link is either hardwired to another transputer link or connected to a crossbar link. In the latter case, connections between the transputers depend on the crossbar settings. Thus, multiple physical links (Figure 4.3) are possible between two transputers, depending on the number of hardwired links and the number of crossbar switches. All such transputer interconnections are stored in the database. Given two transputers and their links, a database query can determine whether there exists a path between them.

2. Resource Allocation Module (Resources)
This module retrieves information from the hardware database. The module allocates the resources from the component requested by the user. The resources include the number of wires (network connections), a data structure for each wire (which includes transputer names, transputer links, current crossbar settings, and the weight on each wire), the names of all the processors (i.e., the transputers and the host machine), and the names of all the transputers that are connected to the host machine. Only these resources are used by the remaining phases of the mapping tool.

[Figure 4.3: Resources (part of Component 0)]

The four hardware components of our system are shown in Appendix A. As an example, consider component 0, which has 25 transputers (T0-T23 and Ta). In Figure 4.3, we have shown the connections between Ta, T15, and Xanadu (the host machine). The data structures for the wires, retrieved by the resource allocation module, are given in Table 4.1.

Wire   PE1      Lk1   PE2      Lk2   Wt   Xlen   Xname   Xlk1   Xlk2
w1     Ta       1     T15      0     1    0      -       -      -
w2     Ta       2     T15      3     2    1      xb7     16     15
w3     Ta       3     T15      3     2    1      xb7     17     15
w4     Xanadu   0     Ta       0     X    0      -       -      -
w'1    T15      0     Ta       1     1    0      -       -      -
w'2    T15      3     Ta       2     2    1      xb7     15     16
w'3    T15      3     Ta       3     2    1      xb7     15     17
w'4    Ta       0     Xanadu   0     X    0      -       -      -

Table 4.1: Wire Structures

Each wire is unidirectional. If two transputers are hardwired, then Xlen is 0, indicating that there are no crossbar connections. In the case of a positive Xlen (i.e., the wire has intermediate crossbar links), the name of the crossbar switch and its links are stored in the data structure. Wt indicates the communication cost on a wire; for crossbar links this cost is double the cost of a direct link. The processes running on Xanadu and Ta are specified by the user and are assigned to those processors during the initial phase of the mapping algorithm. Hence the communication cost between Xanadu and Ta is not considered for the mapping.
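As a minimal sketch, a wire record mirroring the columns of Table 4.1 might be declared as follows; the type and field names are assumptions, since the thesis stores these records in the database rather than prescribing a C layout:

    /* One unidirectional wire, mirroring Table 4.1 (names are illustrative). */
    typedef struct {
        char pe1[16];    /* source node, e.g. "Ta"                         */
        int  lk1;        /* source transputer link (0..3)                  */
        char pe2[16];    /* destination node, e.g. "T15"                   */
        int  lk2;        /* destination transputer link                    */
        int  wt;         /* cost: 1 for a hardwired link, 2 via a crossbar */
        int  xlen;       /* number of intermediate crossbars (0 or 1 here) */
        char xname[16];  /* crossbar name, e.g. "xb7", when xlen > 0       */
        int  xlk1, xlk2; /* crossbar data links used, when xlen > 0        */
    } Wire;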
3. System Graph
A system graph is a multigraph representing the interconnections of the transputer network. The system graph for the processors in Figure 4.3, minus Xanadu, is shown in Figure 4.4. Each node represents a transputer, and an edge between two nodes represents a physical link between two transputers. Each transputer has four links. If all four links of two transputers are connected to the same crossbar, then there are 16 different ways to connect them. Therefore, the system graph has up to 16 edges between two nodes. Each edge is associated with two data structures (as shown in Table 4.1) for its two unidirectional wires.

[Figure 4.4: System graph; e1 = {w1, w'1}, e2 = {w2, w'2}, e3 = {w3, w'3}]

The main advantage of using a system graph is that it hides the hardware-specific information. This graph can be used as an input graph to any mapping algorithm. This is evident from Figure 4.2, in that there is no direct edge from the hardware database to an algorithm module. A system graph is independent of a particular mapping instance (i.e., process graph). It can be constructed once and stored in the database for reuse, unless the hardware configuration or the requested resources change.

4. Conflict Graph
The system graph stores the communication paths between the nodes. There is, however, the potential for conflicts. A conflict graph is constructed from the system graph to encode the information about which pairs of system edges cannot be mapped simultaneously. In Figure 4.4, there are three edges between Ta and T15: e1, e2, and e3. Since e2 and e3 share link 3 of T15, only one of them can be used for the mapping. In general, two edges e = <n1, lk1, n2, lk2> and e' = <n1', lk1', n2', lk2'> conflict if and only if

    ((n1 = n1') ∨ (n2 = n2')) ∧ ((lk1 = lk1') ∨ (lk2 = lk2'))

A conflict graph for Figure 4.4 and its matrix representation are shown in Figure 4.5. The graph shows a conflict between e2 and e3, and the adjacency matrix has the value one in the corresponding row and column. Like a system graph, a conflict graph depends only on the hardware configuration and is independent of the mapping instance.

[Figure 4.5: Conflict graph for Figure 4.4 and its adjacency matrix]
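This conflict test translates directly into code. A minimal sketch, with an assumed edge representation:

    /* A system edge <n1, lk1, n2, lk2>; the struct name is illustrative. */
    typedef struct { int n1, lk1, n2, lk2; } SysEdge;

    /* Two edges conflict when they share an endpoint node and a link
       number, following the condition stated above. */
    int edges_conflict(SysEdge *e, SysEdge *f)
    {
        return ((e->n1 == f->n1) || (e->n2 == f->n2)) &&
               ((e->lk1 == f->lk1) || (e->lk2 == f->lk2));
    }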
5. Abstract Processor Graph
The abstract processor graph, or host graph, is a subgraph of the system graph. Different host graphs are constructed for different mapping instances. Before the mapping starts, there are as many nodes in the host graph as there are transputers in the component requested by the user. Since the crossbar switches are reconfigurable, their link information is not available until the mapping is finished. Hence only hardwired links (i.e., wires between two PEs without any intermediate crossbar switches) are added to the host graph initially. Once the mapping is done, physical routing takes place on the paths between every pair of transputers that are statically configured. Configuration information such as transputer links, crossbar switches, and their links is stored during routing. The transputers that are not mapped and are not connected to any mapped transputers are removed from the host graph. Thus, after the mapping is done, it is guaranteed that the host graph is a connected graph that embeds the process graph.

6. Process Graph
A parallel application is specified in the form of a process multigraph where each node represents a process and an edge between two nodes indicates a communication channel. It is assumed that communication channels are unidirectional, one-to-one, and named on each end; this name is called the "port name" of the channel. Thus a channel endpoint is defined by a (process, port name) tuple and a channel is an association of two such tuples. The type of communication (e.g., one-to-one, one-to-many, and many-to-one) is also characterized in this graph. An example of a process graph is shown in Figure 4.6.

[Figure 4.6: User Process Graph]

The above process graph has four processes and each process has two ports ("east" and "west") bound to it. These processes communicate on eight different channels, each given by a pair of such (process, port) endpoints. There is only one-to-one communication since no two channels have the same endpoints (i.e., process/port pairs).

7. Abstract Process Graph
The abstract process graph, or guest graph, is the same as the process graph shown in Figure 4.6 except that, if contraction occurs, each node of the guest graph denotes a virtual process (a virtual process is a process with more than one user process mapped onto it). An edge stores (logical) routing information. Once the mapping is done, logical routing determines the paths of any non-local communication, wherein the channels as given by the process graph are assigned to the physical links in the host graph. The contracted process graph for Figure 4.6 is shown in Figure 4.7. X and Y denote the virtual processes. Once the node assignment is done, processes A and C will be mapped to the same processor, and the same is true for processes B and D.

[Figure 4.7: Guest Graph]

8. Mapping Algorithm
Given a guest graph and a system graph, the mapping problem is reduced to embedding the guest graph onto the system graph. A one-to-one embedding of a guest graph onto a system graph is desirable. However, designing a generic algorithm for various topologies is intractable. Instead, different algorithms and heuristics for different topologies can generate good mappings in many cases [31]. The graph structures described above provide the basic objects for any mapping algorithm. These graphs hide the internal details of the parallel architecture and the parallel program structure. In §4.3.3, we present a greedy algorithm that employs heuristics to map several process graphs onto a configurable transputer system (§4.1).

9. Mapping State
Once the mapping is done, the information about the embedding, the logical routes between all pairs of guest nodes, and the wires connecting the mapped processors is stored in the database. The loader tool extracts this information from the database, configures the network of the host graph, and generates a script to boot the network and to load the application programs onto the transputers.
The mapping information is also used to measure the congestion² on the hardware channels, to identify the communication type for every edge in the guest graph (i.e., whether the communication occurs on local, neighbouring, or remote transputers), and to generate events for synchronization between the processes. This configuration information is then passed to the runtime system, where it is used to select the optimal Trollius functions (§5.3).

²Defined in §4.3.2.

4.3 Mapping Strategy

The mapping strategy is guided by constraints such as dilation, congestion, load, and expansion, and is aimed at minimizing the cost of communication. One can formulate these constraints in the form of objective functions. The mapping strategy then attempts to optimize these objective functions. We first present the mathematical formulation of the mapping problem, followed by the mapping constraints and the objective function, and finally the algorithm.

4.3.1 Mathematical Formulation

Let G = (V_G, E_G) be a guest graph to be mapped onto a target machine, where V_G = { v_i | v_i ∈ G } is a set of nodes representing the processes and E_G = { e_ij | e_ij = (v_i, v_j) and v_i, v_j ∈ V_G } is a set of bidirectional edges. The weight of an edge, w_ij, denotes the traffic intensity. The adjacency matrix for G is denoted by A_G.

Let S = (V_S, E_S) be a system graph representing a target architecture, where V_S = { v_i | v_i ∈ S } is a set of processors and E_S = { e_ij | e_ij = (v_i, v_j) and v_i, v_j ∈ V_S } is a set of physical links of the hardware. Let W_ij be the weight on edge e_ij ∈ E_S; it denotes the communication cost associated with the edge. A_S denotes the adjacency matrix for the system graph. A system path p_ij, defined for any two system nodes, corresponds to a sequence of system edges and is denoted as p_ij = (e_ik1, e_k1k2, ..., e_knj), where e_ik1, e_k1k2, ..., e_knj ∈ E_S. As explained in §4.2, there are conflicts, as given by the conflict graph, between many edges in the system graph. Γ denotes this conflict graph.

We assume that |V_G| ≤ |V_S|. G is the abstract process graph defined in §4.2. If there are more processes than processors, then the contraction is specified by the user and G is a contracted graph. The mapping problem has been reduced to an embedding of G into S. An embedding maps each vertex in V_G onto a vertex in V_S and each edge in E_G onto an edge (or edges) of E_S. The corresponding mapping function is denoted as f : V_G → V_S.

4.3.2 Mapping Constraints

The following cost metrics are considered to measure the quality of the mapping: load, congestion, dilation, and expansion. They are discussed in detail in [4]. The objective function for our optimization is derived from these measures.

• Load: Load is the number of nodes in a process graph that are embedded in a single node in the system graph. In our case, contraction is done manually through the Parsec interface and we assume that only one guest node (a virtual process representing one or more user processes) is embedded on a single system node.

• Congestion: Edge congestion is the congestion over a system edge and is defined as the number of guest edges mapped onto it.
The congestion of an embedding is the maximum edge congestion over all the system edges. When the congestion is greater than one, the system edge is required by more than one guest edge. Hence communication on these edges is delayed unless there are multiple system edges between the two nodes. Thus the communication cost increases with the congestion. The congestion over e_ij ∈ E_S is given as

    C(e_ij) = |E'_G|                                                  (4.1)

where E'_G = { e_mn | p_mn = (e_mi, ..., e_ij, ..., e_jn) for e_mn ∈ E_G and e_mi, e_ij, e_jn ∈ E_S }, such that f(v_m), f(v_n) ∈ V_S for all v_m, v_n ∈ V_G.

• Dilation: The dilation of a guest edge is the length of the path (a path consists of one or more system edges) that the guest edge is mapped onto. The dilation of an embedding is the maximum dilation over all the guest edges. When the dilation is greater than one, communication over the guest edge may be delayed because the message has to traverse multiple system edges. In addition, there may be congestion over some of the system edges. Thus, like congestion, dilation also adds to the communication cost. The dilation of a guest edge e_ij is given as the distance between f(v_i) and f(v_j). We denote this distance as D_S(f(v_i), f(v_j)).

• Expansion: Expansion is the ratio of the number of nodes in a system graph to the number of nodes in a guest graph. The expansion is related to the dilation of an embedding, and optimizing one factor can adversely affect the other [4]. The expansion of the embedding does not directly affect the cost of communication. Our mapping strategy attempts to minimize only the dilation and the congestion. In Parsec, the expansion is determined by the user. The number of nodes in a guest graph depends on the contraction of a process graph, which is specified by the user, whereas the nodes in the system graph are limited by the size of the component. However, once a component is allocated, the mapping tool will use as many transputers from that component as required.

Given that the entire communication takes place in a single phase (a phase is the time interval for the communication to take place over a single guest edge), if the edges in a system graph and a guest graph have the same weights, then the total communication cost is the same as the maximum dilation or congestion. However, if the weights on the edges are different, then these weights should be taken into account while measuring the communication overhead. The weight on a guest edge indicates the traffic intensity; for example, if one edge is used more frequently than another edge, then it carries more weight. In the system graph, the edges representing crossbar links and direct links have different weights, reflecting the different bandwidths of the two. With these different weights, the congestion over a system edge e_ij is measured as the sum of the weights w_mn of all the guest edges that are mapped onto it:

    C(e_ij) = Σ_{e_mn ∈ E'_G} w_mn                                    (4.2)

where E'_G is defined in equation 4.1.
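A minimal sketch of equation 4.2 in code: given the route assigned to each guest edge, the weighted congestion of one system edge sums the weights of every guest edge routed through it. The types and names are illustrative, not taken from the mapping tool.

    /* Route of one guest edge: the indices of the system edges it traverses. */
    typedef struct { int len; int edge[16]; } Route;

    double congestion(int sys_edge, Route route[], double w[], int n_guest)
    {
        int m, k;
        double c = 0.0;

        for (m = 0; m < n_guest; m++)           /* every guest edge ...       */
            for (k = 0; k < route[m].len; k++)  /* ... and every hop it takes */
                if (route[m].edge[k] == sys_edge)
                    c += w[m];                  /* guest edge m uses sys_edge */
        return c;
    }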
We call it an \"embedding amount\" of the mapping function / and define Chapter 4. Mapping Strategy on Multicomputer 50 it as d i l -one=^ Y, AQ{ {} do 6 forVg in UnMappedVg do 7 forVs in UnUsedVs do 8 Load := GetLoad(Vs); 9 end do 10 BestVh = GetBestNode(Load, UnUsedVs); 11 Delete Vg from UnMappedVg; 12 Delete Vs from UnUsedVs; 13 end do 14 end do end Chapter 6. Programming A Network of Workstations 82 The various aspects of this algorithms are discussed below: 1. This is a static load balancing algorithm. A l l the processes are assigned to the pro-cessors before the job is started. The assignment decision is made deterministically and takes into account the average behaviour of the system. 2. GetLoad function (Line 8) computes the load on each machine. The load index used to measure the load is the load average which gives the average number of processes in a ready queue. The R P C function rstat is used to get the kernel statistics. The statistics include the number of processes in the ready queue averaged over 1, 5 and 15 minutes. The weighted average of these three averages is used as the load index. 3. One cannot isolate the physical characteristics of a machine (e.g., C P U speed, memory size) with its. dynamically changing load. For example, when the load on two machines varies only slightly, a machine with higher C P U speed may process the job faster even if it has a higher load. In general, this is true but it is difficult to identify the acceptable load variations in order to apply this heuristics. It largely depends on the processor speed of the individualmachine as well as the job itself. However, we have divided the load into seven different ranges. The load thresholds for each range are: [0.000, 0.009], [0.0091, 0.001], [0.011, 0.09], [0.091, 0.1], [0.101, 0.5], [0.501, 1.0] and [loads greater than 1]. If two machines with different load fall within the same range then the machine with higher C P U speed, and large memory is selected for the node assignment. Given a set of machines and load on each machine, GetBestNode function (Line 10) finds the suitable machine as described above. In summary, the Parsec mapping tool for the workstation environment does a simple Chapter 6. Programming A Network of Workstations 83 processor allocation according to the load status. It differs from the programming envi-ronments which are primarily designed for networked workstations. In Framework [47], the user can specify the mapping constraints completely or partially using a graphical interface or may leave processor allocation entirely up to the system. An extensive load sharing facility is provided by Utopia [41] for a large heterogeneous distributed systems. Different algorithms are implemented for managing resource load information, for task placement, and for remote task execution. Some of the load indices used by Utopia are C P U speed, memory, disk I /O, and login sessions. A number of applications have been developed such as a load sharing command interpreter, a distributed batch facility, and a parallel make. The main emphasis of Utopia is on distributed computing, and it uses general purpose communication libraries (such as P V M ) for message passing between the processes in a parallel application. In other programming environments like P V M and Topsys, processor allocation is left to the user. This requires a detailed knowledge of the available resources, and their loads. Parsec provides a simple tool to relieve the user from this difficult task. 
In summary, the Parsec mapping tool for the workstation environment does a simple processor allocation according to the load status. It differs from the programming environments which are primarily designed for networked workstations. In Framework [47], the user can specify the mapping constraints completely or partially using a graphical interface, or may leave processor allocation entirely up to the system. An extensive load sharing facility is provided by Utopia [41] for large heterogeneous distributed systems. Different algorithms are implemented for managing resource load information, for task placement, and for remote task execution. Some of the load indices used by Utopia are CPU speed, memory, disk I/O, and login sessions. A number of applications have been developed on it, such as a load sharing command interpreter, a distributed batch facility, and a parallel make. The main emphasis of Utopia is on distributed computing, and it uses general-purpose communication libraries (such as PVM) for message passing between the processes in a parallel application. In other programming environments like PVM and Topsys, processor allocation is left to the user. This requires a detailed knowledge of the available resources and their loads. Parsec provides a simple tool to relieve the user from this difficult task.

6.3 Runtime System

All participating Sun Sparc machines run under SunOS release 4.1.x. SunOS supports two types of IPC mechanisms: socket-based IPC and Remote Procedure Calls (RPC). Socket-based IPC provides the lowest-level communication primitives. A socket is an endpoint for communication (at the transport level) and provides a descriptor for I/O over the network [6]. There are several types of sockets, such as stream, datagram, raw, and sequenced-packet sockets, and each provides a different service. For example, stream sockets provide sequenced, reliable, two-way connections based on byte streams. An explicit connection is set up between two processes prior to any message passing. Datagram sockets support connectionless, unreliable communication; the user must check for flow control, sequencing, and length mismatches. Since it is a connectionless mechanism, the peer's address is required for all send and receive operations. System calls are provided to control the socket operations for buffering, blocking/non-blocking behaviour, out-of-band message passing, etc.

RPCs are a communication paradigm built on top of low-level networking mechanisms like TCP/IP and UDP/IP. They present a client-server model: a client makes a procedure call to send a data packet to the server; when the packet arrives, the server calls a dispatch routine, performs the requested service, and sends back the response, after which the procedure call returns to the client. RPC services can be accessed at three different layers, with the highest layer entirely transparent to the operating system and the lowest layer directly using the network services. However, RPC is not an attractive choice for group communication as it cannot handle communication from one sender to multiple receivers [15]. We have chosen the socket primitives over the RPC mechanism as they are the most efficient and they are available in all Unix systems. This decision is critical to the future development of Parsec on heterogeneous machines.

6.3.1 Internal Features

As discussed in §2.5, the user specifies parallelism explicitly through the process, port, and channel objects. A channel specifies a communication path between any two processes. Ports are bound to each process and thus define the endpoints of a channel. Thus ports are high-level abstractions of the sockets and encapsulate the configuration information generated by the mapping tool. The configuration information includes the event and the physical address of a machine. The port primitives are implemented for reliable communication using TCP sockets.

For TCP sockets, a two-way connection must be established between every pair of communicating processes. The initialization phase sets up the connection between two processes A and B as follows. One process initiates a connection to the other on its known network port address; both processes are required to know the network port address in advance. On the transputer system, the events (positive integers) generated by the mapping tool are mapped onto the Trollius events. Similarly, in the workstation environment, the events are mapped onto the network port addresses used for naming the sockets. However, it is mandatory that only process A (B) initiate the connection and process B (A) accept it. Failing this requirement results in a deadlock, since both A and B would be waiting to accept a connection from each other.
The runtime system avoids this deadlock by having the mapping tool generate a unique event (network port address) for every process. Process A will either initiate a connection with a neighbour or accept one, depending on whether its event is less than or greater than its neighbour's event. In the case of more than one neighbour, the neighbours are sorted in ascending order of their events and a connection is then established with each neighbour sequentially in that order. The function init_ports performs this task. It must be called by each process prior to any message passing. Since the system calls accept and connect, used by init_ports, are blocking, init_ports returns successfully once the TCP connections are established.
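The ordering rule might be sketched as follows. This is a minimal sketch: the helper routines are assumptions, as is the direction of the comparison (here the smaller event accepts); accept and connect are the blocking BSD socket calls named above.

    /* Deadlock-free connection setup: neighbours are handled in ascending
       order of their events, and the comparison of events decides who
       accepts and who connects. Helper names are illustrative. */
    void connect_neighbours(int my_event, int neighbour_event[], int n)
    {
        int i;

        sort_ascending(neighbour_event, n);       /* assumed helper */

        for (i = 0; i < n; i++) {
            if (my_event < neighbour_event[i])
                accept_from(neighbour_event[i]);  /* accept() on our socket,
                                                     named by my_event        */
            else
                connect_to(neighbour_event[i]);   /* connect() to the peer's
                                                     known port address       */
        }
    }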
Once the initialization phase is over, port_send (the write system call) and port_recv (the read system call) are used for message passing. Currently, the sockets operate only in blocking mode and thus support only synchronous communication. With the TCP implementation, the kernel tries to batch up requests by waiting for some minimum amount of data to communicate. We have used the TCP_NODELAY option to avoid this stream-oriented batching and thus reduce the overhead. Scatter and gather communication is simulated on top of basic point-to-point communication. Once the communication is over, all the socket descriptors are closed by calling exit_ports before exiting the process. In the next section the performance of the runtime system is measured and compared with PVM3 [42].

6.4 Performance Analysis

Two Node Test

This example measures the communication time between two Sun Sparc workstations in a local area network using Parsec. Both machines run a network file system and are part of the Department's shared computing environment. A message of the desired size is sent from one node to the other and received back. The round-trip communication time is measured for a large number of iterations. The same experiment is repeated using PVM3, and the communication times of Parsec and PVM3 are compared.

PVM3 (release 3.2.6) supports two types of communication: TCP sockets between every two communicating processes, and local daemons which communicate using UDP sockets. The default is daemon-mediated communication; the TCP communication is set up by a pvm_advise(PvmRouteDirect) call prior to communicating. Both types are used in the experiment. The following tables show round-trip communication times for different message sizes ranging from 100 bytes to 1 megabyte, using 100 iterations.

Msg. Size    Same Subnet                          Diff Subnet
(Bytes)      Time (in Sec)   Rate (MBytes/Sec)    Time (in Sec)   Rate (MBytes/Sec)
100          0.208076        0.048059             0.414514        0.024125
400          0.279717        0.143002             0.527060        0.075893
1000         0.382999        0.261097             0.782206        0.127844
4000         0.948864        0.421557             1.419315        0.281826
10000        2.369537        0.422024             3.835125        0.260748
40000        8.879566        0.450472             14.671687       0.272634
100000       22.012445       0.454288             35.786808       0.279433
400000       88.589821       0.451519             137.850052      0.029017
1000000      222.976151      0.448478             342.420776      0.292038

Table 6.1: Parsec: TCP communication

Msg. Size    Same Subnet                          Diff Subnet
(Bytes)      Time (in Sec)   Rate (MBytes/Sec)    Time (in Sec)   Rate (MBytes/Sec)
100          1.265465        0.007902             1.449805        0.006897
400          1.434604        0.027882             1.677073        0.023851
1000         1.572177        0.063606             1.914482        0.052233
4000         2.998988        0.133378             3.697648        0.108177
10000        6.421205        0.155734             8.436946        0.118526
40000        20.890930       0.191471             27.349926       0.146253
100000       50.802238       0.196842             67.533371       0.148075
400000       200.089066      0.199911             262.817474      0.152197
1000000      493.379272      0.202684             639.039490      0.156485

Table 6.2: PVM3: daemon-mediated UDP communication

Msg. Size    Same Subnet                          Diff Subnet
(Bytes)      Time (in Sec)   Rate (MBytes/Sec)    Time (in Sec)   Rate (MBytes/Sec)
100          0.487040        0.020532             0.640915        0.015603
400          0.565463        0.070738             0.787222        0.050812
1000         0.775981        0.128869             1.026842        0.097386
4000         1.343468        0.297733             1.883596        0.212360
10000        2.814400        0.355316             3.505754        0.285245
40000        9.983309        0.400669             11.321166       0.353321
100000       27.766409       0.360147             27.465364       0.364095
400000       106.868057      0.374293             112.383034      0.355926
1000000      267.402222      0.373968             287.563660      0.347749

Table 6.3: PVM3: TCP communication

Parsec's communication times are consistently faster than both of the PVM3 options for all message sizes on the same subnet. In the case of daemon-mediated UDP communication in PVM3, each message gets copied twice, first at a daemon on the source node and again at a daemon on the destination node, before it is delivered to the destination process. This overhead is considerable for a communication-bound example like the one presented here. For the TCP sockets, there is the cost of establishing a TCP connection between every pair of processes. However, this cost is amortized when the total number of communications is large. In PVM3, two processes establish a TCP connection when they first communicate, which makes the first communication significantly costlier than subsequent ones. In Parsec, the start-up cost is hidden in port initialization, as all the TCP connections are established prior to any communication. PVM3 also differs from Parsec in that the communication buffers must be explicitly packed and unpacked; this cost is also included in the round-trip communication time.

In all of the above cases, the transfer rate does not increase for large messages (of the order of hundreds of kilobytes). This is attributed to the greater packetization cost and the kernel overheads. For different subnets, Parsec is slower than PVM3 (the TCP version) when the message size is greater than 10000 bytes.
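The round-trip measurement loop can be sketched as follows. This is a minimal sketch of the master side; the echoing peer simply does port_recv followed by port_send. The port primitives are those of §3.3.2, while the iteration count and the use of gettimeofday are assumptions:

    #include <sys/time.h>
    #include "parsec.h"          /* assumed header for the port primitives */

    #define ITERS 100

    double round_trip_seconds(Port *peer, char *buf, int size)
    {
        struct timeval t0, t1;
        int i, rsize;

        gettimeofday(&t0, NULL);
        for (i = 0; i < ITERS; i++) {
            port_send(peer, buf, size, RAW);    /* send the message ...    */
            port_recv(peer, buf, &rsize, RAW);  /* ... and receive it back */
        }
        gettimeofday(&t1, NULL);

        return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    }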
In the following section, an example is presented to show how program development is done in Parsec using the mapping tool and the runtime system.

6.5 Program Development in Parsec

This section describes the Parsec GUI required to create a parallel program structure and to run the mapping tool and the loader tool. The application, called Spring, takes as input any graph and "springs" it to output a new graph (hence the name) which retains the node connectivity of the input graph. The example runs on a ring of eight nodes. One of the nodes, called the root node, is connected to a node outside the ring, called the host node. A worker process runs on each node in the ring and the host process runs on the host node. The different phases of the processing are:

• Initially, the host process reads a graph (its vertices and edges) from a file and displays it on the GUI. It assigns the vertices to each node in the ring. The graph is then sent to the worker process on the root node. The other information sent to the root node includes the simulation parameters, the number of iterations (N) for simulation and communication, and the vertex assignments. The host process then waits for the result.

• The root node passes the entire data to the next node in the ring, and it is then propagated to all the nodes.

• Upon receiving the data, each node carries out simulation, followed by communication (data exchange with the neighbours). During simulation, each node computes the new positions of the vertices assigned to it (based on the simulation parameters). The new positions are sent to the adjacent nodes during communication. Both steps are repeated N times.

• Finally, the new position of each vertex is sent to the host node, where the new graph is displayed in the GUI.

Figure 6.1 shows the process graph (an eight-node ring) for the Spring application. A label associated with each node (or process) indicates the process type. All nodes of type Worker and Last Worker have two ports bound to them: "Prev" and "Next". The type First Worker has an extra port "Host" bound to port "Root" of the Interface process type. The details of how to create this parameterized process graph are beyond the scope of this discussion and are contained in [32]. The software and hardware environments can be set up as shown in the Project Information window.

[Figure 6.1: Process Graph for Spring in Parsec]

Once the process graph is defined and code is linked with each process type, the following steps are carried out:

1. The code is compiled and executable images are produced using the Build Files and Make Binaries options under the Tasks choice in the Application menu (refer to the Project Tasks window in Figure 6.1).

2. The process graph is mapped onto the transputer network by invoking the mapping tool (Figure 6.2) from the Map Structure option in the Project Tasks window.

(a) As previously discussed in Chapter 4, the mapping tool can be run in two modes: Normal and No Backtracking (or Quick). For some process graphs, the time required for the mapping to complete can be unpredictable. Therefore, the user can set a time limit (in seconds) for the mapping tool to run before it terminates, with or without a solution. The default time is zero and is ignored by the mapping tool; in this case, the mapping tool continues to run until it finds a solution. In the No Backtracking mode, the time limit is ignored and a solution (though not necessarily an optimal one) is guaranteed.

(b) The Network Root process is the process that is connected to the Host process. If this option is not set, the mapping tool will select the node having minimum degree from the process graph.

(c) The contraction is done manually and the mapping tool is notified by setting the flag permit contraction (this is necessary since the configuration information for the contracted processes is generated by the tool).

(d) The transputer network is divided into four components, with each component having one port connected to the host machine. The number beneath each component shows the total number of transputers in the component. In this example, component 1 is selected.

(e) Once these options are set, the mapping process is started by pressing the Map button. The mapping results (node assignments and edge assignments) appear in the output window.
[Figure 6.2: Graphical Interface for Mapping Tool in Parsec]

3. Once the mapping is done, the loader tool generates a file to configure the transputer network (the boot schema file) and to load Trollius, and it also generates a script file to load the user processes onto the hardware and to pass the configuration information to the runtime system. The loader tool (Figure 6.3) is invoked from the Load option in Figure 6.1.

[Figure 6.3: Graphical Interface for Loading Tool in Parsec]

Building the makefiles and executable files (Step 1) is independent of the mapping and loading process (Steps 2 and 3). However, the loader must be invoked after the mapping is done, as the configuration information generated by the mapping tool is utilized by the loader. The code fragments for each process type are shown below. The main objective is to show how the Parsec primitives are used in the program; hence the computation part is omitted from these fragments.
The following code, which runs on each node in the ring, uses the conditional macros ROOT and LAST to differentiate between the three process types: Worker, First Worker, and Last Worker.

    Param p;                 /* data structure to send information between nodes */
    Port Next, Prev;
    #ifdef ROOT              /* macro for process type First Worker */
    Port Host;
    #define in Host
    #else
    #define in Prev
    #endif

    main (int argc, char **argv)
    {
        int i, n;            /* n = number of iterations, received with the parameters */

    #ifdef ROOT
        init_ports (argc, argv, &Prev, &Next, &Host, NULL);
    #else
        init_ports (argc, argv, &Prev, &Next, NULL);
    #endif
        while (TRUE) {
            set_initial_data ();        /* receive parameters from host/previous node */
            for (i = 0; i < n; i++) {
                simulate ();
                communicate ();         /* send/receive data from adjacent nodes */
            }
    #ifdef ROOT
            send_final_position ();     /* send result to the host node */
    #endif
        }
    }

    void set_initial_data ()
    {
        port_recv (&in, &p, NULL, PARAM);
    #ifndef LAST             /* LAST is the macro for process type Last Worker */
        port_send (&Next, &p, NULL, PARAM);
    #endif
        /* initialize the adjacency matrix */
    }

    void communicate ()
    {
        pmsg sp;             /* data structure to send vertices */

        /* Odd nodes receive first and even nodes send first, so that the
           exchanges around the ring pair up without deadlock. */
        if (myid % 2) {
            port_recv (&Prev, &sp, 0, POS);
            port_send (&Next, &sp, 0, POS);
        } else {
            port_send (&Next, &sp, 0, POS);
            port_recv (&Prev, &sp, 0, POS);
        }
    }

The code for the process type Interface is given below. This process runs on the host node.

    Param p;
    Port Root;

    main (int argc, char **argv)
    {
        init_ports (argc, argv, &Root, NULL);
        /* GUI part to display the input graph */
        /* read edges and vertices from a file */
        send_parameter ();              /* send data to the root node */
        receive_final_position ();      /* receive new graph info from root */
        /* send new data to the GUI to display the new graph */
    }

    void send_parameter ()
    {
        /* initialize p */
        port_send (&Root, &p, NULL, PARAM);
    }

    void receive_final_position ()
    {
        pmsg sp;
        port_recv (&Root, &sp, NULL, PARAM);
    }

In order to run the same application on a network of Sun Sparc workstations, the hardware and software environments are changed from the Project Information window (Figure 6.1) and the application is built again. Depending on the target environment, Parsec uses different system libraries and macros during compilation and linking. In Figure 6.2, the Host system is the workstation which is assigned as standard I/O for the application. All other options remain the same. The mapping tool assigns processes to different workstations and the loader tool then generates a script to load processes onto those workstations. The application was tested in both environments. The performance of the application in the two environments is shown in Table 6.4.
A graph-oriented mapping strategy is proposed in that a heuris-tics controlled greedy algorithm is used to map various process graphs. The algorithm attempts to maximize the number of dilation one edges. The mapping tool is a static compilation tool that converts a parallel program structure into detailed platform-specific configuration information. Once the mapping is done, the logical routes are analyzed to identify the communication characteristics such as the multiplexing over physical links and the scope of each communication binding. The configuration information is then made available to the runtime system. 97 Chapter 7. Conclusions 98 The Parsec runtime system on the transputer system is built on top of Trollius. Trol-lius offers numerous communication primitives of varying functionality and performance. Depending on the configuration information, the runtime system selects the most opti-mal Trollius primitives. If the communication is neighbour-to-neighbour or local and if the physical links are non-multiplexed, then the runtime system uses the most efficient hardware-level communication primitives. Parsec provides two easy-to-use communi-cation primitives: portsend and portjrecv. They support point-to-point and collective communication, employing both synchronous and buffered semantics. The Parsec primitives are compared with raw Trollius functions. At the network level their performance is comparable to that of Trollius functions. However, at hardware-level the Parsec primitives add extra overhead but relieve the user from setting up virtual circuits. Both the mapping tool and the runtime system were ported to a network of Sun Sparc workstations to provide location transparency.. A static load balancing algorithm has been implemented to allocate processes to processors such that the load is evenly balanced. The runtime system uses reliable T C P communication. The Parsec primitives are compared with the P V M 3 primitives. The Parsec timings were consistently faster (15% to 100% faster depending on the message size) than the P V M 3 timings except when the machines were on different subnets and the messages were large (of the order of tens of kilobyte). Finally, an example is presented to demonstrate how a parallel application is developed on a transputer network, and moved to a workstation environment without changing the code. Chapter 7. Conclusions 99 7.1 Future W o r k The work can be extended in each of the three areas: mapping problem, runtime system, and tools for different hardware-platforms. It is unlikely that an efficient algorithm will be found for the general mapping problem. Hence, different algorithms and heuristics can be employed for different topologies. A basic framework for the mapping process has been provided. Various algorithms can be implemented using this framework. By giving different choices for mapping, users can select the best achievable mapping for a wider class of topologies. Our transputer system has programmable crossbars switches in that multiple crossbar can be connected to one another giving better connectivity and thus increasing the range of topologies that can be mapped onto the system. The graph structures presented in Chapter 4 are used for a single crossbar connection between any two transputers. The algorithm can be modified to accommodate multiple crossbar architecture. This requires reconfiguration of the transputer network but has a potential for better mapping results. 
Our work can be extended to provide heterogeneity at the architectural level in a workstation environment. Currently our tools are ported to Sun Sparc workstations, and they can easily accommodate other platforms such as HPs, DECs, SGIs, and RS6000s. Since different machines are best suited to different types of applications, incorporating them in Parsec can make it an attractive choice for application programmers in various areas including graphics, image processing, numerical analysis, and scientific computing. Developing parallel applications in these areas will demonstrate the usefulness of our system.
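To close the chapter, the following sketch restates the runtime system's central optimization in code form, as promised earlier. It is schematic: the PortInfo fields and the two stub transports stand in for the statically compiled configuration information and the real Trollius calls, whose actual interfaces differ. The point is that the transport decision is made once, from the mapping tool's output, so individual sends pay no selection cost.

    #include <stdio.h>

    typedef enum { HW_LEVEL, NET_LEVEL } Transport;

    /* Hypothetical per-port record derived from the configuration
       information emitted by the mapping tool. */
    typedef struct {
        int neighbour;    /* 1 if the peer process is on an adjacent node */
        int multiplexed;  /* 1 if the physical link carries several
                             logical channels                             */
        int link;         /* hardware link number when HW_LEVEL applies   */
        Transport transport;
    } PortInfo;

    /* Stand-ins for hardware-level and network-level primitives. */
    static int hw_link_send (int link, void *buf, int len)
    {
        (void) buf;
        printf ("raw transfer on link %d, %d bytes\n", link, len);
        return 0;
    }

    static int net_send (void *buf, int len)
    {
        (void) buf;
        printf ("routed, buffered transfer, %d bytes\n", len);
        return 0;
    }

    /* Bind the port to the cheapest adequate transport, once. */
    void bind_port (PortInfo *pi)
    {
        if (pi->neighbour && !pi->multiplexed)
            pi->transport = HW_LEVEL;   /* direct link, no routing layer */
        else
            pi->transport = NET_LEVEL;  /* needs routing or flow control */
    }

    int send_on_port (PortInfo *pi, void *buf, int len)
    {
        return pi->transport == HW_LEVEL
             ? hw_link_send (pi->link, buf, len)
             : net_send (buf, len);
    }

    int main (void)
    {
        char msg[64];
        PortInfo local = { 1, 0,  2, NET_LEVEL };  /* adjacent, unshared link */
        PortInfo far   = { 0, 1, -1, NET_LEVEL };  /* remote, multiplexed     */

        bind_port (&local);
        bind_port (&far);
        send_on_port (&local, msg, sizeof msg);
        send_on_port (&far, msg, sizeof msg);
        return 0;
    }

A receive counterpart would make the symmetric choice; in the real library the decision also covers the virtual circuits that hardware-level transfers require.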
Appendix A

Transputer System Layout

The transputer system in the Department of Computer Science at UBC is shown below.
There are four components residing in five physical boxes. Each component has one transputer connected to Xanadu, the host machine, on a data port. The transputers are either hardwired to one another or are connected to the crossbars. Figure A.6 shows the entire system as an 8x9 mesh with two extra nodes (Ta and Tb). Solid lines are hard links between two transputers and dotted lines indicate intermediate crossbar connections.

    Figure A.1: Component 1 (Box 1) [wiring diagram omitted]
    Figure A.2: Component 0 (Box 2) [wiring diagram omitted]
    Figure A.3: Component 3 (Box 3) [wiring diagram omitted]
    Figure A.4: Component 2 (Box 4) [wiring diagram omitted]
    Figure A.5: Component 2 (Box 5) [wiring diagram omitted]
    Figure A.6: Transputer System as an 8x9 mesh [diagram omitted]
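Although the diagrams themselves do not survive text extraction, the information they carry is just connectivity, which a graph structure can capture with a link type per edge. The sketch below shows one plausible encoding; the names are hypothetical and the two entries are illustrative examples of the two kinds of connection described above (hard links and crossbar links), not a transcription of the figures.

    #include <stdio.h>

    /* Hypothetical edge record for the system layout: a hard link is
       wired directly between two transputer link engines; a crossbar
       link is routed through one of the programmable crossbars. */
    typedef enum { HARD_LINK, XBAR_LINK } LinkKind;

    typedef struct {
        int from, to;            /* transputer numbers, e.g. 64 for T64 */
        int from_port, to_port;  /* link engine 0..3 on each end        */
        LinkKind kind;
        int xbar;                /* crossbar id for XBAR_LINK, else -1  */
    } Edge;

    int main (void)
    {
        /* illustrative entries only; not read from the actual figures */
        Edge edges[] = {
            { 64, 65, 3, 0, HARD_LINK, -1 },
            { 64, 70, 1, 1, XBAR_LINK,  6 },
        };
        int i, n = sizeof edges / sizeof edges[0];

        for (i = 0; i < n; i++)
            printf ("T%d:%d -- T%d:%d via %s\n",
                    edges[i].from, edges[i].from_port,
                    edges[i].to,   edges[i].to_port,
                    edges[i].kind == HARD_LINK ? "hard link" : "crossbar");
        return 0;
    }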